mcanouil / eggla

Early Growth Genetics Longitudinal Analysis
https://m.canouil.dev/eggla/
Other
2 stars 1 forks source link

Error in run_eggla_gwas when use_info=TRUE #80

Closed annihei closed 2 years ago

annihei commented 2 years ago

Bug description

I'm trying run_eggla_gwas again, and managed to run it with argument use_info=FALSE. With the following command

run_eggla_gwas(
  data = "selected_cleaned_PCs_NFBC1966.csv",
  results = results_archives,
  id_column = "ID",
  traits = c("slope_.*", "auc_.*"),
  covariates = c("sex", "PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
    path = gen_file_path,
    pattern = "\\.vcf$|\\.vcf.gz$",
    full.names = TRUE
  ),
  working_directory = work_dir_path,
  use_info = TRUE,
  build = "37",
  strand = "+",
  vep = NULL,
  bin_path = list(
    bcftools = "/usr/local/bin/bcftools",
    plink2 = "/usr/local/bin/plink2"
  ),
  threads = 1
)

I get error and message: Formatting VCFs ... [chr1.dose.vcf.gz] Performing PLINK2 regression ... [chr1.dose.vcf.gz] Extracting INFO ... [chr1.dose.vcf.gz] Combining results files ... [chr1.dose.vcf.gz] Writing annotated results file ... Error in colnamesInt(x, neworder, check_dups = FALSE) : argument specifying columns specify non existing column(s): cols[21]='INFO' Calls: run_eggla_gwas ... resolve.list -> signalConditionsASAP -> signalConditions Execution halted srun: error: localhost: task 0: Exited with exit code 1

Is there something that is missing in the arguments, or in my vcf files? I don't catch where it's coming from.

$ bcftools view -h "genetic_files/chr20.dose.vcf.gz"
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=GENOTYPED,Description="Site was genotyped">
##FILTER=<ID=GENOTYPED_ONLY,Description="Site was genotyped only">
##filedate=2016.10.6
##source=Minimac3
##contig=<ID=20>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##bcftools_viewVersion=1.4+htslib-1.4
##bcftools_viewCommand=view -h chr20.dose.vcf.gz; Date=Thu May 17 10:11:34 2018
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view -h genetic_files/chr20.dose.vcf.gz; Date=Mon Oct 24 10:39:04 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ...

eggla version output

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Locale:
  LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

Package version:
  backports_1.4.1      bayestestR_0.13.0    beeswarm_0.4.0      
  BH_1.75.0.0          bit_4.0.4            bit64_4.0.5         
  broom_1.0.1          broom.mixed_0.2.9.4  cli_3.4.1           
  clipr_0.8.0          coda_0.19.4          codetools_0.2.18    
  colorspace_2.0.3     compiler_4.2.1       cpp11_0.4.3         
  crayon_1.5.2         curl_4.3.3           data.table_1.14.4   
  datawizard_0.6.3     digest_0.6.30        distributional_0.3.1
  doParallel_1.0.17    dplyr_1.0.10         eggla_0.17.3        
  ellipsis_0.3.2       fansi_1.0.3          farver_2.1.1        
  forcats_0.5.2        foreach_1.5.2        furrr_0.3.1         
  future_1.28.0        generics_0.1.3       ggbeeswarm_0.6.0    
  ggdist_3.2.0         ggplot2_3.3.6        ggtext_0.1.2        
  globals_0.16.1       glue_1.6.2           graphics_4.2.1      
  grDevices_4.2.1      grid_4.2.1           gridtext_0.1.5      
  growthcleanr_2.0.0   gtable_0.3.1         haven_2.5.1         
  HDInterval_0.2.2     hms_1.1.2            insight_0.18.6      
  isoband_0.2.6        iterators_1.0.14     jpeg_0.1.9          
  labeling_0.4.2       labelled_2.10.0      lattice_0.20.44     
  lifecycle_1.0.3      listenv_0.8.0        magrittr_2.0.3      
  markdown_1.2         MASS_7.3.53          Matrix_1.3.3        
  methods_4.2.1        mgcv_1.8.35          mime_0.12           
  munsell_0.5.0        nlme_3.1.152         numDeriv_2016.8.1.1 
  parallel_4.2.1       parallelly_1.32.1    patchwork_1.1.2     
  performance_0.10.0   pillar_1.8.1         pkgconfig_2.0.3     
  plyr_1.8.7           png_0.1.7            prettyunits_1.1.1   
  progress_1.2.2       purrr_0.3.5          R6_2.5.1            
  RColorBrewer_1.1.3   Rcpp_1.0.9           readr_2.1.3         
  rlang_1.0.6          scales_1.2.1         splines_4.2.1       
  stats_4.2.1          stringi_1.7.8        stringr_1.4.1       
  tibble_3.1.8         tidyr_1.2.1          tidyselect_1.2.0    
  tools_4.2.1          tzdb_0.3.0           utf8_1.2.2          
  utils_4.2.1          vctrs_0.5.0          vipor_0.4.5         
  viridisLite_0.4.1    vroom_1.6.0          withr_2.5.0         
  xfun_0.34            xml2_1.3.3          

Checklist

mcanouil commented 2 years ago

I hardcoded the imputation quality field in INFO as "INFO" while it could also be "R2". I'll release a patch today.

mcanouil commented 2 years ago

This should be fixed now, can you try to install the development version?

remotes::install_github("mcanouil/eggla")

PS: do not forget to provide the proper information for the imputation quality metrics by changing the default value for info_type, i.e., info_type = "IMPUTE2 info score via 'bcftools +impute-info'", in your case it's something about "R2".

annihei commented 2 years ago

Thanks, got the 0.17.4 and running the analysis now.

mcanouil commented 2 years ago

Ok, let me know if there is still an issue or if it goes smoothly, and in that case I'll release v0.17.4..

annihei commented 2 years ago

Will do, with use_info=TRUE it took about 2,5 days to finish.

mcanouil commented 2 years ago

I'll keep the issue open until you can confirmed it is fixed then.

annihei commented 2 years ago

This worked! I quickly looked at the results, seems like everything else is OK I just need to add CALL_RATE.

mcanouil commented 2 years ago

Great! I'll start the release process for v0.17.4.

I don't think the CALL_RATE is really needed though. In theory, your data have been QCed, this implies a call rate threshold, thus the column will be between the threshold and one, which makes it not really useful for any postprocessing step.