saigegit / SAIGE

Development for SAIGE and SAIGE-GENE(+)
GNU General Public License v3.0
65 stars 28 forks source link

Step2: Error in setVCFobjInCPP #30

Open SheaCheng2000 opened 2 years ago

SheaCheng2000 commented 2 years ago

Hi,

I am using set-based test for burden analysis (and I also use the group file with weights).

At step2 this error happened:

Error in setVCFobjInCPP(vcfFile, vcfFileIndex, vcfField, t_SampleInModel = sampleInModel) : At least one subject requested is not in VCF file. Calls: SPAGMMATtest -> setGenoInput -> setVCFobjInCPP Execution halted

It seems that my VCF is not in correct format? But I have checked the VCF file and it seems to be identical with the example. I really don't know what the problem is.

The log file and my VCF file example are provided here.

step2_input.vcf.txt step2.log

Looking forward for your reply!

weizhouUMICH commented 2 years ago

Hi, It means that some samples used in step 1 are not in the VCF file. Can you have a check?

Thanks, Wei

SheaCheng2000 commented 2 years ago

That's exactly the problem! Thank you so much! Although no more errors are reported, there is a new problem--the log of step2 shows lines of:

indexChunk is 0 indexChunk 0 nRegions 18426 Read in 100 region(s) from the group file. The chunk is empty Read in 100 region(s) from the group file. The chunk is empty ... Read in 100 region(s) from the group file. The chunk is empty [1] "Analysis done! ...

and the output files are empty.

My group file does not contain all variants in the VCF file, and I am wondering if that's the reason?

Thanks again!

weizhouUMICH commented 2 years ago

Hi, Are the marker IDs in the group file in the format chr:pos:ref:alt? for VCF files, has --chrom been specified to be exactly the same as the chromosome string in the files. Thanks, Wei

SheaCheng2000 commented 2 years ago

Hi, My marker IDs in the previous group file are in the format of chr:pos_ref/alt, which is the same as example file "group_new_chrposa1a2_withWeights.txt". I have changed it into the format of chr:pos:ref:alt. groupfile.txt

For --chrom, I use strings like "chr1,chr2...,chr22", which is consistent with the format of my VCF file.

However, it shows the same warning ("The chunk is empty").

Here are my command lines:

Rscript step2_SPAtests.R \ --vcfFile=./input_mrkh/mrkh_2022Mar.split.hard.AB.VQSR.QDSOR.samFlt.renew.vep.rmRepeat.recode.vcf.GT.gz \ --vcfFileIndex=./input_mrkh/mrkh_2022Mar.split.hard.AB.VQSR.QDSOR.samFlt.renew.vep.rmRepeat.recode.vcf.GT.gz.csi \ --vcfField=GT \ --SAIGEOutputFile=./output_mrkh/mrkh_2022Mar.split.hard.AB.VQSR.QDSOR.samFlt.renew.vep.rmRepeat.recode.vcf_groupTest_out.txt \ --LOCO=FALSE \ --chrom=chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22 \ --minMAF=0 \ --minMAC=0.5 \ --sampleFile=./input_mrkh/mrkh_step2_sample_list.txt\ --GMMATmodelFile=./output_mrkh/mrkh_2022Mar_binary_NULLGLMM.rda \ --sparseGRMFile=output_mrkh/mrkh_2022Mar.sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx \ --sparseGRMSampleIDFile=output_mrkh/mrkh_2022Mar.sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \ --groupFile=./input_mrkh/mrkh_weightfile_colon.txt \ --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous \ --maxMAF_in_groupTest=0.0001,0.001,0.01

Thanks! Shea

SheaCheng2000 commented 2 years ago

Hi Wei,

While I am still struggling with "The chunk is empty", I found if I do not use "--chrom", other errors will be reported. For example, my VEP annotation in the INFO field is invalid ( I don't know if long INFO annotation will have some bad influence on file reading??). And I am wondering why it does not report the issue with "--chrom" absense?

Thanks! Shea

weizhouUMICH commented 2 years ago

Hi Shea,

--chrom is only needed for single-variant assoc tests with VCF files are used and --chrom can only take one chromosome string for each job. For set-based tests, the program takes the chromosome information directly from the markers. That is to say, currently, the program assumes all markers in each set are from the same chromosome, which makes the leave-one-chromosome-out feature easier. I'm not quite sure about what causes the error for the INFO field. Since you are using GT in VCF file, can you convert the file to Plink files for jobs? Please use alt-first in the PLINK program.

Thanks, Wei

weizhouUMICH commented 2 years ago

Hi Shea,

You are right. I'm sorry for the confusion. The format in the file group_new_chrposa1a2_withWeights.txt has not been updated.

Thanks, Wei

SheaCheng2000 commented 2 years ago

Hi Wei,

Thank you so much for your quick response. It really helps a lot!

Now I change vcf file into plink file, and no more errors are reported. Now I think the only problem for me is about the WEIGHT in the group file (also this is what I am most concerned about in my burden analysis)-- When I use group file without weight, the analysis can be done successfully, BUT when I add weight in the group file, it reports:

Start extracting marker-level information from 'groupFile' of ./input_mrkh/mrkh_weightfile_colon_singlesymbol_rmdup_nochr.txt .... [1] "weights are included for markers" indexChunk is 0 indexChunk 0 nRegions 17982 36 markers in 'RegionFile' are not in 'GenoFile'. Read in 100 region(s) from the group file. [1] "Analyzing Region A1BG (1/17982)." Error in mainRegionInCPP(genoType, region$genoIndex_prev, region$genoIndex, : Not compatible with requested type: [type=character; target=double]. Calls: SPAGMMATtest -> SAIGE.Region -> mainRegionInCPP

I am confused about why it happens.

Hope you have a nice weekend!

Thanks, Shea

SheaCheng2000 commented 2 years ago

Hi Wei,

One Addition to my last question :)

I tried to run the example code with and without weight, and found the same error as mine.

When I ran this code without weighted group file (exactly the same as is provided in the tutorial: https://saigegit.github.io//SAIGE-doc/docs/set_step2.html), no error was reported.

Rscript step2_SPAtests.R \ --bedFile=./input/genotype_100markers.bed \ --bimFile=./input/genotype_100markers.bim \ --famFile=./input/genotype_100markers.fam \ --SAIGEOutputFile=./output/genotype_100markers_plink_groupTest_out.txt \ --chrom=1 \ --LOCO=TRUE \ --AlleleOrder=alt-first \ --minMAF=0 \ --minMAC=0.5 \ --sampleFile=./input/samplelist.txt \ --GMMATmodelFile=./output/example_binary_fullGRM.rda \ --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt \ --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx \ --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \ --groupFile=./input/group_new_chrposa1a2.txt \ --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous \ --maxMAF_in_groupTest=0.0001,0.001,0.01

But when I changed the group file into "group_new_chrposa1a2_withWeights.txt", error was reported! group_new_chrposa1a2_withWeights.txt

Error in mainRegionInCPP(genoType, region$genoIndex_prev, region$genoIndex, : Not compatible with requested type: [type=character; target=double]. Calls: SPAGMMATtest -> SAIGE.Region -> mainRegionInCPP

Full log infomation: log_SetbasedTest_Step2_withWeight.log

Looking forward to your reply :)

Thanks, Shea

saigegit commented 2 years ago

Hi Shea,

Thanks for trying the examples. I've fixed the issue in the new version 1.0.7. And the new example script can be found https://saigegit.github.io//SAIGE-doc/docs/set_step2.html (#7).

Btw, please check the additional message about using the sparse GRM in Step 2(WARNING before example #1) at the same doc page https://saigegit.github.io//SAIGE-doc/docs/set_step2.html.

Thanks, Wei

SheaCheng2000 commented 2 years ago

Hi Wei,

Thank you so much for the update!!! This week I have been trying to download your new version, but when I executed "R CMD INSTALL SAIGE", error happened (mainly the URL open error or fork error). I think it is because my Linux server is not stable enough these days. So I want to know if I can download the SAIGE R-package somewhere in my PC and then upload it to Linux server? (ps. unfortunately I cannot use docker because of some authority limit...)

Thanks, Shea

saigegit commented 2 years ago

Hi Shea,

I just attached a compressed file as a release v1.0.9. You may directly download the file from https://github.com/saigegit/SAIGE/releases/ using the command wget https://github.com/saigegit/SAIGE/releases/download/v1.0.9/SAIGE_1.0.9.tar.gz

Then run the command to install from the source code R CMD INSTALL SAIGE_1.0.9.tar.gz

Thanks, Wei

SheaCheng2000 commented 2 years ago

Hi Wei,

Thank you! I have downloaded SAIGE v1.0.9 successfully.

When I run step2 test with weighted groupfile using my own data, it reports such error:

...
[1] "Analyzing Region ACOX1 (248/17982)."
[1] "Analyzing Region ACOX2 (249/17982)."
[1] "Analyzing Region ACOX3 (250/17982)."
[1] "Analyzing Region ACOXL (251/17982)."
[1] "Analyzing Region ACP1 (252/17982)."
Error in if (abs(q - m1)/sqrt(var1) < Cutoff) { : 
   missing value where TRUE/FALSE needed
 Calls: SPAGMMATtest ... SAIGE.Region -> get_newPhi_scaleFactor -> <Anonymous>
 Execution halted

I found a similar issue here https://github.com/saigegit/SAIGE/issues/12. Also, the places that error arises depend on --annotation_in_groupTest. When I changed -annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous into -annotation_in_groupTest=lof, no error happens.

My group file is attached here: mrkh_weighted_groupfile.txt

Another question is that when I add --r.corr=1 to perform only burden test, error arises:

Error in getopt_options(object, args) : Error in getopt(spec = spec, opt = args) : "–-r.corr=1" is not a valid option, or does not support an argument Calls: parse_args -> parse_options -> getopt_options

Thanks again for your kind attention! Shea

saigegit commented 2 years ago

Hi Shea,

Thanks for reporting the issue! I will take a look.

For the burden only test, can you please have a check to see the the Step 2 wrapper file step2_SPAtests.R is the most updated version? https://github.com/saigegit/SAIGE/blob/main/extdata/step2_SPAtests.R

and try the example 4

https://saigegit.github.io//SAIGE-doc/docs/set_step2.html#step-2-performing-the-region--or-gene-based-association-tests

can you have a check to see if the v1.0.9 is used? The first few lines in the log file contain the information/

Loading required package: RhpcBLASctl
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)

Matrix products: default
BLAS/LAPACK: /stanley/genetics/users/wzhou/.conda/envs/RSAIGE/lib/libopenblasp-r0.3.20.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.14.2      optparse_1.7.1         RhpcBLASctl_0.21-247.1
[4] SAIGE_1.0.9

Thanks, Wei

SheaCheng2000 commented 2 years ago

Hi Wei,

I have followed your suggestion and --r.corr works normally! I think maybe something went wrong in my previous command or environment and thus caused the error. Sorry about that!

Moreover, after adding --r.corr=1 , the first error ( Error in if (abs(q - m1)/sqrt(var1) < Cutoff) )did NOT happen anymore when I used --annotation_in_groupTest=lof,missense:lof,missense:lof:synonymous! So I am wondering if some issue still exist in SKAT-O, just as you mentioned before in https://github.com/saigegit/SAIGE/issues/12?

Thanks, Shea

surakshavinod commented 1 year ago

Hi, has the "chunk is empty" issue been solved? I'm facing the same issue wherein an empty output file is being generated.

The following is displayed in the log repeatedly as each chunk is analyzed, so ultimately a blank file with only the header is getting generated as an output:

x markers in 'RegionFile' are not in 'GenoFile'.
Read in  100  region(s) from the group file.
The chunk is empty

The groupfile is in the right format (chr:pos:ref:alt), so I'm unsure as to where I'm going wrong. It's probably worth mentioning that I'm using the same plink files used as input in step 1 for step 2 as well. Is this where I'm making a mistake?

Thanks, Suraksha

SheaCheng2000 commented 1 year ago

Hi Suraksha, @surakshavinod

I solved this problem after using a newer version of SAIGE. 😂 It seems that the plink files in step 1 are just 1000 random markers for variance ratio estimation, as described here https://saigegit.github.io/SAIGE-doc/docs/set_step1.html. But the plink files in step 2 are the complete files to run burden.

Hope it helps!

Shea

surakshavinod commented 1 year ago

Hi @SheaCheng2000,

Thank you so much for your quick reply! The problem is I'm getting a blank output file both times, i.e., when I'm using the complete plink file for that particular chromosome in both step 1 and step 2, and when I use random markers plink file in step 1 and the complete plink file for that chromosome (say 21) in step 2, so I'm not sure where exactly I'm going wrong.

Thanks, Suraksha

saigegit commented 1 year ago

Hi @surakshavinod,

Can you please make sure the IDs in the group file matches the ID column in the PLINK file in step 2? Pleas try the most recent version.

Thanks, Wei