rounakdey / FastSparseGRM

Efficiently calculate ancestry-adjusted sparse GRM
MIT License
2 stars 1 forks source link

Fail to get correct result from king (Step 1) #7

Closed lxwgcool closed 1 year ago

lxwgcool commented 1 year ago

Dear Author,

I am currently running STAARpipeline for our research work. Since STAARpipeline recommends us use "FastSparseGRM" to calculate genetic principal components (PCs) and the ancestry-adjusted sparse genetic relatedness matrix (GRM), I tried to use "FastSparseGRM" to get the related results.

I am currently following the steps described in the section “How to run the FastSparseGRM pipeline”.

In order to run the step1, we need to obtain the plink bed file. The command line below was used to get this set of bed result from VCF instead of GDS.

plink --vcf ${VCFFile} --double-id --make-bed --out ${UnFilteredBed}

Actually, we do have the GDS file. The reason we do not use the way you recommended in repo is that our GDS file does not have the node "annotation/info/AVGDP" and "annotation/info/AC".

The size of bed file we got from the plink does make sense. image

After we got the bed file, we tried to run step1 by using the command line below:

king -b ${BEDFile} --ibdseg --degree 4 --cpus 1 --prefix ${OUTPUT}

The output is below: image

You can see the .seg file only contain the caption line without any real content. Here is the output and log file: step2.tar.gz

Since the output of this step is unexpected, all remaining steps are stuck. Could you please take a look and provide us some suggestions.

Thank you so much Bes regards Xin

rounakdey commented 1 year ago

From the output files you shared, I do not see anything wrong in KING output. It is possible KING did not find any related pair in your dataset. Do you suspect there might be related subjects present? Or was the sampling scheme somehow ensured only unrelated subjects were included in the study already, or some previous filtering of related subjects was already done? You can try running the king step again to see if it was some file writing malfunction in the computing cluster, or can run KING with the option -kinship instead of -ibdseg to see what -kinship option finds. Other investigative explorations can be done by setting degree more than 4. From what I see, it just seems that the subjects are most likely unrelated already. If you also come to that conclusion after running the above-mentioned exploration, then you cannot proceed with the sparse GRM calculation, as you do not need an LMM to fit the model. Just use traditional linear regression framework for that.

lxwgcool commented 1 year ago

Dear Rounak,

Thank you so much for your prompt reply. Yes, I do think we should have some related pairs in our dataset, since the number of variants and samples in our dataset is pretty large. I have solved this issue and successfully got the result from KING. However, I got the error when I run PCA by using "runPCA_wrapper.R". Please check the details below

Something about running the KING

Case 1 (Failed) The bed file I used contains the variant from multiple chromosomes that start from 1 (include chromosome 1, and10-19; this is the mistake in my script when I tried to grep the target chromosome from whole genome vcf file (I was using "“~ /chr1/” instead of "$1 == “chr1”)). This bed file gives me nothing from the KING.

Case 2 (Failed) As you mentioned in readme "FastSparseGRM pipeline requires data from all 22 autosomes in a single BED file". As a result, I converted the whole genome vcf file (including all 22 + sex chromosomes) to plink bed file, and then use this bed file as an input to run KING. Again get nothing.

Case 3 (Successful) In this case, I grabed the variants only from chromosome 1 and convert it from VCF to plink bed file. When I use this bed file to run KING, I can get the reasonable output successfully. Please check the attached result below (please check the folder step2/Step2/output). step2.tar.gz

I have two questions below:

  1. I am a little bit confused that why Case3 works while Case 1 and Case 2 failed. Could you provide some comments?
  2. As you mentioned that "FastSparseGRM pipeline requires data from all 22 autosomes in a single BED file", I just wonder if we really need to use the bed file that contains the whole chromosome? Any reason we need to use the whole chromosome in a single bed file? Based on my current testing, it looks like that single chromosome (e.g. chrom 1) works fine as well.

The Error I got while run PCA

I got the error "Error in svd(H, nv = 0) : infinite or missing values in 'x'" when I run "runPCA_wrapper.R". Please check the log file below: step5.tar.gz

I noticed that you mentioned "Keeping only common variants (MAF > 0.01, or 0.05 for smaller sample-size) in the BED file is recommended". We use plink to convert vcf to bed file, and the default value of "--blocks-min-maf = 0.05" has been used. Therefore, I think our bed file should have already been satisfied this requirement.

Since the output of this step is the input of the next step "Calculate Sparse GRM". Could you help us take a look at this issue?

Please feel free to let me know if you need any intermediate outputs.

Thank you so much Best Xin

rounakdey commented 1 year ago

The PCA error can be quickly fixed. It is indeed due to monomorphic SNPs present in the BED file. The --blocks-min-maf argument does not do MAF-based filtering, it is for selecting haplotypes. The argument that is appropriate here is --maf. Just run that filtering on the final BED file and you should be good to go.

On the previous questions:

  1. I am confused about this too. Did you include sex chromosomes in Case 2? Can you try excluding them?
  2. Even though running KING based on a single chromosome seemingly works fine in this case, genome-wide information makes the IBD segment estimation more accurate. So I will still suggest using genome-wide (22 autosomes) information. Maybe instead of --ibdseg, use --kinship mode in KING to run the analysis, just to see if that also produces 0 related pairs. I suspect something unexpected is happening when merging the chromosomes, maybe chromosome labels and/or the variant positions are being affected. We got to be careful, especially when we are seeing unexpected results (or maybe the results are fine, the subjects are indeed unrelated).
lxwgcool commented 1 year ago

Thanks Rounak,

I would like to focus on the PCA issue first, since getting the final running results is our first priority. I have regenerated the filtered BAM by using the "--maf 0.05". The bed file generated this time is only 20M (the unfiltered is 150M).

Then I repeated the step 1 to step 5. However, I still get the same error in the step of PCA. Please check the attachment below slurm-4147302.zip

I think this may be the code issue instead of input data, since all iterations have been completed successfully based on the log file. Could you help us take a look and solve this issue?

Thanks Best Xin

rounakdey commented 1 year ago

I am quite positive that it is due to filtering. It is a very common filtering issue that I have seen with monomorphic SNPs (and yes, all 10 iterations complete and then it shows the error, exactly like this). It is basically due to monomorphic SNPs have MAF 0, so when standardizing the genotypes, it results in a division by 0, which leads to NaN values in the matrix. The iterations are basically a bunch of matrix vector multiplications which also goes through with those NaN values without an issue, but when finally running the SVD, it throws error for that. Please check again by inspecting the allele counts. One subtle thing, the only condition is that there has to be at least one minor allele in the unrelated subjects. Typically, setting minimum MAF 0.05 solves this. I am suspecting two possible reasons:

  1. The filtering is not properly done. Please make sure the allele counts are not 0 for any SNP.
  2. One possibility is that one of the SNPs have extremely high missingness. In an extreme example, if a SNP's genotype is missing for all the subjects except 2, and one of them is allele count 1, and the other 0, then the MAF will be 0.5. Please check to avoid these situations by also using a missingness filter --geno 0.1 when filtering.
rounakdey commented 1 year ago

Alternatively, to MAF-based filtering, maybe try a MAC-based filtering instead, so that you don't need to worry about the missingness. You can try using "--mac 50" instead of "--maf 0.05 --geno 0.1". All the software wants is not having any SNP that is monomorphic among the unrelated subjects.

lxwgcool commented 1 year ago

Dear Rounak,

Thank you so much for your suggestion.

After adding "--geno 0.1" into the plink command line ("--maf 0.05 --geno 0.1" was applied), I can get both PCA and GRM outputs successfully.

Thanks so much for your help. I will continue the remaining steps of STAARpipeline and let you know if I have any additional questions related to the results of FastSparseGRM.

Best Xin

rounakdey commented 1 year ago

Glad to know that it worked.

lxwgcool commented 1 year ago

I would like to reopen this ticket.