rounakdey / FastSparseGRM

Efficiently calculate ancestry-adjusted sparse GRM
MIT License

trouble with KING #11

Closed wangshuang2024 closed 9 months ago

wangshuang2024 commented 9 months ago

@rounakdey hello, I have some trouble running the FastSparseGRM pipeline. When I use king --ibdseg --degree 4 to create the KING output, I get only a few related-individual results. Thanks for your help.

rounakdey commented 9 months ago

How many subjects and how many SNPs are in the BED file you are using for KING? It is possible to have very few related individuals if in reality there are very few related individuals. Could you please share the KING log output. Thanks.

wangshuang2024 commented 9 months ago

> How many subjects and how many SNPs are in the BED file you are using for KING? It is possible to have very few related individuals if in reality there are very few related individuals. Could you please share the KING log output. Thanks.

PLINK pedigrees loaded: 916 samples
Genotype data consist of 8934264 autosome SNPs

rounakdey commented 9 months ago

Firstly, I see that you used the king --related command for this. Please use king --ibdseg instead so the output is in the proper format required for downstream analysis. Secondly, 25 pairs would roughly mean around 40-50 individuals are related to at least one other individual. That is a reasonable number in a cohort study of 916 subjects. Do you suspect that it should be much higher for some reason?
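As a sketch, the recommended invocation might look like the following; the file name and output prefix are placeholders for your own PLINK fileset, not names from this thread:

```shell
# Placeholder file names; substitute your own PLINK .bed prefix.
# --ibdseg writes the .seg file of related pairs that FastSparseGRM's
# downstream steps expect; --degree 4 reports relatedness up to 4th degree.
king -b qc.bed --ibdseg --degree 4 --prefix king_ibdseg
```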

wangshuang2024 commented 9 months ago

> Firstly, I see that you used the king --related command for this. Please use king --ibdseg instead so the output is in the proper format required for downstream analysis. Secondly, 25 pairs would roughly mean around 40-50 individuals are related to at least one other individual. That is a reasonable number in a cohort study of 916 subjects. Do you suspect that it should be much higher for some reason?

Can I use this result to run the rest of the steps?

Thanks for your reply. Actually, I tried both of these commands, because this is my first time using king and I don't know what the complete output should look like. I wonder why, after setting '--degree 4', the unrelated individuals are not listed with a low or even negative value. So, you mean the result is right?

$ less GRM.seg
FID1        ID1         FID2        ID2         IBD1Seg IBD2Seg PropIBD InfType
D2007018876 D2007018876 D2012039814 D2012039814 0.2481  0.0000  0.1241  3rd
D2012039815 D2012039815 D2012039816 D2012039816 0.5104  0.1688  0.4240  FS
D2109086350 D2109086350 D2109086351 D2109086351 0.4871  0.1763  0.4199  FS
D2012039911 D2012039911 D2012039912 D2012039912 0.4729  0.2310  0.4675  FS
D2012039944 D2012039944 D2012039945 D2012039945 0.0212  0.9464  0.9570  Dup/MZ
D2111095453 D2111095453 D2111095454 D2111095454 0.6256  0.1036  0.4164  FS
D2111095455 D2111095455 D2111095471 D2111095471 0.2894  0.0011  0.1458  3rd
D2111095482 D2111095482 D2111095490 D2111095490 0.2294  0.0000  0.1147  3rd
D2111095482 D2111095482 D2111095493 D2111095493 0.2609  0.0000  0.1305  3rd
D2111095490 D2111095490 D2111095493 D2111095493 0.5031  0.1966  0.4481  FS
D2109086168 D2109086168 D2109086169 D2109086169 0.5009  0.1239  0.3744  FS
D2109086174 D2109086174 D2109086139 D2109086139 0.1476  0.0000  0.0738  4th
D2109086213 D2109086213 D2207136444 D2207136444 0.1044  0.0000  0.0522  4th
D2109086214 D2109086214 D2109086215 D2109086215 0.0749  0.8495  0.8870  Dup/MZ
D2109086241 D2109086241 D2109086245 D2109086245 0.4216  0.2747  0.4855  FS
D2109086303 D2109086303 D2109086306 D2109086306 0.4416  0.3129  0.5337  FS
D2012039994 D2012039994 D2012040002 D2012040002 0.5116  0.2172  0.4730  FS
D2012040140 D2012040140 D2012040141 D2012040141 0.0543  0.9016  0.9288  Dup/MZ
D2012040170 D2012040170 D2012040171 D2012040171 0.5305  0.2697  0.5350  FS
D2012040182 D2012040182 D2012040183 D2012040183 0.5250  0.2549  0.5174  FS
D2012040198 D2012040198 D2012040199 D2012040199 0.0376  0.9148  0.9336  Dup/MZ
D2012040203 D2012040203 D2012040204 D2012040204 0.5118  0.1919  0.4478  FS
D2111095511 D2111095511 D2111095512 D2111095512 0.4233  0.3285  0.5402  FS
D2111095539 D2111095539 D2111095545 D2111095545 0.4681  0.2434  0.4775  FS
D2109086042 D2109086042 D2109086043 D2109086043 0.4952  0.2002  0.4478  FS
D2109086058 D2109086058 D2109086059 D2109086059 0.5495  0.1604  0.4352  FS
D2109086076 D2109086076 D2109086077 D2109086077 0.4712  0.1979  0.4335  FS

wangshuang2024 commented 9 months ago

> Firstly, I see that you used the king --related command for this. Please use king --ibdseg instead so the output is in the proper format required for downstream analysis. Secondly, 25 pairs would roughly mean around 40-50 individuals are related to at least one other individual. That is a reasonable number in a cohort study of 916 subjects. Do you suspect that it should be much higher for some reason?

Thank you. Following the pipeline, I successfully got to the next step. But I got another error in step 4.

runPCA(prefix.in,file.unrels,no_pcs,num_threads,prefix.out,no_iter)
[1] "PCA started at 2023-12-07 15:18:10"
[1] "26 relateds, 890 unrelateds, and 8934264 variants"
Reading the file: qc_4.bed
Reading BED file complete
Number of SNPs read: 8934264
Number of samples: 916
Using 36 threads
Processing samples complete
Allele frequencies calculated
[1] "Reading Completed in 5 seconds"
[1] "1 iterations completed"
[1] "2 iterations completed"
[1] "3 iterations completed"
[1] "4 iterations completed"
[1] "5 iterations completed"
[1] "6 iterations completed"
[1] "7 iterations completed"
[1] "8 iterations completed"
[1] "9 iterations completed"
[1] "10 iterations completed"
[1] "Iterations Completed in 391 seconds"
Error: cannot create std::vector larger than max_size()
Execution halted

R CMD BATCH --vanilla '--args --prefix.in qc_4 --file.unrels step3.unrels --prefix.out step4 --no_pcs 20 --num_threads 36 --no_iter 10' /biosofts/FastSparseGRM/extdata/runPCA_wrapper.R runPCA.Rout

The quality control for my bfile was --maf 0.01 --geno 0.05. Could you tell me how to solve this problem? Thanks!

rounakdey commented 9 months ago

About the KING output: This output looks reasonable to me. It does not list unrelated individuals or pairs; it only lists related pairs. This is on purpose, to keep the output file short and save storage/memory: otherwise the file would need N*(N-1)/2 lines, which costs huge memory and storage at large sample sizes, and it is easy to infer the unrelated subjects from just the list of related pairs.
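The inference described above can be sketched in a few lines of R. The IDs and data frame below are toy placeholders, standing in for the sample list from the .fam file and the pairs from the KING .seg file:

```r
# Sketch: recover the unrelated set from a list of related pairs.
# 'all_ids' and 'seg' are toy placeholders for the .fam sample list
# and the KING .seg output, respectively.
all_ids <- c("S1", "S2", "S3", "S4", "S5")
seg <- data.frame(ID1 = c("S1", "S3"), ID2 = c("S2", "S4"),
                  stringsAsFactors = FALSE)

related_ids   <- union(seg$ID1, seg$ID2)        # anyone appearing in a pair
unrelated_ids <- setdiff(all_ids, related_ids)  # everyone else
# unrelated_ids is "S5"
```

So every sample that never appears in any .seg row can be treated as unrelated, which is why the file only needs to list related pairs.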

rounakdey commented 9 months ago

About the PCA step: 8.9 million is too many variants for PCA. Please filter to common variants only (--maf 0.05) and lightly LD-prune them (--indep-pairwise 50 5 0.2) before using them for PCA (and also for step 5). A good number of variants for PCA would be 100,000-200,000. Also, you don't need to use the same plink file for steps 1-3 and steps 4-5. In fact, it is recommended that they be different: you can use more variants with a lower MAF filter and no pruning for KING, and then use the filtered, pruned variants for PCA and the sparse GRM calculation.
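A minimal sketch of that filtering with PLINK, assuming the fileset prefix qc_4 from the log above (the output prefixes here are made up):

```shell
# Keep common variants and compute an LD-pruned SNP list
# (window 50, step 5, r^2 threshold 0.2).
plink --bfile qc_4 --maf 0.05 --indep-pairwise 50 5 0.2 --out prune
# Extract the pruned-in SNPs into a new fileset for the PCA / sparse GRM steps.
plink --bfile qc_4 --extract prune.prune.in --make-bed --out qc_pca
```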

wangshuang2024 commented 9 months ago

> About the PCA step: 8.9 million is too many variants for PCA. Please filter to common variants only (--maf 0.05) and lightly LD-prune them (--indep-pairwise 50 5 0.2) before using them for PCA (and also for step 5). A good number of variants for PCA would be 100,000-200,000. Also, you don't need to use the same plink file for steps 1-3 and steps 4-5. In fact, it is recommended that they be different: you can use more variants with a lower MAF filter and no pruning for KING, and then use the filtered, pruned variants for PCA and the sparse GRM calculation.

Thank you very much. I sincerely appreciate your help. I got the final results!

wangshuang2024 commented 9 months ago

Sorry to bother you again. I have another problem: the dimnames of the final result differ from my sequence IDs. How can I keep a single ID, instead of FID_IID?

sGRM@Dimnames[[1]]
[1] "D2007018876_D2007018876" "D2012039814_D2012039814" "D2012039815_D2012039815"
[4] "D2012039816_D2012039816" "D2109086350_D2109086350" "D2109086351_D2109086351"
[7] "D2012039911_D2012039911" "D2012039912_D2012039912" "D2012039944_D2012039944"
[10] "D2012039945_D2012039945" "D2111095453_D2111095453" "D2111095454_D2111095454"
[13] "D2111095455_D2111095455" "D2111095471_D2111095471" "D2111095482_D2111095482"
[16] "D2111095490_D2111095490" "D2111095493_D2111095493" "D2109086168_D2109086168"
[19] "D2109086169_D2109086169" "D2109086174_D2109086174" "D2109086139_D2109086139"
[22] "D2109086214_D2109086214" "D2109086215_D2109086215" "D2109086241_D2109086241"
[25] "D2109086245_D2109086245" "D2109086303_D2109086303" "D2109086306_D2109086306"

xihaoli commented 9 months ago

Hi @wangshuang2024,

In your example, the FID and IID are all identical. If you want to keep a single ID, you can do the following:

sample_id <- unlist(lapply(strsplit(colnames(sGRM),"_"),`[[`,1))
colnames(sGRM) <- sample_id
rownames(sGRM) <- sample_id
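One caveat: splitting on "_" assumes the IDs themselves contain no underscores. Since FID and IID are identical here, a backreference pattern that strips one copy of the duplicated ID is a safer alternative (the label below is just an example value from this thread):

```r
# Example label; when FID == IID, "FID_IID" is just the ID written twice.
nm <- "D2007018876_D2007018876"
# Keep a single copy of the ID; the backreference \1 requires perl = TRUE.
single <- sub("^(.*)_\\1$", "\\1", nm, perl = TRUE)
# single is "D2007018876"
```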

Let us know if you have any questions.

Best, Xihao

wangshuang2024 commented 9 months ago

> Hi @wangshuang2024,
>
> In your example, the FID and IID are all identical. If you want to keep a single ID, you can do the following:
>
> sample_id <- unlist(lapply(strsplit(colnames(sGRM),"_"),`[[`,1))
> colnames(sGRM) <- sample_id
> rownames(sGRM) <- sample_id
>
> Let us know if you have any questions.
>
> Best, Xihao

Thank you, Xihao. All of my problems have been solved.

xihaoli commented 9 months ago

You are welcome, @wangshuang2024. Great to know.

rounakdey commented 9 months ago

Thank you all. Closing this issue.