vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0
31 stars 4 forks source link

gene set based burden test #119

Open qserenali opened 5 years ago

qserenali commented 5 years ago

I did a test run with a single chromosome data on gene set based burden test, knowing that genes belonging to the same pathway may be from different chromosomes. The test run worked.

more ~/test/chr22_6/CFisher4.txt
WARNING: Sample 11882X3 is ignored due to missing value for phenotype affection
WARNING: Sample 201001917-93213 is ignored due to missing value for phenotype affection
INFO: 809 samples are found
INFO: 136 groups are found
INFO: Starting 8 processes to load genotypes
Loading genotypes:  55.5% [==============>            ] 449 45.2/s in 00:00:08
Loading genotypes: 100% [=============================] 809 39.9/s in 00:00:00
 39.9/s in 00:00:20
Testing for association: 100% [=======================] 136 34.3/s in 00:00:03
keggpathway_kgdesc                                              sample_size_CFisher     num_variants_CFisher    total_mac_CFisher       statistic_CFisher   pvalue_CFisher
Acute myeloid leukemia                                          517                     6                       7                       2.06989             0.465952
Adherens junction                                               807                     42                      105                     1.09034             0.728531
Alanine, aspartate and glutamate metabolism                     756                     6                       9                       0.833929            1
Adipocytokine signaling pathway                                 807                     25                      53                      0.70144

Now I created the full data set with all 22 chromosome data, and re-run the analysis. Surprisingly, only one gene set returned association results and most gave the following error message.

I see the intermediate query as follow, but I don't fully understand it. Does the burden test supports group by KeggID for example? Thanks!

Running query INSERT INTO __asso_tmp SELECT DISTINCT variants_w_kgID.variant_id, 0, keggPathway.keggPathway.KgID  FROM variants_w_kgID, variant, ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd, ccdsGene.ccdsGene, keggPathway.keggPathway WHERE (variants_w_kgid.variant_id = variant.variant_id) AND (variant.bin = ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd.bin AND variant.chr = ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd.chr AND variant.pos >= ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd.start AND variant.pos <= ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd.end ) AND (ccdsGene.ccdsGene.rowid = ccdsGene.__rng_ccdsGene_hg19_chr_cdsStart_cdsEnd.range_id) AND (keggPathway.keggPathway.ccdsId=ccdsGene.ccdsGene.name);
$ tail main.log
2019-09-23 20:27:28,332: DEBUG: An ERROR has occurred in process 7 while processing 'Oocyte meiosis': Sample size too small (0) to be analyzed for 'Oocyte meiosis'.
2019-09-23 20:27:33,565: DEBUG: An ERROR has occurred in process 6 while processing 'Pancreatic cancer': Sample size too small (0) to be analyzed for 'Pancreatic cancer'.
2019-09-23 20:27:51,297: DEBUG: An ERROR has occurred in process 2 while processing 'Olfactory transduction': Sample size too small (0) to be analyzed for 'Olfactory transduction'.
2019-09-23 20:28:00,441: DEBUG: An ERROR has occurred in process 0 while processing 'Pantothenate and CoA biosynthesis': Sample size too small (0) to be analyzed for 'Pantothenate and CoA biosynthesis'.
2019-09-23 20:29:17,703: DEBUG: An ERROR has occurred in process 1 while processing "Parkinson's disease": Sample size too small (0) to be analyzed for "Parkinson's disease".
gaow commented 5 years ago

Does the burden test supports group by KeggID for example? Thanks!

@qserenali it should support grouping by anything.

Surprisingly, only one gene set returned association results

How about those 4 genes sets that used to work on chrom 22? From the face value of the error message it simply says all your samples are missing variants in those genes from that pathway, which doesnt seem likely right? A method to debug would be first create a variant table involving one gene set in question then you count number of variants and number of samples within it.

qserenali commented 5 years ago

Thanks for the tip. I think I figured out. When I created the project using the child projects, vtools thinks there are 81122 samples and hence missing = 81122-809=17033 for the first row of data below. For the sample table sample_name is the same for the data from 22 imports. Is there a way to make vtools know that those data belong to the same subject? Or l have to import the vcf using one file only? I am surprised though that there was one gene set that made it through.

keggpathway_kgid sample_size_CFisher num_variants_CFisher total_mac_CFisher statistic_CFisher pvalue_CFisher hsa00472 803 7 13 1.14471 0.778341

vtools init main --children ../chr1 ../chr2 ../chr3 ../chr4 ../chr5 ../chr6 ../chr7 ../chr8 ../chr9 ../chr10 ../chr11 ../chr12 ../chr13 ../chr14 ../chr15 ../chr16 ../chr17 ../chr18 ../chr19 ../chr20 ../chr21 ../chr22 [qli2@awsahenva1007 main]$ vtools output hsa04740_variants chr pos ref alt total wildtype mutants missing num hom het other -l 20 1 949472 G A 809 809 0 17033 0 0 0 0 1 949491 G A 809 808 1 17033 1 0 1 0 1 949597 C T 809 789 20 17033 20 0 20 0

[qli2@awsahenva1007 main]$ vtools show samples -l 25 sample_name filename phenotype affection _merge_from 100016 /home/ql....recode.vcf.gz Case 2 chr1 100016 /home/ql....recode.vcf.gz Case 2 chr2 100016 /home/ql....recode.vcf.gz Case 2 chr3 100016 /home/ql....recode.vcf.gz Case 2 chr4 100016 /home/ql....recode.vcf.gz Case 2 chr5 100016 /home/ql....recode.vcf.gz Case 2 chr6 100016 /home/ql....recode.vcf.gz Case 2 chr7 100016 /home/ql....recode.vcf.gz Case 2 chr8 100016 /home/ql....recode.vcf.gz Case 2 chr9 100016 /home/ql....recode.vcf.gz Case 2 chr10 100016 /home/ql....recode.vcf.gz Case 2 chr11 100016 /home/ql....recode.vcf.gz Case 2 chr12 100016 /home/ql....recode.vcf.gz Case 2 chr13 100016 /home/ql....recode.vcf.gz Case 2 chr14 100016 /home/ql....recode.vcf.gz Case 2 chr15 100016 /home/ql....recode.vcf.gz Case 2 chr16 100016 /home/ql....recode.vcf.gz Case 2 chr17 100016 /home/ql....recode.vcf.gz Case 2 chr18 100016 /home/ql....recode.vcf.gz Case 2 chr19 100016 /home/ql....recode.vcf.gz Case 2 chr20 100016 /home/ql....recode.vcf.gz Case 2 chr21 100016 /home/ql....recode.vcf.gz Case 2 chr22 100387 /home/ql....recode.vcf.gz Case 2 chr1

From: gaow notifications@github.com Sent: Monday, September 23, 2019 5:59 PM To: vatlab/varianttools varianttools@noreply.github.com Cc: Li, Qingqin [JRDUS] QLi2@its.jnj.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [vatlab/varianttools] gene set based burden test (#119)

Does the burden test supports group by KeggID for example? Thanks!

@qserenalihttps://github.com/qserenali it should support grouping by anything.

Surprisingly, only one gene set returned association results

How about those 4 genes sets that used to work on chrom 22? From the face value of the error message it simply says all your samples are missing variants in those genes from that pathway, which doesnt seem likely right? A method to debug would be first create a variant table involving one gene set in question then you count number of variants and number of samples within it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/vatlab/varianttools/issues/119?email_source=notifications&email_token=ADHLYTM2PTLMTUCRT2WND23QLE33NA5CNFSM4IZRS5CKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7MMZSA#issuecomment-534301896, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADHLYTN2DHYT2TWM3VQEO6LQLE33NANCNFSM4IZRS5CA.

qserenali commented 5 years ago

For the single gene set that worked, it was because it contained only one gene. [qli2@awsahenva1007 main]$ vtools output hsa00472_variants chr pos ref alt total wildtype mutants missing num hom het other -l 20 12 109283278 C T 809 809 0 17033 0 0 0 0 12 109284027 T C 810 808 2 17032 2 0 2 0 12 109293187 G A 806 805 1 17036 1 0 1 0 12 109294252 C T 811 811 0 17031 0 0 0 0 12 109278806 A T 809 807 2 17033 2 0 2 0 12 109283273 G A 809 808 1 17033 1 0 1 0 12 109292482 C T 810 805 5 17032 5 0 5 0 12 109294209 C T 811 810 1 17031 1 0 1 0 12 109294301 C T 811 810 1 17031 1 0 1 0

From: gaow notifications@github.com Sent: Monday, September 23, 2019 5:59 PM To: vatlab/varianttools varianttools@noreply.github.com Cc: Li, Qingqin [JRDUS] QLi2@its.jnj.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [vatlab/varianttools] gene set based burden test (#119)

Does the burden test supports group by KeggID for example? Thanks!

@qserenalihttps://github.com/qserenali it should support grouping by anything.

Surprisingly, only one gene set returned association results

How about those 4 genes sets that used to work on chrom 22? From the face value of the error message it simply says all your samples are missing variants in those genes from that pathway, which doesnt seem likely right? A method to debug would be first create a variant table involving one gene set in question then you count number of variants and number of samples within it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/vatlab/varianttools/issues/119?email_source=notifications&email_token=ADHLYTM2PTLMTUCRT2WND23QLE33NA5CNFSM4IZRS5CKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7MMZSA#issuecomment-534301896, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADHLYTN2DHYT2TWM3VQEO6LQLE33NANCNFSM4IZRS5CA.