pcingola / SnpEff


multithreaded snpEff or merging existing and new vcf #285

Closed intikhab closed 3 years ago

intikhab commented 3 years ago

Dear @SnpEff team,

We process a large number of isolates of a virus daily and produce an annotated VCF using SnpEff. The number of samples is now reaching 300,000, and annotating the VCF with SnpEff takes more than half a day. Is it possible to incrementally add newly SnpEff-annotated samples to the existing large SnpEff-annotated VCF?

Or could we process the combined big VCF through SnpEff using multiple threads? The -t option always fails in the latest version.

Any suggestions?

Thanks,

IA

pcingola commented 3 years ago

How do you process these 300,000 samples per day?

Additional information:

intikhab commented 3 years ago

Dear Pablo,

We save a SnpEff-annotated VCF from the latest data (a small set of samples from the daily updates) and the VCF processed earlier (a large VCF in which all previous samples are stored).

If we could combine these two VCFs, it could save some time. At present we run SnpEff afresh on one raw VCF of all samples (small + large sample set), so that we obtain a final annotated VCF, the summary HTML, and the genes text file.

An example summary file is shown at the link below:

https://www.cbrc.kaust.edu.sa/covmt/data/Variants/snpEff_summary.html?

Exact command line looks like:

SnpEff -no-upstream -no-utr -no-downstream -classic -i vcf Ref Raw_vcf

Daily updates sometimes have 100 samples, sometimes >5,000 samples. The total is now touching ~300,000 samples.

We increased the memory of the server running SnpEff, but it still takes ~3-6 hours.
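[Editor's note: much of that time is likely spent parsing the ~300,000 genotype columns, which SnpEff does not need for effect annotation; the effect depends only on the fixed site columns (CHROM, POS, REF, ALT, ...). A toy illustration of the idea below, using `cut` on a made-up 3-sample VCF; a real pipeline would use `bcftools view -G` instead, and the file names here are hypothetical.]

```shell
# Build a toy 3-sample VCF (tab-separated), then keep only the 8 fixed
# site columns that SnpEff actually needs for effect annotation.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' > toy.vcf
printf 'NC_045512.2\t241\t.\tC\tT\t60\tPASS\t.\tGT\t1\t1\t0\n' >> toy.vcf
printf 'NC_045512.2\t3037\t.\tC\tT\t55\tPASS\t.\tGT\t1\t0\t1\n' >> toy.vcf

# Drop FORMAT and all genotype columns (columns 9 onward)
cut -f1-8 toy.vcf > toy.sites.vcf
cat toy.sites.vcf
```

The stripped file has one short line per variant site (~29,000 sites in this dataset, per the shared summary) instead of lines with 300,000 genotype columns, so SnpEff's run time stops growing with the sample count.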

I tried the multithreaded option in a previous run, but it does not work.

If we could merge the existing large SnpEff VCF from the previous day's processing with the new small SnpEff VCF from the daily update, it could save us some time.
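[Editor's note: one way to get an incremental merge plus annotation is sketched below. This is not an official SnpEff workflow; the file names are hypothetical, and it assumes bcftools and bgzip are available and the inputs are bgzipped and indexed. The idea: merge only the raw genotype columns, run SnpEff on a sites-only copy, then copy the ANN field back onto the merged file.]

```shell
# 1. Merge the new samples' raw VCF into yesterday's raw multi-sample VCF
bcftools merge all_previous.raw.vcf.gz new_samples.raw.vcf.gz -Oz -o combined.raw.vcf.gz
bcftools index -t combined.raw.vcf.gz

# 2. Annotate a sites-only copy (-G drops all genotype columns, so SnpEff
#    parses ~29,000 short site lines instead of 300,000-column lines).
#    "Ref" is the genome database name, as in the command above.
bcftools view -G combined.raw.vcf.gz > sites.vcf
java -jar snpEff.jar -no-upstream -no-utr -no-downstream -classic -i vcf Ref sites.vcf > sites.ann.vcf
bgzip -f sites.ann.vcf && bcftools index -t sites.ann.vcf.gz

# 3. Transfer the ANN field onto the full multi-sample VCF
bcftools annotate -a sites.ann.vcf.gz -c INFO/ANN combined.raw.vcf.gz -Oz -o combined.ann.vcf.gz
```

The summary HTML and genes file then come from the small sites-only run in step 2; per-variant statistics should be unchanged, but anything computed per genotype would differ.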

Best,

IA

--

Intikhab Alam, PhD

Research Scientist, Computational Bioscience Research Centre (CBRC), Building #3, Office #4328, 4700 King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, KSA W: http://www.kaust.edu.sa T: +966 (0) 2 808-2423 F: +966 (2) 802 0127



pcingola commented 3 years ago

If I understand correctly, you are adding new samples as columns in a large VCF file containing all the samples you've sequenced so far, which totals ~300,000 samples. So you have a VCF file with ~29,000 lines (roughly the number of total variants you have in the VCF file according to the summary you've shared) and each line has 300,000 columns. Is this correct?

Speed up tips

P.S.: Your report shows a significant number of variants with quality below 20; you may need to filter your variants better and/or improve the quality control in your sequencing.