pcingola / SnpEff


multithreaded snpEff or merging existing and new vcf #285

Closed intikhab closed 3 years ago

intikhab commented 3 years ago

Dear @SnpEff team,

We process a large number of isolates of a virus daily and produce an annotated VCF using SnpEff. The number of samples is now reaching 300,000, and annotating the VCF with SnpEff takes more than half a day. Is it possible to incrementally add newly SnpEff-annotated samples to the existing large SnpEff-annotated VCF?

Or could we process the combined big VCF through SnpEff using multiple threads? The -t option always fails in the latest version.

Any suggestions?

Thanks,

IA

pcingola commented 3 years ago

How do you process these 300,000 samples per day?

Additional information:

intikhab commented 3 years ago

Dear Pablo,

We save a SnpEff-annotated VCF from the latest data (a small set of samples from the daily updates) and the VCF processed earlier (a large VCF in which all previous samples are stored).

If we could combine these two VCFs, it could save some time. At present we run SnpEff afresh on one raw VCF of all samples (small + large sample set), so that we obtain a final annotated VCF, the summary HTML, and the genes text file.

An example summary file is shown at the link below:

https://www.cbrc.kaust.edu.sa/covmt/data/Variants/snpEff_summary.html?

Exact command line looks like:

SnpEff -no-upstream -no-utr -no-downstream -classic -i vcf Ref Raw_vcf

Daily updates sometimes have 100 samples, sometimes >5,000 samples. The total is now touching ~300,000 samples.

We increased the memory of the server running SnpEff, but it still takes ~3-6 hours.
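[Editor's note: much of that time is likely spent parsing the ~300,000 genotype columns, which SnpEff does not need for effect annotation; the effect depends only on the fixed site columns (CHROM, POS, REF, ALT, ...). A toy illustration of the idea below, using `cut` on a made-up 3-sample VCF; a real pipeline would use `bcftools view -G` instead, and the file names here are hypothetical.]

```shell
# Build a toy 3-sample VCF (tab-separated), then keep only the 8 fixed
# site columns that SnpEff actually needs for effect annotation.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' > toy.vcf
printf 'NC_045512.2\t241\t.\tC\tT\t60\tPASS\t.\tGT\t1\t1\t0\n' >> toy.vcf
printf 'NC_045512.2\t3037\t.\tC\tT\t55\tPASS\t.\tGT\t1\t0\t1\n' >> toy.vcf

# Drop FORMAT and all genotype columns (columns 9 onward)
cut -f1-8 toy.vcf > toy.sites.vcf
cat toy.sites.vcf
```

The stripped file has one short line per variant site (~29,000 sites in this dataset, per the shared summary) instead of lines with 300,000 genotype columns, so SnpEff's run time stops growing with the sample count.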

I tried the multithreaded option in a previous run, but it does not work.

If we could merge the existing large SnpEff VCF from the previous day's processing with the new small SnpEff VCF from the daily update, it could save us some time.
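[Editor's note: one way to get an incremental merge plus annotation is sketched below. This is not an official SnpEff workflow; the file names are hypothetical, and it assumes bcftools and bgzip are available and the inputs are bgzipped and indexed. The idea: merge only the raw genotype columns, run SnpEff on a sites-only copy, then copy the ANN field back onto the merged file.]

```shell
# 1. Merge the new samples' raw VCF into yesterday's raw multi-sample VCF
bcftools merge all_previous.raw.vcf.gz new_samples.raw.vcf.gz -Oz -o combined.raw.vcf.gz
bcftools index -t combined.raw.vcf.gz

# 2. Annotate a sites-only copy (-G drops all genotype columns, so SnpEff
#    parses ~29,000 short site lines instead of 300,000-column lines).
#    "Ref" is the genome database name, as in the command above.
bcftools view -G combined.raw.vcf.gz > sites.vcf
java -jar snpEff.jar -no-upstream -no-utr -no-downstream -classic -i vcf Ref sites.vcf > sites.ann.vcf
bgzip -f sites.ann.vcf && bcftools index -t sites.ann.vcf.gz

# 3. Transfer the ANN field onto the full multi-sample VCF
bcftools annotate -a sites.ann.vcf.gz -c INFO/ANN combined.raw.vcf.gz -Oz -o combined.ann.vcf.gz
```

The summary HTML and genes file then come from the small sites-only run in step 2; per-variant statistics should be unchanged, but anything computed per genotype would differ.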

Best,

IA

--

Intikhab Alam, PhD

Research Scientist, Computational Bioscience Research Centre (CBRC), Building #3, Office #4328, 4700 King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, KSA W: http://www.kaust.edu.sa T: +966 (0) 2 808-2423 F: +966 (2) 802 0127



pcingola commented 3 years ago

If I understand correctly, you are adding new samples as columns in a large VCF file containing all the samples you've sequenced so far, which totals ~300,000 samples. So you have a VCF file with ~29,000 lines (roughly the number of total variants you have in the VCF file according to the summary you've shared) and each line has 300,000 columns. Is this correct?

Speed up tips

P.S.: Your report shows a significant number of variants with quality below 20; you may need to filter your variants better and/or improve the quality control in your sequencing.