Regarding improved performance during cosmic.pl annotation phase

cookersjs commented 7 years ago

Hi @morungos,

So VEP has run successfully on the COSMIC v75 CosmicCodingMuts.vcf file via perl cosmic.pl --force. This took around 28 hours on my little laptop, finishing with ~3.3 million variants processed.

The step immediately following VEP uses the output to perform the annotation phase of the program.

The output of this phase is a long run of lines that look like:

Heliotrope.Update.COSMIC 433 - Failed to find gene: ENSG00000182873 - skipping
Heliotrope.Update.COSMIC 433 - Failed to find gene: ENSG00000182873 - skipping
Heliotrope.Update.COSMIC 433 - Failed to find gene: ENSG00000234296 - skipping
Heliotrope.Update.COSMIC 433 - Failed to find gene: ENSG00000234296 - skipping
Heliotrope.Update.COSMIC 433 - Failed to find gene: ENSG00000234296 - skipping
Heliotrope.Update.COSMIC 302 - Processed 10000 lines

The issue I have noticed with this phase is that the program takes increasingly long to work through the VEP output file.

For instance:
First 10000 lines: 1.5 minutes
20000 lines: 4 minutes (per 10000 lines)
50000 lines: 10 minutes (per 10000 lines)
100000 lines: 23 minutes (per 10000 lines)

The file is over 3 million lines long, so by the time it reaches the end of the file, each batch of 10000 lines is going to take an extremely long time. I think the problem is that the annotation phase looks for the next gene ID by starting from the beginning of the file each time, which would explain why each batch takes longer than the last.
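If that guess is right, this is the usual pattern of a linear scan repeated for every record. A minimal Python sketch of the difference follows. The data and function names are made up for illustration; the real annotation code is Perl, and whether it actually rescans like this has not been confirmed.

```python
# Hypothetical illustration only; none of this is taken from the Heliotrope code.

def find_gene_scan(records, gene_id):
    """Look up a gene by walking the list from the start every time: O(n) per lookup."""
    for record in records:
        if record["gene_id"] == gene_id:
            return record
    return None


def build_gene_index(records):
    """Build a dict keyed by gene ID once, so each later lookup is O(1) on average."""
    index = {}
    for record in records:
        # Keep the first record seen for each ID, matching what the scan returns.
        index.setdefault(record["gene_id"], record)
    return index


# Made-up records standing in for gene data.
records = [{"gene_id": f"ENSG{i:011d}", "name": f"gene{i}"} for i in range(1000)]
index = build_gene_index(records)

wanted = "ENSG00000000999"
found_by_scan = find_gene_scan(records, wanted)
found_by_index = index.get(wanted)

# An ID that is absent gives None either way; only the cost of finding that out differs.
missing_by_scan = find_gene_scan(records, "ENSG99999999999")
missing_by_index = index.get("ENSG99999999999")
```

With the scan, the cost of each lookup grows with the amount of data searched, so the total work grows roughly quadratically. With the index, the up-front pass is paid once.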

Do you know if there was supposed to be an index generated, either via VEP or through another program on my system? I have only briefly looked into things that might improve this performance issue, such as tabix or PyVCF.
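To make the idea of an index concrete, here is a rough Python sketch using only the standard library. It is not tabix or PyVCF and does not reproduce their formats or APIs. It records the byte offset of each key in a tab-delimited file so that a later lookup can seek straight to the line. The file layout, column choice and function names are all assumptions for illustration.

```python
import os
import tempfile


def build_offset_index(path, key_column=0):
    """Scan the file once, recording the byte offset of the first line for each key."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"#"):
                continue  # skip header/comment lines
            key = line.rstrip(b"\n").split(b"\t")[key_column].decode()
            offsets.setdefault(key, offset)
    return offsets


def fetch_line(path, offsets, key):
    """Seek straight to the recorded offset instead of rereading from the top."""
    if key not in offsets:
        return None
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        return handle.readline().decode().rstrip("\n")


# Made-up demo file; real VEP output has a different layout.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", newline="", delete=False) as tmp:
    tmp.write("#gene_id\tsymbol\n")
    tmp.write("ENSG00000000001\tAAA\n")
    tmp.write("ENSG00000000002\tBBB\n")
    demo_path = tmp.name

offsets = build_offset_index(demo_path)
second = fetch_line(demo_path, offsets, "ENSG00000000002")
missing = fetch_line(demo_path, offsets, "ENSG00000000003")
os.remove(demo_path)
```

The index costs one full pass over the file, after which each lookup is a dictionary hit plus a single seek.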

Thanks! -cookersjs

morungos commented 7 years ago

I’ve seen issues like this, and on the whole, using VEP in forking mode is now a good plan, as it stops most of the memory leaks that cause this. Forking is a lot more stable than it used to be, and even uses less memory, so I’d start by trying that.



cookersjs commented 7 years ago

When I originally ran it through cosmic.pl, the '--fork 8' option was already part of the command. I compared the speed (in variants processed/s) with 8 and with 4 forks on my computer and didn't see much of an improvement.

Regarding the COSMIC.pm annotation phase script, would looking into Parallel::ForkManager be a logical step towards improving the performance in this section of the code?

morungos commented 7 years ago

If fork isn’t going to cut it, I don’t know. We might just be stuck with it.

Indexing might be worth trying, but I could never build the Perl modules at UHN, they needed C and some weird stuff.

Better might be to run a small set and use some profiling to see if there’s an obvious bottleneck.
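The workflow would be: take a small input, run it under a profiler, and read off which function dominates the cumulative time. For the Perl code itself a Perl profiler such as Devel::NYTProf would be the tool to use. The Python below only illustrates the idea with cProfile, using a deliberately naive, made-up lookup function as the suspect.

```python
import cProfile
import io
import pstats


def slow_lookup(items, wanted):
    """Deliberately naive linear search, standing in for a suspected bottleneck."""
    for item in items:
        if item == wanted:
            return item
    return None


def process_batch(items, queries):
    """Run every query through the naive lookup and count the hits."""
    return sum(1 for query in queries if slow_lookup(items, query) is not None)


items = list(range(2000))
queries = list(range(0, 4000, 2))  # half of these are present in items

profiler = cProfile.Profile()
profiler.enable()
hits = process_batch(items, queries)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
print(report.getvalue())
```

If one function accounts for most of the cumulative time on a small sample, that is the place to look first.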

All the best Stuart



cookersjs commented 7 years ago

So I decided to try forking again, but with fewer forks. The default you had set is 8; I had first tried 4 and saw no difference in performance.

I ran it with only 2 forks and saw a pretty significant jump in VEP's speed. It seems I overestimated how much power the laptop-run VM had by trying even four forks.

I left it to run overnight, so I'll update as necessary.
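One way to avoid asking for more forks than the machine can serve is to cap the requested number at the CPU count the VM reports. This is a sketch of a heuristic, not a measured recommendation: the helper name is hypothetical, and the best value still has to be found by experiment, as it was above.

```python
import os


def suggest_fork_count(requested, available=None):
    """Cap the requested number of forks at the CPUs the machine (or VM) reports."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available))


# Example: asking for 8 forks on a VM that exposes 2 CPUs gives 2.
capped = suggest_fork_count(8, available=2)
print(capped)
```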

Remimstr commented 7 years ago

This issue was moved to lstein/Heliotrope#2

oicr-ibc / heliotrope

Regarding improved performance during cosmic.pl annotation phase #73