Parallel instances

gitMakeCoffee commented 1 year ago

Hello,

I have been running PCGR (v1.4.1, GRCh37) on some clinical samples. These were just tests, so I didn't set up a proper pipeline workflow; I simply used xargs to run the samples in parallel:

# Sample names are stored in names.txt (About 16 samples)
cat names.txt | \
xargs -i -P 4 pcgr --assay WES --tumor_only --exclude_dbsnp_nonsomatic --estimate_tmb \
--input_vcf {}/{}-.vcf.gz --pcgr_dir pcgr/ --genome_assembly grch37 \
--sample_id {} --tumor_site 25 --output_dir RESULTS/

The command above runs without errors and generates reports for all samples.

However, upon closer inspection, some outputs (including reports) for different samples are exactly identical, as if they had been mixed up. I double-checked the input VCF files, which were of course very different.

Running xargs without the -P 4 option (i.e., running all samples sequentially) fixes the problem. In other words, this seems to be linked to running multiple PCGR instances in parallel.

Is this a known issue? Thanks.

pdiakumis commented 1 year ago

Thanks for reporting @gitMakeCoffee - it's definitely not a known issue! I haven't personally tried to run PCGR in parallel over multiple samples like that since we use it as part of a production pipeline setup in the cloud, but will keep it in mind next time I'm testing locally.

Were the problematic VCF + HTML outputs written into RESULTS/ with the correct (per-sample) prefixes, but with identical results?

gitMakeCoffee commented 1 year ago

Thank you for the reply. It is worth mentioning that I installed PCGR through Conda/Bioconda. Indeed, using a pipeline would be ideal, but this was a small project and I just wanted to test PCGR, so I ran it by hand.

The VCF and HTML outputs were written with the correct sample names. However, some samples had exactly the same file sizes. Upon closer inspection, the pcgr_acmg report has the correct sample name in the header, but scrolling down reveals that some samples have exactly the same statistics and variants, although the input VCF files are different. Furthermore, a few samples had no output at all (no VCF, no HTML report).

I tried rerunning the command to see whether this was a random accident. Unfortunately, the result is the same whenever PCGR runs in parallel, and the problem goes away when the samples run in sequence (xargs without -P, or a loop). Does PCGR generate intermediate files under the hood? Maybe those end up overwriting each other when multiple instances are running?
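One way to probe the intermediate-file hypothesis raised above is to keep the parallelism but give every instance its own output directory. The following is a minimal sketch, not a confirmed fix: it assumes PCGR writes its intermediate files under --output_dir (nothing in this thread confirms that), it reuses only the options from the original command, and the names run_all and PCGR_BIN, as well as the per-sample log file, are hypothetical and introduced purely for illustration.

```shell
# Sketch: run PCGR in parallel with one output directory per sample.
# Assumption (untested): isolating --output_dir keeps parallel instances
# from overwriting each other's intermediate files.
# PCGR_BIN is a hypothetical override, e.g. PCGR_BIN=echo for a dry run.

run_all() {
    names_file="$1"      # one sample name per line
    jobs="${2:-4}"       # number of parallel instances

    # xargs starts one small shell per sample; "$1" inside it is the sample name.
    PCGR_BIN="${PCGR_BIN:-pcgr}" xargs -P "$jobs" -I {} sh -c '
        sample="$1"
        outdir="RESULTS/${sample}"
        mkdir -p "$outdir"
        # Same options as the original command, except for the output directory.
        "$PCGR_BIN" --assay WES --tumor_only --exclude_dbsnp_nonsomatic --estimate_tmb \
            --input_vcf "${sample}/${sample}-.vcf.gz" --pcgr_dir pcgr/ \
            --genome_assembly grch37 --sample_id "$sample" --tumor_site 25 \
            --output_dir "$outdir" > "${outdir}/${sample}.log" 2>&1
    ' _ {} < "$names_file"
}
```

Calling `run_all names.txt 4` would then leave each sample's files and a log under RESULTS/<sample>/. If the mix-ups disappear with this layout, that would point towards a shared working location as the cause; if they persist, the collision is probably somewhere else.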

sigven commented 1 year ago

Very interesting observation @gitMakeCoffee; Peter and I have discussed it a bit already. And yes, PCGR generates many intermediate files under the hood, so there might very well be some weaknesses there. Generally, I think the most likely weakness (i.e. the one causing issues when running in parallel) is in the last step of PCGR (reporting with RMarkdown); the first part should, in general, be more robust when it comes to handling sample-specific output. On that note: have you looked at the log files for the samples that did not produce any VCFs (here I refer to the PCGR-annotated VCFs, containing the pcgr_acmg tag)? Also, are some of the pcgr_acmg VCF files from different samples (with different query VCFs) identical?
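To find out quickly which samples never produced an annotated output, and therefore which log files to read first, a small helper along these lines may be useful. It is only a sketch: the function name missing_outputs is hypothetical, and the glob assumes that annotated outputs are named SAMPLE.<something>pcgr_acmg<something>, mirroring the SAMPLE. prefix of the temporary files; adjust the pattern if the real names differ.

```shell
# Sketch: print every sample from a name list that has no pcgr_acmg output.
# Assumption: output files start with "<sample>." and contain "pcgr_acmg"
# somewhere in the file name.

missing_outputs() {
    names_file="$1"              # one sample name per line
    results_dir="${2:-RESULTS}"  # directory holding the PCGR outputs

    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue        # skip blank lines
        found=no
        for f in "$results_dir"/"$sample".*pcgr_acmg*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || printf '%s\n' "$sample"
    done < "$names_file"
}
```

Running `missing_outputs names.txt RESULTS` prints one sample name per line for each sample without a matching file.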

Thanks again for reporting this, very valuable for us when it comes to improving the intermediate file handling. I am confident we will get to the bottom of it, and resolve it eventually :-)

best, Sigve

gitMakeCoffee commented 1 year ago

Thanks.

Regarding the samples without an HTML report, I do not have pcgr_acmg files. However, I do have tmp VCF files (and one index) with the following name format: SAMPLE.pcgr_ready.tmp2.vcf.gz, SAMPLE.pcgr_ready.tmp2.vcf.gz.tbi, SAMPLE.pcgr_ready.tmp3.vcf.gz
Regarding the pcgr_acmg files for the mixed-up samples with an HTML report, the file sizes are practically identical, for example 125,984 kB vs 125,978 kB. Running diff and wc -l on the gunzipped files finds 6166 differing lines. Inspecting those lines with less shows that only a couple are header lines (containing the commands and filenames); most are actual variant records.
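The header-versus-variant breakdown described above can also be obtained without paging through the diff by hand. The function below is a rough sketch (the name vcf_diff_summary is hypothetical): it decompresses two gzipped VCFs, diffs them, and counts the differing lines that start with "#" separately from the differing variant records. It counts only the lines diff marks with "<" or ">", so the totals will be lower than a plain `diff | wc -l`, which also counts diff's own hunk markers.

```shell
# Sketch: summarize how two gzipped VCFs differ, split into header lines
# (starting with "#") and variant records.

vcf_diff_summary() {
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    gzip -dc "$1" > "$a"
    gzip -dc "$2" > "$b"

    # diff prefixes lines found only in the first file with "< " and lines
    # found only in the second file with "> "; hunk markers are ignored.
    diff "$a" "$b" | awk '
        /^[<>] #/ { header++; next }
        /^[<>] /  { variant++ }
        END { printf "header=%d variant=%d\n", header, variant }'

    rm -f "$a" "$b"
}
```

For example, `vcf_diff_summary A.vcf.gz B.vcf.gz` (with A and B standing in for two pcgr_acmg files) prints a single line of the form `header=N variant=M`.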

Sorry I can't share the output files, as these are sensitive clinical data. However, maybe these issues could be replicated with public VCF files.

Please let me know if you have any more questions, I'd be glad to help.

sigven / pcgr

Parallel instances #214