mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

merging quits at the last step #4

Closed: aseetharam closed this issue 4 years ago

aseetharam commented 4 years ago

Hello,

I'm trying to combine the SVs called using NGMLR+SNIFFLES for 27 individuals. With Iris enabled, everything runs smoothly until it reaches the step that outputs results, where it abruptly quits (after writing results for 1 or 2 chromosomes). Any idea what's wrong?

My command:

java -Xmx200g -Djava.io.tmpdir=$TMPDIR -cp ${JASMINE}/src:${JASMINE}/Iris/src Main file_list=vcf.fofn out_file=jasmine.vcf bam_list=bam.fofn genome_file=reference.fasta threads=16  out_dir=jasmine-temp --run_iris --output_genotypes

The error:

...
Merging graph ID: scaf_37_INS_+-
Merging complete - outputting results
Exception in thread "main" java.lang.NullPointerException
        at VariantOutput$VariantGraph.updateOutputVariant(VariantOutput.java:293)
        at VariantOutput$VariantGraph.processVariant(VariantOutput.java:561)
        at VariantOutput.writeMergedVariants(VariantOutput.java:103)
        at Main.runJasmine(Main.java:76)
        at Main.main(Main.java:22)

(The files are sorted and have not been modified after running SNIFFLES, except to remove non-chr SVs.)

Thanks in advance!

aseetharam commented 4 years ago

Is there a test dataset that I could try to make sure my installation is in order?

Thanks!

mkirsche commented 4 years ago

Hi Arun,

Thank you for your interest! I have just pushed a small test case which can be run with ./smalltest.sh so that you can test your build. As for the error you are encountering, I have been unable to replicate it on the datasets I have available. Is the data you are using publicly available and/or is there a subset of the VCFs which you can share? If so it would be greatly helpful as I look into what's going on.

Best, Melanie

aseetharam commented 4 years ago

Thanks for the reply and for the test dataset! Jasmine runs fine on the test dataset but still fails on my data. I'll see if I can share a small subset of my data with you for further troubleshooting. FYI, the complete stdout is attached below: jasmine_stdout.txt

Thanks,

aseetharam commented 4 years ago

Hi Melanie,

I noticed that your test dataset does not have a header line (the one that starts with #CHROM). Also, after combining all three VCFs, merged.vcf has a single individual. Is this right? My assumption was that I could use Jasmine for joint-calling SVs across multiple individuals. Please let me know if this is correct.

Thanks,

mkirsche commented 4 years ago

Hi Arun,

Thanks for the extra info! If you are able to share some of the data and don't have anywhere to host it, you can email it to me at [github username] at jhu dot edu.

As for the test case, I understand the confusion. This test runs without outputting genotypes, so the SUPP_VEC INFO field is a binary vector which represents presence/absence in each of the individuals (2 in this case), but the per-sample information is not output. In this test case, a.vcf and b.vcf are the inputs and c.vcf is the expected output, against which Jasmine's result is compared after the run. I hope that clears things up a bit!
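To make the presence/absence vector concrete, here is a minimal sketch of reading SUPP_VEC from an INFO string. It assumes that the i-th character of SUPP_VEC corresponds to the i-th input file; the function names and the example INFO string are hypothetical and are not taken from Jasmine's code or test data.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def samples_supporting(info, sample_names):
    """Return the names whose SUPP_VEC bit is '1'.

    Assumption: bit i of SUPP_VEC refers to the i-th entry of sample_names,
    i.e. the inputs are listed in the same order they were given for merging.
    """
    supp_vec = parse_info(info).get("SUPP_VEC")
    if not isinstance(supp_vec, str):
        return []  # field missing or present without a value
    if len(supp_vec) != len(sample_names):
        raise ValueError("SUPP_VEC length does not match the number of samples")
    return [name for name, bit in zip(sample_names, supp_vec) if bit == "1"]


# Made-up INFO string for a variant seen only in the first of two inputs.
example_info = "SVTYPE=INS;SUPP_VEC=10"
present_in = samples_supporting(example_info, ["a.vcf", "b.vcf"])
```

With two inputs, a vector of 10 would mean the variant was found in the first input only, and 11 would mean it was found in both.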

Melanie

mkirsche commented 4 years ago

Hi Arun,

Thanks again for pointing this out! I recently pushed a change based on another user getting a similar error, and their problem was that their input VCFs had multiple variants with the same ID (while the VCF format requires that they be unique). Jasmine now handles such cases more gracefully, outputting a warning message and adding suffixes to the IDs to make them unique for the purposes of merging and outputting results. However, if you do have duplicate IDs I would still advise changing them to be unique within each input file so that you can more easily and accurately trace back the input variants which got merged together.
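For anyone who wants to check their inputs before merging, a minimal sketch of finding duplicate IDs in one VCF follows. It assumes an uncompressed, tab-separated VCF with the ID in the third column; the function name is hypothetical and this is not how Jasmine itself performs the check.

```python
from collections import Counter


def duplicate_ids(vcf_lines):
    """Return {id: count} for every variant ID that occurs more than once.

    vcf_lines is any iterable of text lines, e.g. an open file handle.
    Header lines (starting with '#') and missing IDs ('.') are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # not a complete variant record
        variant_id = cols[2]
        if variant_id != ".":
            counts[variant_id] += 1
    return {vid: n for vid, n in counts.items() if n > 1}
```

Running this on each file named in the file list (for example `duplicate_ids(open(path))`) and getting an empty dict back for all of them would suggest that duplicate IDs are not the cause of the error.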

Please let me know if you're still encountering the error message with the updated code, and if so I'll continue to investigate.

Thank you! Melanie