mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License

Support for vcf.gz files #31

Open fgvieira opened 2 years ago

fgvieira commented 2 years ago

It seems Jasmine does not support vcf.gz or bcf files:

Warning: input.vcf.gz ends with .gz, but (b)gzipped VCFs are not accepted
Exception in thread "main" java.lang.Exception: input.vcf.gz is a gzipped file, but only unzipped VCFs are accepted

Since these are quite standard formats, would it be possible for Jasmine to support both vcf.gz and bcf files? Thanks,

mkirsche commented 2 years ago

Hi,

Thanks for the suggestion! Unfortunately, adding support for vcf.gz and .bcf files would require fairly extensive software changes. Since the majority of SV calling software produces unzipped VCF files, there are no plans to add this in the near future.

Melanie

fgvieira commented 2 years ago

I understand that it might be a bit of work, but maybe you could use an existing library to read the VCF files, like htsjdk (developed by the Broad Institute).

At this point htsjdk has only partial support for VCF (VCFv4.3 can be read but not written, and there is no support for BCFv2.2), but at least it can read and write VCFv4.2 (both plain-text and gzipped versions). And once the rest is implemented, Jasmine will automatically support it!
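For comparison, gzipped text input can also be read with the JDK alone, without htsjdk. The following is only a minimal sketch of that idea and not Jasmine's actual code: `VcfOpener`, `open`, and `countRecords` are hypothetical names, and the assumption is that the reader only needs to stream the file line by line. It checks the two gzip magic bytes instead of trusting the file extension. Files written by bgzip are a series of concatenated gzip members, which `GZIPInputStream` is generally able to read through, but that should be verified on real data. BCF and index-based random access are out of scope here.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of Jasmine or htsjdk.
class VcfOpener {

    // Opens a VCF as text, decompressing on the fly when the file
    // starts with the gzip magic bytes (0x1f 0x8b).
    static BufferedReader open(Path path) throws IOException {
        InputStream in = new BufferedInputStream(Files.newInputStream(path));
        try {
            in.mark(2);
            int first = in.read();
            int second = in.read();
            in.reset(); // rewind so the chosen reader sees the whole file
            boolean gzipped = (first == 0x1f && second == 0x8b);
            InputStream data = gzipped ? new GZIPInputStream(in) : in;
            return new BufferedReader(new InputStreamReader(data, StandardCharsets.UTF_8));
        } catch (IOException e) {
            in.close(); // do not leak the file handle if setup fails
            throw e;
        }
    }

    // Counts non-header, non-empty lines, i.e. variant records.
    static long countRecords(Path path) throws IOException {
        try (BufferedReader reader = open(path)) {
            return reader.lines()
                    .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java VcfOpener <file.vcf or file.vcf.gz>");
            return;
        }
        System.out.println(countRecords(Paths.get(args[0])) + " records");
    }
}
```

With something like this, the same parsing code could receive a `BufferedReader` for either `input.vcf` or `input.vcf.gz`. Whether that fits into Jasmine's existing input handling is not something this thread establishes.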

tnguyengel commented 4 months ago

I would like to bump this. Unzipping VCFs for large datasets is highly undesirable in terms of storage costs. Most bioinformatics tools can operate on either compressed VCFs or some other lightweight binary format, which limits the reusability of the unzipped VCFs. Compression or binary support would be very much appreciated!