molgenis / systemsgenetics

Generic Java genotype reader / writer, QTL mapping software, Strand alignment tool
https://github.com/molgenis/systemsgenetics/wiki
GNU General Public License v3.0

Error when converting .vcf file to .bgen: "Phased data not available" #632

Closed: DanKaptijn closed this issue 2 years ago

DanKaptijn commented 2 years ago

Hi,

As the title says, I was trying to convert a .vcf genotype file (compressed with bgzip and indexed with tabix) to .bgen format. However, the process fails with the error message: "Phased data not available".

The command was:

java -jar GenotypeHarmonizer.jar \
    --input /home/umcg-dkaptijn/my_files/test_dataset.vcf.gz \
    --output /home/umcg-dkaptijn/my_files/test_dataset \
    --outputType bgen

The log file then shows:

INFO - Version: 1.4.23
INFO - Current date and time: 2022-10-12 13:54:58
INFO - Log level: INFO
INFO - Input base path: /home/umcg-dkaptijn/my_files/test_dataset.vcf.gz 
INFO - Input data type: VCF file
INFO - Output base path: /home/umcg-dkaptijn/my_files/test_dataset
INFO - Output data type: Oxford Binary GEN / SAMPLE files
INFO - Reference base path not set, not performing harmonization.
INFO - Minimum posterior probability for input data: 0.4
INFO - LD checker off
INFO - Force input sequence name: not forcing
INFO - Debug mode: off
INFO - Input data loaded
INFO - No reference specified. Do conversion without alignment
INFO - Writing results
WARN - WARNING!!! writing dosage genotype data to .gen posterior probabilities file. Using heuristic method to convert to probabilities, this is not guaranteed to be accurate. See manual for more details.
INFO - Writing BGEN file /home/umcg-dkaptijn/my_files/test_dataset.bgen and sample file /home/umcg-dkaptijn/my_files/test_dataset.sample
FATAL - GenotypeDataException: Error writing output data: Phased data not available
org.molgenis.genotype.GenotypeDataException: Phased data not available
    at org.molgenis.genotype.vcf.VcfGenotypeData.getSampleProbabilitiesPhased(VcfGenotypeData.java:430)
    at org.molgenis.genotype.variant.sampleProvider.CachedSampleVariantProvider.getSampleProbabilitiesPhased(CachedSampleVariantProvider.java:146)
    at org.molgenis.genotype.variant.ReadOnlyGeneticVariant.getSampleGenotypeProbabilitiesPhased(ReadOnlyGeneticVariant.java:295)
    at org.molgenis.genotype.bgen.BgenGenotypeWriter.getPhasedGenotypeDataBlockByteBuffer(BgenGenotypeWriter.java:474)
    at org.molgenis.genotype.bgen.BgenGenotypeWriter.getGenotypeDataBlock(BgenGenotypeWriter.java:328)
    at org.molgenis.genotype.bgen.BgenGenotypeWriter.writeBgenFile(BgenGenotypeWriter.java:252)
    at org.molgenis.genotype.bgen.BgenGenotypeWriter.write(BgenGenotypeWriter.java:66)
    at org.molgenis.genotype.bgen.BgenGenotypeWriter.write(BgenGenotypeWriter.java:56)
    at nl.umcg.deelenp.genotypeharmonizer.GenotypeHarmonizer.main(GenotypeHarmonizer.java:434)
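
For anyone hitting the same error, it can help to first confirm whether the input VCF actually contains phased calls. In the VCF format, a GT value such as `0|1` is phased and `0/1` is unphased. The following is a minimal sketch, not part of Genotype Harmonizer: the function names and the `max_records` parameter are hypothetical, and it only inspects the GT separator in the first records of the file.

```python
import gzip


def count_gt_phasing_lines(lines, max_records=1000):
    """Count phased ('|') and unphased ('/') GT calls in VCF text lines.

    Hypothetical helper for illustration; only the GT separator is inspected.
    """
    phased = 0
    unphased = 0
    records = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # this record has no GT field
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue
            genotype = parts[gt_index]
            if "|" in genotype:
                phased += 1
            elif "/" in genotype:
                unphased += 1
        records += 1
        if records >= max_records:
            break
    return phased, unphased


def count_gt_phasing(vcf_path, max_records=1000):
    """Open a plain or gzip/bgzip-compressed VCF and count GT separators."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        return count_gt_phasing_lines(handle, max_records)
```

Calling `count_gt_phasing("test_dataset.vcf.gz")` returns a `(phased, unphased)` tuple. A result with zero phased calls would suggest the file holds only unphased genotypes.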

Would it be possible to add a flag that allows the conversion of unphased data?

Thanks in advance, Dan

CAWarmerdam commented 2 years ago

The exception is already resolved in the current GitHub version, but no new release had been made, hence Dan's issue.

I propose that we make a new release of Genotype Harmonizer and bump the version number.

In addition, I have made it possible to request a preferred genotype FORMAT field from which to read genotype data in VCF files (see the linked pull request).
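
To illustrate what choosing a preferred FORMAT field means at the level of a single VCF record, here is a rough sketch. It is not the code from the pull request and does not reflect Genotype Harmonizer's actual option names. The function `pick_format_value` and its default preference order are assumptions made for this example. It relies only on the VCF convention that the FORMAT column lists colon-separated keys (for example `GT:DS:GP`) and each sample column lists values in the same order.

```python
def pick_format_value(format_column, sample_column, preferred=("GP", "DS", "GT")):
    """Return (key, value) for the first preferred FORMAT key present in a sample.

    Hypothetical helper for illustration only. Trailing sample values may be
    omitted in VCF, so keys without a value are treated as unavailable.
    """
    keys = format_column.split(":")
    values = sample_column.split(":")
    available = dict(zip(keys, values))  # zip stops at the shorter list
    for key in preferred:
        value = available.get(key)
        if value is not None and value != ".":
            return key, value
    return None  # none of the preferred fields is present for this sample


# Example: prefer genotype probabilities, fall back to dosage, then hard calls.
choice = pick_format_value("GT:DS:GP", "0|1:1.0:0,1,0")
```

Here `choice` is the GP entry because it comes first in the preference order. Passing `preferred=("DS",)` would select the dosage value instead.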

PatrickDeelen commented 2 years ago

New release can be found here: https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer-Download