Closed: @Hoeze closed this issue 5 months ago
Thanks @Hoeze, do you deal with a lot of malformed schemas?
You can also read a VCF that does conform and use df.schema to get the schema.
The challenge is that schemas vary considerably depending on which INFO, FORMAT, and FILTER fields are present.
If VCFs do not conform, it is often prudent to regenerate them from BAM or FASTQ, or to write a script that fixes them before loading into Spark.
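As a minimal sketch of the "write a script to fix them" route, here is a stdlib-only helper that rewrites the `Type=` declaration of a single INFO field in a VCF header line. The function name, field IDs, and the assumption that only the `Type=` token needs correcting are all illustrative, not from this thread:

```python
import re

def fix_info_type(header_line: str, field_id: str, correct_type: str) -> str:
    """Rewrite the Type= declaration of one INFO field in a VCF header line.

    Hypothetical helper: only touches ##INFO lines whose ID matches,
    leaving all other header lines untouched.
    """
    if header_line.startswith("##INFO=") and f"ID={field_id}," in header_line:
        # Replace the first Type=<word> token with the corrected type.
        return re.sub(r"Type=\w+", f"Type={correct_type}", header_line, count=1)
    return header_line

# Example: a hand-edited header declared AF as String instead of Float.
line = '##INFO=<ID=AF,Number=A,Type=String,Description="Allele Frequency">'
fixed = fix_info_type(line, "AF", "Float")
# fixed: '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
```

Running a pass like this over each file's header block before handing the files to Spark keeps the declared types consistent across files.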
@williambrandler Yes, that's exactly how I got my default schema. My concrete problem was that some VCFs had (seemingly hand-edited) headers in which the declared field data types differed between files.
It took me a while to figure out that the best fix for this issue is simply to specify .schema() before loading them :)
This is a good suggestion. I'll incorporate it into the docs. Thanks @Hoeze.
Super useful to clean up malformed VCFs.
Example: