projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Improve docs: Document usage of schema when reading/writing VCF #473

Closed: Hoeze closed this issue 5 months ago

Hoeze commented 2 years ago

Specifying a schema when reading is super useful for cleaning up malformed VCFs.

Example:

from pyspark.sql import types as t

# Assumes an active SparkSession (`spark`) with Glow available.
default_vcf_schema = t.StructType([
    t.StructField("contigName", t.StringType()),
    t.StructField("start", t.LongType()),
    t.StructField("end", t.LongType()),
    t.StructField("names", t.ArrayType(t.StringType())),
    t.StructField("referenceAllele", t.StringType()),
    t.StructField("alternateAlleles", t.ArrayType(t.StringType())),
    t.StructField("qual", t.DoubleType()),
    t.StructField("filters", t.ArrayType(t.StringType())),
    t.StructField("splitFromMultiAllelic", t.BooleanType()),
    t.StructField("attributes", t.MapType(t.StringType(), t.StringType())),
    t.StructField("genotypes", t.ArrayType(t.StructType([
        t.StructField("sampleId", t.StringType()),
        t.StructField("conditionalQuality", t.IntegerType()),
#         t.StructField("MQ0", t.IntegerType()),
        t.StructField("alleleDepths", t.ArrayType(t.IntegerType())),
#         t.StructField("PID", t.StringType()),
        t.StructField("phased", t.BooleanType()),
        t.StructField("calls", t.ArrayType(t.IntegerType())),
#         t.StructField("PGT", t.StringType()),
#         t.StructField("phredLikelihoods", t.ArrayType(t.IntegerType())),
        t.StructField("depth", t.IntegerType()),
#         t.StructField("AB", t.DoubleType())
    ])))
])

# vcf_files may be a single path or a list of paths
df = (
    spark
    .read
    .format('vcf')
    .schema(default_vcf_schema)
    .load(vcf_files)
)
williambrandler commented 2 years ago

Thanks @Hoeze, do you deal with a lot of malformed schemas?

You can also read a VCF that does conform and use df.schema to get the schema.
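
For example (a minimal sketch; the file path is a placeholder):

# Read a known-good VCF once and capture its schema as a StructType;
# that schema can then be passed to .schema() when reading malformed files.
ref_df = spark.read.format('vcf').load('/path/to/conforming.vcf')  # placeholder path
vcf_schema = ref_df.schema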

The challenge is that schemas vary considerably depending on which INFO, FORMAT, and FILTER fields are present.

If VCFs often fail to conform, it is prudent to reprocess them from BAM or FASTQ, or to write a script that modifies them before loading into Spark.
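
A minimal sketch of such a script, assuming (hypothetically) that the issue is a single wrong Type declaration in the header:

import gzip

# Hypothetical fix-up: correct a wrong Type declaration for the DP FORMAT
# field before handing the file to Spark. Adjust the match and replacement
# to whatever is actually broken in your headers.
def fix_header(in_path, out_path):
    with gzip.open(in_path, 'rt') as src, gzip.open(out_path, 'wt') as dst:
        for line in src:
            if line.startswith('##FORMAT=<ID=DP,'):
                line = line.replace('Type=String', 'Type=Integer')
            dst.write(line)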

Hoeze commented 2 years ago

@williambrandler Yes, that's exactly how I got my default schema. My concrete problem was that some VCFs had (seemingly hand-edited) headers in which the field data types differed.

It took me a while to figure out that the best way to fix this is simply to specify .schema() before loading them :)

henrydavidge commented 5 months ago

This is a good suggestion. I'll incorporate it into the docs. Thanks @Hoeze.