projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Cannot flatten info field names containing points #371

Open Hoeze opened 3 years ago

Hoeze commented 3 years ago

Hi, I just tried to load a VCF file whose INFO field names contain periods:

##INFO=<ID=Func.refGene,Number=.,Type=String,Description="Func.refGene annotation provided by ANNOVAR">
##INFO=<ID=Gene.refGene,Number=.,Type=String,Description="Gene.refGene annotation provided by ANNOVAR">
##INFO=<ID=GeneDetail.refGene,Number=.,Type=String,Description="GeneDetail.refGene annotation provided by ANNOVAR">
##INFO=<ID=ExonicFunc.refGene,Number=.,Type=String,Description="ExonicFunc.refGene annotation provided by ANNOVAR">
##INFO=<ID=AAChange.refGene,Number=.,Type=String,Description="AAChange.refGene annotation provided by ANNOVAR">
##INFO=<ID=Xref.refGene,Number=.,Type=String,Description="Xref.refGene annotation provided by ANNOVAR">
##INFO=<ID=Func.ensGene,Number=.,Type=String,Description="Func.ensGene annotation provided by ANNOVAR">
##INFO=<ID=Gene.ensGene,Number=.,Type=String,Description="Gene.ensGene annotation provided by ANNOVAR">
##INFO=<ID=GeneDetail.ensGene,Number=.,Type=String,Description="GeneDetail.ensGene annotation provided by ANNOVAR">
##INFO=<ID=ExonicFunc.ensGene,Number=.,Type=String,Description="ExonicFunc.ensGene annotation provided by ANNOVAR">
##INFO=<ID=AAChange.ensGene,Number=.,Type=String,Description="AAChange.ensGene annotation provided by ANNOVAR">
##INFO=<ID=Xref.ensGene,Number=.,Type=String,Description="Xref.ensGene annotation provided by ANNOVAR">
##INFO=<ID=Func.knownGene,Number=.,Type=String,Description="Func.knownGene annotation provided by ANNOVAR">
##INFO=<ID=Gene.knownGene,Number=.,Type=String,Description="Gene.knownGene annotation provided by ANNOVAR">
##INFO=<ID=GeneDetail.knownGene,Number=.,Type=String,Description="GeneDetail.knownGene annotation provided by ANNOVAR">
##INFO=<ID=ExonicFunc.knownGene,Number=.,Type=String,Description="ExonicFunc.knownGene annotation provided by ANNOVAR">
##INFO=<ID=AAChange.knownGene,Number=.,Type=String,Description="AAChange.knownGene annotation provided by ANNOVAR">
##INFO=<ID=Xref.knownGene,Number=.,Type=String,Description="Xref.knownGene annotation provided by ANNOVAR">
##INFO=<ID=ALLELE_END,Number=0,Type=Flag,Description="Flag the end of ANNOVAR annotation for one alternative allele">

The following script fails on this VCF:

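import glow
from pyspark.sql import functions as f

# `spark` is assumed to be an existing SparkSession with the Glow package loaded;
# `snakemake` is the object Snakemake injects into scripts.
df = (
    spark
    .read
    .option("flattenInfoFields", True)  # one INFO_* column per INFO field
    .format('vcf')
    .load(snakemake.input['input_vcf'])
)
# Keep only the requested chromosomes.
df = df.where(
    f.col('contigName').isin(snakemake.params['chroms'])
)
df = glow.transform("split_multiallelics", df)
df = glow.transform("normalize_variants", df, reference_genome_path=snakemake.input['reference_fasta'])
# Lift the variants over and write them out as a single VCF file.
lifted_df = glow.transform('lift_over_variants', df, chain_file=snakemake.input['chain_file'], reference_file=snakemake.input['reference_fasta'])
lifted_df.write.format("bigvcf").save(snakemake.output['lifted_vcf'])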

I'm using PySpark 3.1.1 with "io.projectglow:glow-spark3_2.12:1.0.0".

karenfeng commented 3 years ago

I ran a test, and it looks like the VCF header is being parsed properly. My best guess is that, because there is a period in the INFO ID, Spark is treating the name as a reference to a field nested in a struct. Without a stack trace, I can't tell which error you're hitting or where it occurs. Could you provide more detail?
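To check whether a particular VCF is likely to hit this, one could scan its header for INFO IDs that contain a period. The following is only a minimal sketch in plain Python: `dotted_info_ids` and `INFO_ID_PATTERN` are hypothetical names, not part of Glow, and the sketch assumes the header lines have already been read as text.

```python
import re

# Captures the ID attribute of a VCF INFO header line,
# e.g. ##INFO=<ID=Func.refGene,Number=.,...>
INFO_ID_PATTERN = re.compile(r"^##INFO=<ID=([^,>]+)")


def dotted_info_ids(header_lines):
    """Return the INFO IDs that contain a period, in order of appearance."""
    ids = []
    for line in header_lines:
        match = INFO_ID_PATTERN.match(line)
        if match and "." in match.group(1):
            ids.append(match.group(1))
    return ids


header = [
    '##INFO=<ID=Func.refGene,Number=.,Type=String,Description="Func.refGene annotation provided by ANNOVAR">',
    '##INFO=<ID=ALLELE_END,Number=0,Type=Flag,Description="Flag the end of ANNOVAR annotation for one alternative allele">',
]
print(dotted_info_ids(header))
```

With `flattenInfoFields` enabled, each ID reported here would presumably become an `INFO_`-prefixed column whose name contains a period.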

Hoeze commented 3 years ago

I have to ask my colleague whether he can reproduce it, but the error was indeed caused by Spark treating a field as a struct. It was triggered by an expression like:

'`INFO_AAChange.refGene`'
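A possible workaround, not confirmed in this thread, would be to rename the flattened columns so that they no longer contain periods before applying any transforms. The sketch below only computes the new names; `sanitize_column_names` is a hypothetical helper, and the column list is an illustrative assumption based on the header above.

```python
def sanitize_column_names(names, replacement="_"):
    """Replace periods in column names so they cannot be read as struct access."""
    new_names = [name.replace(".", replacement) for name in names]
    # Refuse to continue if the resulting list contains duplicate names.
    if len(set(new_names)) != len(new_names):
        raise ValueError("renaming would produce duplicate column names")
    return new_names


columns = ["contigName", "start", "INFO_AAChange.refGene", "INFO_ALLELE_END"]
print(sanitize_column_names(columns))
```

In PySpark the result could then be applied with `df = df.toDF(*sanitize_column_names(df.columns))`. This is untested against Glow, and it changes the field names, so any VCF written afterwards would probably no longer carry the original INFO IDs.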