googlegenomics / gcp-variant-transforms

GCP Variant Transforms
Apache License 2.0
135 stars 55 forks

Issue with vcf files failing #119

Closed andrewelamb closed 6 years ago

andrewelamb commented 6 years ago

Hi, we have some VCF files that are all failing for different reasons. They were all submitted by different teams and generated with a variety of software. I'm wondering if someone could help figure out what the issues are based on the errors we are getting back.

For example: Workflow failed. Causes: (e289b6735b30b391): S01:ReadFromVcf/Read+FilterVariants/ApplyFilters+VariantToBigQuery/ConvertToBigQueryTableRow+VariantToBigQuery/WriteToBigQuery/NativeWrite failed., (46d35657c88f910f): BigQuery import job \"dataflow_job_15413812696439016123-B\" failed., (46d35657c88f977c): BigQuery job \"dataflow_job_15413812696439016123-B\" in project \"neoepitopes\" finished with error(s): errorResult: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: call.AD.

It would seem that there is a problem with the AD field.

The header line looks like:

FORMAT=

I assume the rows start right after the column headers; if so:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

x xxxxx . x x . PASS . GT:DP:AD 0/1:175:114,61 0/0:119:119,0

I had to x some of the data as that is proprietary.

arostamianfar commented 6 years ago

It looks like the issue is that the header definition contains three '###' instead of '##' as required by the VCF spec, so it's not being recognized by our tool. Is that the only field with three '#'s? Is it possible for you to fix the header in the original files? If not, please let me know and I can explain a workaround using --representative_header_file that you can use (I still need to document this :) ). Having said that, we are actively working on automatically 'detecting and correcting' these kinds of malformed VCF files in Issue #101.
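If it helps, the header cleanup could be done mechanically — a minimal sketch (not part of gcp-variant-transforms), assuming the only defect is a leading '###' where the VCF spec requires '##':

```python
import re

def fix_meta_prefix(line):
    """Collapse a leading run of three or more '#' down to the '##'
    the VCF spec requires for meta-information lines. The single-'#'
    column-header line and data lines are left untouched."""
    return re.sub(r'^#{3,}', '##', line)
```

Applying this to every line of the file before import should be safe, since only lines that start with at least three '#'s are modified.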

andrewelamb commented 6 years ago

There are a few header lines with three '#'s. I'll see if fixing those fixes the issue. Thanks!

andrewelamb commented 6 years ago

If you don't mind, I have another couple of errors that I'm not sure how to fix.

The Google Cloud error:

(ec831c06232ceaa2): Workflow failed. Causes: (c083238303c31d9b): S01:ReadFromVcf/Read+FilterVariants/ApplyFilters+VariantToBigQuery/ConvertToBigQueryTableRow+VariantToBigQuery/WriteToBigQuery/NativeWrite failed., (ec831c06232cec53): BigQuery import job \"dataflow_job_5744673768994726048-B\" failed., (ec831c06232ceabc): BigQuery job \"dataflow_job_5744673768994726048-B\" in project \"neoepitopes\" finished with error(s): errorResult: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON parsing error in row starting at position 0: Repeated field must be imported as a JSON array. Field: call.FA.

The header:

FORMAT=

The line:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

x x x x x . 6 DB GT:AD:BQ:DP:FA 0/1:0,94:31:94:1.00 0:0,106:.:106:1.00


Google error:

ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0'

or ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: '-'


Google error:

(7419290b1f1d4627): Workflow failed. Causes: (60fa81fc1db4126b): S01:ReadFromVcf/Read+FilterVariants/ApplyFilters+VariantToBigQuery/ConvertToBigQueryTableRow+VariantToBigQuery/WriteToBigQuery/NativeWrite failed., (7419290b1f1d48da): BigQuery import job \"dataflow_job_7438122687236218418-B\" failed., (7419290b1f1d4cb5): BigQuery job \"dataflow_job_7438122687236218418-B\" in project \"neoepitopes\" finished with error(s): errorResult: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON parsing error in row starting at position 0: Repeated field must be imported as a JSON array. Field: call.AO.

The header:

FORMAT=

The line:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

x x x x x 1836 PASS AB=0.263804;ABP=160.982;AC=1;AF=0.25;AN=4;ANN=T|missense_variant|MODERATE|TTC34|ENSG00000215912|transcript|ENST00000401095.8|protein_coding|4/9|c.1801G>A|p.Ala601Thr|1983/8814|1801/3240|601/1079||,T|missense_variant|MODERATE|TTC34|ENSG00000215912|transcript|ENST00000637179.1|protein_coding|2/7|c.262G>A|p.Ala88Thr|336/1775|262/1701|88/566||;AO=86;CALLERS=freebayes,vardict;CIGAR=1X;DP=629;DPB=629;DPRA=1.07591;EPP=13.1102;EPPR=6.38592;GTI=0;LEN=1;MEANALT=2;MQM=60;MQMR=60;NS=2;NUMALT=1;ODDS=210.311;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=3203;QR=20670;RO=541;RPL=37;RPP=6.64625;RPPR=52.4645;RPR=49;RUN=1;SAF=46;SAP=3.91929;SAR=40;SOMATIC;SRF=299;SRP=16.0512;SRR=242;TYPE=snp;technology.illumina=1 GT:AD:AO:DP:GQ:PL:QA:QR:RO 0/1:239,86:86:326:99:1902,0,7191:3203:9086:239 0/0:302,0:0:303:99:0,909,10416:0:11584:302

arostamianfar commented 6 years ago

Ouch. It looks like there are several issues.


For the "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10" error: could you please confirm whether the VDB and RPB fields are defined as floats in the header? This error can happen if they were defined as integers. You can work around the issue either by changing the header to Float or by running the pipeline with --allow_malformed_records (such records will just get logged and won't be fatal).
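Not part of gcp-variant-transforms — just a minimal sketch of the header edit described above, assuming VDB and RPB are ##INFO fields that were declared as Type=Integer (the field names are taken from the error message; adjust the list to whatever your headers actually declare):

```python
import re

def retype_info_fields(header_line, fields=('VDB', 'RPB')):
    """If this is an ##INFO line for one of `fields` declared Type=Integer,
    rewrite it to Type=Float so float values no longer break int() parsing.
    All other lines are returned unchanged."""
    m = re.match(r'^##INFO=<ID=([^,>]+)', header_line)
    if m and m.group(1) in fields:
        return header_line.replace('Type=Integer', 'Type=Float')
    return header_line
```

Running every header line of the file through this before import (or writing the result out as a --representative_header_file) should sidestep the int() parse failure without touching any records.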


For the Number=A fields: this is actually a bug in our code! We have special logic to convert single-valued fields to a list for FORMAT fields, but only if the number is unknown (code is here). We should do this for all other types (not sure why I originally did not do it). We'll fix it very soon (Issue #122). If you can't wait for the fix to be pushed, you can change Number=A/G/R to Number=. in FORMAT fields only (please do not change any INFO fields).
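As a stopgap until the Issue #122 fix lands, the header tweak above could be scripted along these lines — a rough sketch, not project code, assuming the only change wanted is Number=A/G/R becoming Number=. on ##FORMAT lines:

```python
import re

def relax_format_numbers(line):
    """Rewrite Number=A, Number=G, or Number=R to Number=. on ##FORMAT
    header lines only; ##INFO and all other lines are deliberately
    left untouched, per the workaround described in this thread."""
    if line.startswith('##FORMAT='):
        return re.sub(r'Number=[AGR](?=[,>])', 'Number=.', line)
    return line
```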

Longer term (ETA a few weeks), #101 will address these issues and will provide a much more robust way of importing various VCF files. Thanks for bearing with us in the meantime!

I also noticed that one of your VCF files has ANN info field. We recently added native annotation support (it's still experimental), but once the new image is released (which includes the fix for the Number=A fields), you should be able to try with --annotation_fields ANN to split up the ANN field into separate columns. We'd appreciate your feedback on the annotation parsing if/when you try it out :)

nmousavi commented 6 years ago

A new image with the fix (PR #123) has been released.

andrewelamb commented 6 years ago

Thanks for the help!

The --allow_malformed_records flag and the fix in Issue #122 fixed most of the issues I was having.

In regard to the "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10" issue, the VDB and RPB fields were either floats, or weren't in the file at all.

I forgot to try the annotation fields parameter, but I have a bunch more to add, so I'll try it for those.

arostamianfar commented 6 years ago

Thanks! Please keep us posted on how things go and let us know if you encounter any other issues.