[Closed] andrewelamb closed this issue 6 years ago
Questions:
- Did you use --runner DataflowRunner? Otherwise, it uses DirectRunner, which for some reason does not immediately update the table summary page (it takes an hour or so). You can, however, query the table immediately and you should see the results.
- Did you use --allow_malformed_records? It is possible that all records were considered "malformed" by the pipeline. You can look at the logs to see the records that were ignored. If using DataflowRunner, you can navigate to the Dataflow Console, click on the "FilterVariants" step, and see whether there are any warning logs about records being ignored.

I did use --runner DataflowRunner and --allow_malformed_records. I did run this a few weeks ago, so that shouldn't be an issue. It seems more likely that all the records are malformed; I've had a lot of issues with these particular VCF files so far.
It does look like there are a bunch of errors; is it obvious what the issue is?
2018-03-15 (12:01:53) VCF record read failed in gs://r1_debugged_sorted_variants_syn11811333/eagle_2_TESLA_sorted.vcf for ... VCF record read failed in gs://r1_debugged_sorted_variants_syn11811333/eagle_2_TESLA_sorted.vcf for line 11 !!!!edited out!!!! GT:PL:GQ DP=121;VDB=1.444697e-01;RPB=2.266192e-01;AF1=0.5;AC1=1;DP4=22,21,24,21;MQ=60;FQ=225;PV4=1,1,1,1;ACGTNacgtnPLUS=10,0,12,0,0,15,0,13,0,0;ACGTNacgtnMINUS=16,0,10,0,0,20,0,19,0,0 DP=189;DP5=79,100,0,0,1;DP5all=84,103,0,0,2;ACGTNacgtnHQ=0,1,79,0,0,0,0,100,0,0;ACGTNacgtn=0,1,84,0,0,0,1,103,0,0;VAF=0.00;TSR=0;PBINOM=6.52530446799857e-55. timestamp 2018-03-15T19:01:53.199592113Z logger root:filter_variants.py:_is_valid_record severity WARNING worker eagle2teslasorted-03151157-4bd6-harness-v7k6 step FilterVariants/ApplyFilters thread 239:140370905691904
From the console:
Step summary:
- Step name: FilterVariants
- Wall time: 0 sec
- Input collections: ReadFromVcf/Read.out0 (Elements added: 837; Estimated size: 590.97 KB)
- Output collections: FilterVariants/ApplyFilters.out0 (Elements added: 0; Estimated size: –)
Thanks for providing the details! Yes, it looks like everything was filtered out ("Elements added" is zero after the filter). Unfortunately, I just noticed that we don't log the error (filed Issue #144), but since the file is small, you can try rerunning without --allow_malformed_records to see what the error is (the pipeline will fail in this case, though). Is it the same issue as the 'float' vs 'integer' issue from #119?
Could you please confirm whether VDB, RPB, and PBINOM are defined as Float in the header?
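A quick way to answer this kind of question is to scan the header's ##INFO lines for each field's declared Type. The snippet below is a minimal sketch I wrote for illustration (it is not part of Variant Transforms, and the sample header is made up to mirror the fields discussed here):

```python
# Hypothetical helper: map INFO field ID -> declared Type from ##INFO lines.
import re

header = """\
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
"""

def declared_types(header_text):
    """Return {field_id: declared_type} for every ##INFO header line."""
    types = {}
    for line in header_text.splitlines():
        m = re.match(r'##INFO=<ID=([^,]+),[^>]*Type=([^,>]+)', line)
        if m:
            types[m.group(1)] = m.group(2)
    return types

print(declared_types(header).get('VDB'))     # Float
print(declared_types(header).get('PBINOM'))  # None -> missing from the header
```

A field that comes back as None (like PBINOM here) is simply not declared, which is exactly the situation reported in the next comment.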
P.S. we are making progress on making Variant Transforms more robust by inferring header types and dynamically resolving these type conflicts, which should enable your files to be loaded seamlessly. Please see our design doc and feel free to make comments. (@nmousavi FYI).
So, it looks like PBINOM is missing from the header; VDB and RPB are defined as Float, however. I'll try adding the PBINOM definition and see if it works; if not, I'll run without --allow_malformed_records and see what the error is.
Thanks for the update. To clarify: are VDB and RPB defined properly in the header as well? If so, they should be visible in the BigQuery schema (the schema you shared doesn't include these fields). Would it be possible for you to provide just a few lines of the VCF file (header plus a few records)? You may replace sensitive fields with dummy values. We are very curious why the error is happening, and it may be easier for us to debug if we had access to similarly formatted data.
Sure. After going back through the VCFs I also noticed that the columns were not in the same order as in all my other VCFs, so I fixed this as well. Here's an example header and a few lines after fixing:
x x x x x 210 PASS . GT:PL:GQ DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0 DP=256;DP5=142,107,0,0,0;DP5all=145,111,0,0,0;ACGTNacgtnHQ=0,142,0,0,0,0,107,0,0,0;ACGTNacgtn=0,145,0,0,0,0,111,0,0,0;VAF=0.00;TSR=0;PBINOM=1.10542957505211e-75
x x x x x 225 PASS . GT:PL:GQ DP=330;VDB=7.020888e-02;RPB=-5.294533e-01;AF1=0.5;AC1=1;DP4=70,40,91,23;MQ=60;FQ=225;PV4=0.069,1,1,1;ACGTNacgtnPLUS=46,0,39,0,0,30,0,26,1,0;ACGTNacgtnMINUS=49,0,31,0,0,39,0,43,0,0 DP=313;DP5=142,145,0,0,0;DP5all=156,156,0,0,1;ACGTNacgtnHQ=0,0,142,0,0,0,0,145,0,0;ACGTNacgtn=0,0,156,0,0,0,0,156,1,0;VAF=0.00;TSR=0;PBINOM=4.02152936677193e-87
So it looks like adding the PBINOM header and reordering the column headers didn't fix the issue. I'll rerun without --allow_malformed_records.
Thanks for providing the example! I now know what's going on :)
The format of the VCF file is pretty strange and, as far as I can tell, is not compatible with the VCF spec. In particular, the fields under TUMOR and NORMAL must conform to the spec defined by FORMAT (i.e. given that the format is defined as GT:PL:GQ, the value must be something like 0/0:1,2,3:200). However, the provided values look more like INFO fields (with = signs).
The definitions of these fields appear under NORMAL and TUMOR in the header (using ##NORMAL and ##TUMOR), which are not standard VCF representations and are basically ignored by the parser. In short, we are unfortunately not able to parse a VCF file in this format.
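The distinction between a FORMAT-conformant sample value and the INFO-style strings in this file can be sketched with a toy check. This is purely illustrative (the heuristic and function name are my own, not the actual Variant Transforms parser):

```python
# Illustrative heuristic: a sample column must hold colon-separated values
# matching the FORMAT keys, and FORMAT values never contain key=value pairs.
def looks_like_format_value(fmt, value):
    keys = fmt.split(':')
    parts = value.split(':')
    # No more parts than declared keys, and no '=' signs anywhere.
    return len(parts) <= len(keys) and '=' not in value

# Spec-compliant sample value for FORMAT "GT:PL:GQ":
print(looks_like_format_value('GT:PL:GQ', '0/0:1,2,3:200'))  # True
# The INFO-style strings found in this file:
print(looks_like_format_value('GT:PL:GQ', 'DP=245;VDB=4.073709e-01'))  # False
```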
The current solution is to rewrite the VCF file to conform to the spec, which essentially means rewriting the header and the fields as follows:
- ##TUMOR and ##NORMAL should be replaced with ##FORMAT definitions.
- Instead of GT:PL:GQ, the FORMAT column should list the actual fields (i.e. it should be DP:VDB:RPB:...).
- Each key=value pair should be replaced with just value (the key is defined under FORMAT).
cc @deflaux and @mbookman in case they have seen this format before.
So, to give some background: we're running a Kaggle-like challenge:
http://dreamchallenges.org/sagesynapse/
The participants don't have to provide their pipeline details, but some did. I can see if this team was one of those. These are from the first round where we didn't set any rules on the submitted vcfs, but we've since imposed some requirements that solved most of these issues.
Here are the errors I'm getting when running without --allow_malformed_records:
[1] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0'"
[2] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=148;VDB=1.892723e-01;RPB=2.858800e-01;AF1=0.5;AC1=1;DP4=59,11,26,5;MQ=60;FQ=225;PV4=0.85,1,1,0.036;ACGTNacgtnPLUS=15,0,28,0,0,9,0,21,0,0;ACGTNacgtnMINUS=11,0,31,0,0,6,0,22,0,0'"
[3] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=2456;VDB=1.682483e-01;AF1=1;AC1=2;DP4=0,0,1104,412;MQ=60;FQ=-282;ACGTNacgtnPLUS=0,0,0,0,0,0,0,0,0,0;ACGTNacgtnMINUS=1109,0,0,0,0,1305,0,0,0,0'"
[4] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=190;VDB=1.407824e-01;RPB=-9.604686e-01;AF1=0;AC1=0;DP4=97,23,8,2;MQ=60;FQ=-134;PV4=0.79,1,1,0.41;ACGTNacgtnPLUS=5,0,54,0,0,3,0,39,0,0;ACGTNacgtnMINUS=3,0,43,0,0,4,0,32,0,0'"
I see. Yes, these failures are expected, as the parser is trying to parse the "GT" field (which should be a list of integers) and is instead finding the long string starting with DP=....
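The failure mode is easy to reproduce in isolation: Python's int() raises exactly this ValueError when handed the INFO-style string where a genotype integer was expected.

```python
# Minimal standalone reproduction of the parse failure described above.
value = 'DP=245;VDB=4.073709e-01'
try:
    int(value)  # the parser expects an integer GT value here
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01'
```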
Unfortunately, I can't see any way to load these VCF files into BigQuery without reformatting them to conform to the spec. And since there doesn't appear to be even a spec for this type of VCF, we can't change our parser to load it.
You may use the VCF validator tool to ensure the VCF files are valid: https://github.com/ebivariation/vcf-validator
To add: if you decide to write a script to reformat the VCF files, you can use dsub to run pipelines in parallel if you have a large number of these files.
Hello,
I have a set of VCFs that get made into BigQuery tables without error, but that aren't working in BigQuery.
The resulting schema looks like: [screenshot]
But the details show no data: [screenshot]
What's the best way of troubleshooting this?