googlegenomics / gcp-variant-transforms


No error, but resulting table doesn't work #140

Closed andrewelamb closed 6 years ago

andrewelamb commented 6 years ago

Hello,

I have a set of VCFs that are converted into BigQuery tables without error, but the resulting tables aren't working in BigQuery.

The resulting schema looks like:

reference_name STRING NULLABLE Reference name.
start_position INTEGER NULLABLE Start position (0-based). Corresponds to the first base of the string of reference bases.
end_position INTEGER NULLABLE End position (0-based). Corresponds to the first base after the last base in the reference allele.
reference_bases STRING NULLABLE Reference bases.
alternate_bases RECORD REPEATED One record for each alternate base (if any).
alternate_bases.alt STRING NULLABLE Alternate base.
names STRING REPEATED Variant names (e.g. RefSNP ID).
quality FLOAT NULLABLE Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.
filter STRING REPEATED List of failed filters (if any) or "PASS" indicating the variant has passed all filters.
call RECORD REPEATED One record for each call.
call.name STRING NULLABLE Name of the call.
call.genotype INTEGER REPEATED Genotype of the call. "-1" is used in cases where the genotype is not called.
call.phaseset STRING NULLABLE Phaseset of the call (if any). "*" is used in cases where the genotype is phased, but no phase set ("PS" in FORMAT) was specified.
call.GQ INTEGER NULLABLE Genotype Quality
call.SP INTEGER NULLABLE Phred-scaled strand bias P-value
call.PL INTEGER REPEATED List of Phred-scaled genotype likelihoods
call.DV INTEGER NULLABLE # high-quality non-reference bases
call.GL FLOAT REPEATED Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)
call.DP INTEGER NULLABLE # high-quality bases

But the details show no data:

Table ID neoepitopes:r1_sorted_variants.eagle_1_TESLA_sorted
Table Size 0 B
Number of Rows 0

What's the best way of troubleshooting this?

arostamianfar commented 6 years ago

Questions:

andrewelamb commented 6 years ago

I did use --runner DataflowRunner and --allow_malformed_records. I did run this a few weeks ago, so that shouldn't be an issue. It seems more likely to me that all the records are malformed; I've had a lot of issues with these particular VCF files so far.
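For reference, the invocation looked roughly like this (a sketch with placeholder project, bucket, and table names; it assumes the tool's standard vcf_to_bq entry point):

python -m gcp_variant_transforms.vcf_to_bq \
  --project my-project \
  --input_pattern gs://my-bucket/*.vcf \
  --output_table my-project:my_dataset.my_table \
  --temp_location gs://my-bucket/temp \
  --runner DataflowRunner \
  --allow_malformed_records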

It does look like there are a bunch of errors; is it obvious what the issue is?

2018-03-15 (12:01:53) VCF record read failed in gs://r1_debugged_sorted_variants_syn11811333/eagle_2_TESLA_sorted.vcf for line 11 !!!!edited out!!!! GT:PL:GQ DP=121;VDB=1.444697e-01;RPB=2.266192e-01;AF1=0.5;AC1=1;DP4=22,21,24,21;MQ=60;FQ=225;PV4=1,1,1,1;ACGTNacgtnPLUS=10,0,12,0,0,15,0,13,0,0;ACGTNacgtnMINUS=16,0,10,0,0,20,0,19,0,0 DP=189;DP5=79,100,0,0,1;DP5all=84,103,0,0,2;ACGTNacgtnHQ=0,1,79,0,0,0,0,100,0,0;ACGTNacgtn=0,1,84,0,0,0,1,103,0,0;VAF=0.00;TSR=0;PBINOM=6.52530446799857e-55
timestamp: 2018-03-15T19:01:53.199592113Z
logger: root:filter_variants.py:_is_valid_record
severity: WARNING
worker: eagle2teslasorted-03151157-4bd6-harness-v7k6
step: FilterVariants/ApplyFilters
thread: 239:140370905691904

From the console:

Step summary:
Step name: FilterVariants
Wall time: 0 sec
Input collections: ReadFromVcf/Read.out0 (Elements added: 837, Estimated size: 590.97 KB)
Output collections: FilterVariants/ApplyFilters.out0 (Elements added: 0, Estimated size: –)

arostamianfar commented 6 years ago

Thanks for providing the details! Yes, it looks like everything was filtered out ("Elements added" is zero after the filter step). Unfortunately, I just noticed that we don't log the error (filed Issue #144), but since the file is small, you can try rerunning without --allow_malformed_records to see what the error is (the pipeline will fail in this case, though). Is it the same 'float' vs 'integer' issue as in #119? Could you please confirm whether VDB, RPB, and PBINOM are defined as Float in the header?
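If it helps, here is a minimal sketch for checking the declared types directly in the header without rerunning the pipeline (the filename is a placeholder; it assumes a local plain or gzipped copy of the VCF):

# Minimal sketch: print the header definitions of a few INFO fields so their
# declared Type (e.g. Float vs Integer) can be inspected.
import gzip

FIELDS = ("VDB", "RPB", "PBINOM")

def print_info_definitions(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as vcf:
        for line in vcf:
            if not line.startswith("##"):
                break  # past the meta-information header
            if line.startswith("##INFO=<") and any(
                    "ID=%s," % field in line for field in FIELDS):
                print(line.rstrip())

print_info_definitions("eagle_1_TESLA_sorted.vcf")  # hypothetical local copy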

P.S. we are making progress on making Variant Transforms more robust by inferring header types and dynamically resolving these type conflicts, which should enable your files to be loaded seamlessly. Please see our design doc and feel free to make comments. (@nmousavi FYI).

andrewelamb commented 6 years ago

So, it looks like PBINOM is missing from the header; VDB and RPB are defined as floats, however. I'll try adding it and see if that works; if not, I'll rerun without --allow_malformed_records and see what the error is.
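For reference, the header line I'm adding looks like this (the Description text is just a placeholder):

##INFO=<ID=PBINOM,Number=1,Type=Float,Description="Binomial P-value">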

arostamianfar commented 6 years ago

Thanks for the update. To clarify: are VDB and RPB defined properly in the header as well? If so, they should be visible in the BigQuery schema (the schema you posted doesn't include those fields). Would it be possible for you to provide just a few lines of the VCF file (header + a few records)? You may replace sensitive fields with dummy values. We are very curious why the error is happening, and it may be easier for us to debug if we have access to similarly formatted data.

andrewelamb commented 6 years ago

Sure. After going back through the VCFs I also noticed that the columns were not in the same order as in all my other VCFs, so I fixed that as well. Here's an example header and a few lines after fixing:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file:///icgc/ngs_share/assemblies/hg19_GRCh37_1000genomes/sequence/1KGRef/hs37d5.fa
##contig=<...> (86 ##contig lines; the angle-bracketed contents were stripped by the issue formatting)
##TUMOR=<...> (25 ##TUMOR lines; contents stripped)
##NORMAL=<...> (25 ##NORMAL lines; contents stripped)
##FORMAT=<...> (7 ##FORMAT lines; contents stripped)
##Liftover=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

x x x x x 210 PASS . GT:PL:GQ DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0 DP=256;DP5=142,107,0,0,0;DP5all=145,111,0,0,0;ACGTNacgtnHQ=0,142,0,0,0,0,107,0,0,0;ACGTNacgtn=0,145,0,0,0,0,111,0,0,0;VAF=0.00;TSR=0;PBINOM=1.10542957505211e-75
x x x x x 225 PASS . GT:PL:GQ DP=330;VDB=7.020888e-02;RPB=-5.294533e-01;AF1=0.5;AC1=1;DP4=70,40,91,23;MQ=60;FQ=225;PV4=0.069,1,1,1;ACGTNacgtnPLUS=46,0,39,0,0,30,0,26,1,0;ACGTNacgtnMINUS=49,0,31,0,0,39,0,43,0,0 DP=313;DP5=142,145,0,0,0;DP5all=156,156,0,0,1;ACGTNacgtnHQ=0,0,142,0,0,0,0,145,0,0;ACGTNacgtn=0,0,156,0,0,0,0,156,1,0;VAF=0.00;TSR=0;PBINOM=4.02152936677193e-87

andrewelamb commented 6 years ago

So it looks like adding the PBINOM header and reordering the column headers didn't fix the issue. I'll rerun without --allow_malformed_records.

arostamianfar commented 6 years ago

Thanks for providing the example! I now know what's going on :)

The format of the VCF file is pretty strange and, as far as I can tell, is not compatible with the VCF spec. In particular, the values under TUMOR and NORMAL must conform to the layout defined by the FORMAT column (i.e. given that the format is defined as GT:PL:GQ, the value must be something like 0/0:1,2,3:200). However, the provided values look more like INFO fields (key=value pairs with = signs). The fields themselves are defined under ##NORMAL and ##TUMOR in the header, which are not standard VCF representations and are basically ignored by the parser. In short, we are unfortunately not able to parse a VCF file in this format.

The current solution is to rewrite the VCF file to conform to the spec, which essentially means rewriting the header and the data fields; a sketch of what that could look like follows.
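To illustrate (a hypothetical sketch, not the submitter's actual data): the key=value annotations would move into the INFO column, each key prefixed with its sample name so TUMOR and NORMAL stay distinct and declared via a standard ##INFO header line, while any genuine per-sample values would follow the GT:PL:GQ layout:

##INFO=<ID=TUMOR_DP,Number=1,Type=Integer,Description="Tumor read depth">
##INFO=<ID=NORMAL_DP,Number=1,Type=Integer,Description="Normal read depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
1 12345 . A G 210 PASS TUMOR_DP=245;NORMAL_DP=256 GT:PL:GQ 0/1:255,0,200:99 0/0:0,60,255:99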

Questions:

CC @deflaux and @mbookman in case they have seen this format before.

andrewelamb commented 6 years ago

So, to give some background, we're running a Kaggle-like challenge:

http://dreamchallenges.org/sagesynapse/

The participants don't have to provide their pipeline details, but some did; I can check whether this team was one of them. These are from the first round, where we didn't set any rules on the submitted VCFs, but we've since imposed requirements that have solved most of these issues.

andrewelamb commented 6 years ago

Here are the errors I'm getting when running without --allow_malformed_records:

[1] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0'"
[2] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=148;VDB=1.892723e-01;RPB=2.858800e-01;AF1=0.5;AC1=1;DP4=59,11,26,5;MQ=60;FQ=225;PV4=0.85,1,1,0.036;ACGTNacgtnPLUS=15,0,28,0,0,9,0,21,0,0;ACGTNacgtnMINUS=11,0,31,0,0,6,0,22,0,0'" [3] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=2456;VDB=1.682483e-01;AF1=1;AC1=2;DP4=0,0,1104,412;MQ=60;FQ=-282;ACGTNacgtnPLUS=0,0,0,0,0,0,0,0,0,0;ACGTNacgtnMINUS=1109,0,0,0,0,1305,0,0,0,0'"
[4] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=190;VDB=1.407824e-01;RPB=-9.604686e-01;AF1=0;AC1=0;DP4=97,23,8,2;MQ=60;FQ=-134;PV4=0.79,1,1,0.41;ACGTNacgtnPLUS=5,0,54,0,0,3,0,39,0,0;ACGTNacgtnMINUS=3,0,43,0,0,4,0,32,0,0'"

arostamianfar commented 6 years ago

I see. Yes, these failures are expected: the parser is trying to parse the "GT" field (which should be a list of integers) and is instead finding the long string starting with DP=.
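The failure mode can be reproduced in isolation; when parsing the GT value, the parser effectively ends up attempting something like this (a Python interpreter sketch):

>>> int('DP=245;VDB=4.073709e-01')
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01'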

Unfortunately, I can't see any way to load these VCF files to BigQuery without reformatting them to conform to the spec. And since it doesn't look like there is even a spec for this type of VCF, we can't change our parser to load it.

You may use the VCF validator tool to ensure the VCF files are valid: https://github.com/ebivariation/vcf-validator

arostamianfar commented 6 years ago

To add: if you decide to write a script to reformat the VCF files, you can use dsub to run pipelines in parallel if you have a large number of these files.
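If it helps as a starting point, here is a minimal, hypothetical sketch of such a reformatting script (untested against these files; it folds the INFO-style sample strings into the INFO column with sample-name prefixes and emits a sites-only VCF; the matching ##INFO definitions for the new keys would still need to be added to the header):

# Hypothetical sketch: fold INFO-style TUMOR/NORMAL sample strings into the
# INFO column and emit a sites-only VCF.
# Usage: python fold_vcf.py < input.vcf > output.vcf
import sys

def fold_sample_info(line):
    fields = line.rstrip("\n").split("\t")
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
    info, samples = fields[7], fields[9:11]
    extra = []
    for name, value in zip(("TUMOR", "NORMAL"), samples):
        # Prefix each key with its sample so TUMOR and NORMAL stay distinct.
        extra.append(";".join("%s_%s" % (name, kv) for kv in value.split(";")))
    parts = ([] if info == "." else [info]) + extra
    # Keep CHROM..FILTER, replace INFO, and drop FORMAT/sample columns.
    return "\t".join(fields[:7] + [";".join(parts)]) + "\n"

for raw in sys.stdin:
    if raw.startswith("#CHROM"):
        # Drop FORMAT/TUMOR/NORMAL from the column header line as well.
        sys.stdout.write("\t".join(raw.rstrip("\n").split("\t")[:8]) + "\n")
    elif raw.startswith("#"):
        sys.stdout.write(raw)
    else:
        sys.stdout.write(fold_sample_info(raw))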