googlegenomics / gcp-variant-transforms


No error, but resulting table doesn't work #140

Closed andrewelamb closed 6 years ago

andrewelamb commented 6 years ago

Hello,

I have a set of VCFs that are converted into BigQuery tables without error, but the resulting tables aren't working in BigQuery.

The resulting schema looks like:

reference_name STRING NULLABLE Reference name.
start_position INTEGER NULLABLE Start position (0-based). Corresponds to the first base of the string of reference bases.
end_position INTEGER NULLABLE End position (0-based). Corresponds to the first base after the last base in the reference allele.
reference_bases STRING NULLABLE Reference bases.
alternate_bases RECORD REPEATED One record for each alternate base (if any).
alternate_bases.alt STRING NULLABLE Alternate base.
names STRING REPEATED Variant names (e.g. RefSNP ID).
quality FLOAT NULLABLE Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.
filter STRING REPEATED List of failed filters (if any) or "PASS" indicating the variant has passed all filters.
call RECORD REPEATED One record for each call.
call.name STRING NULLABLE Name of the call.
call.genotype INTEGER REPEATED Genotype of the call. "-1" is used in cases where the genotype is not called.
call.phaseset STRING NULLABLE Phaseset of the call (if any). "*" is used in cases where the genotype is phased, but no phase set ("PS" in FORMAT) was specified.
call.GQ INTEGER NULLABLE Genotype Quality
call.SP INTEGER NULLABLE Phred-scaled strand bias P-value
call.PL INTEGER REPEATED List of Phred-scaled genotype likelihoods
call.DV INTEGER NULLABLE # high-quality non-reference bases
call.GL FLOAT REPEATED Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)
call.DP INTEGER NULLABLE # high-quality bases

But the details show no data:

Table ID neoepitopes:r1_sorted_variants.eagle_1_TESLA_sorted
Table Size 0 B
Number of Rows 0

What's the best way of troubleshooting this?

arostamianfar commented 6 years ago

Questions:

andrewelamb commented 6 years ago

I did use --runner DataflowRunner and --allow_malformed_records. I did run this a few weeks ago, so that shouldn't be an issue. It seems more likely to me that all the records are malformed; I've had a lot of issues with these particular VCF files so far.
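For reference, the invocation looked roughly like this (a sketch with placeholder project, bucket, and table names; it assumes the tool's standard vcf_to_bq entry point):

python -m gcp_variant_transforms.vcf_to_bq \
  --project my-project \
  --input_pattern gs://my-bucket/*.vcf \
  --output_table my-project:my_dataset.my_table \
  --temp_location gs://my-bucket/temp \
  --runner DataflowRunner \
  --allow_malformed_records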

It does look like there are a bunch of errors; is it obvious what the issue is?

2018-03-15 (12:01:53) VCF record read failed in gs://r1_debugged_sorted_variants_syn11811333/eagle_2_TESLA_sorted.vcf for line 11 !!!!edited out!!!! GT:PL:GQ DP=121;VDB=1.444697e-01;RPB=2.266192e-01;AF1=0.5;AC1=1;DP4=22,21,24,21;MQ=60;FQ=225;PV4=1,1,1,1;ACGTNacgtnPLUS=10,0,12,0,0,15,0,13,0,0;ACGTNacgtnMINUS=16,0,10,0,0,20,0,19,0,0 DP=189;DP5=79,100,0,0,1;DP5all=84,103,0,0,2;ACGTNacgtnHQ=0,1,79,0,0,0,0,100,0,0;ACGTNacgtn=0,1,84,0,0,0,1,103,0,0;VAF=0.00;TSR=0;PBINOM=6.52530446799857e-55
timestamp: 2018-03-15T19:01:53.199592113Z
logger: root:filter_variants.py:_is_valid_record
severity: WARNING
worker: eagle2teslasorted-03151157-4bd6-harness-v7k6
step: FilterVariants/ApplyFilters
thread: 239:140370905691904

From the console:

Step summary:
Step name: FilterVariants
Wall time: 0 sec
Input collections: ReadFromVcf/Read.out0 (Elements added: 837, Estimated size: 590.97 KB)
Output collections: FilterVariants/ApplyFilters.out0 (Elements added: 0, Estimated size: –)

arostamianfar commented 6 years ago

Thanks for providing the details! Yes, it looks like everything was filtered out ("Elements added" is zero after the filter step). Unfortunately, I just noticed that we don't log the error (filed Issue #144), but since the file is small, you can try rerunning without --allow_malformed_records to see what the error is (the pipeline will fail in this case, though). Is it the same 'float' vs 'integer' issue as in #119? Could you please confirm whether VDB, RPB, and PBINOM are defined as Float in the header?
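If it helps, here is a minimal sketch for checking the declared types directly in the header without rerunning the pipeline (the filename is a placeholder; it assumes a local plain or gzipped copy of the VCF):

# Minimal sketch: print the header definitions of a few INFO fields so their
# declared Type (e.g. Float vs Integer) can be inspected.
import gzip

FIELDS = ("VDB", "RPB", "PBINOM")

def print_info_definitions(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as vcf:
        for line in vcf:
            if not line.startswith("##"):
                break  # past the meta-information header
            if line.startswith("##INFO=<") and any(
                    "ID=%s," % field in line for field in FIELDS):
                print(line.rstrip())

print_info_definitions("eagle_1_TESLA_sorted.vcf")  # hypothetical local copy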

P.S. we are making progress on making Variant Transforms more robust by inferring header types and dynamically resolving these type conflicts, which should enable your files to be loaded seamlessly. Please see our design doc and feel free to make comments. (@nmousavi FYI).

andrewelamb commented 6 years ago

So, it looks like PBINOM is missing from the header; VDB and RPB are defined as floats, however. I'll try adding it and see if that works; if not, I'll rerun without --allow_malformed_records and see what the error is.
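For reference, the header line I'm adding looks like this (the Description text is just a placeholder):

##INFO=<ID=PBINOM,Number=1,Type=Float,Description="Binomial P-value">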

arostamianfar commented 6 years ago

Thanks for the update. To clarify: are VDB and RPB defined properly in the header as well? If so, they should be visible in the BigQuery schema (the schema you posted doesn't include those fields). Would it be possible for you to provide just a few lines of the VCF file (header + a few records)? You may replace sensitive fields with dummy values. We are very curious why the error is happening, and it may be easier for us to debug if we have access to similarly formatted data.

andrewelamb commented 6 years ago

Sure. After going back through the VCFs I also noticed that the columns were not in the same order as in all my other VCFs, so I fixed that as well. Here's an example header and a few lines after fixing:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file:///icgc/ngs_share/assemblies/hg19_GRCh37_1000genomes/sequence/1KGRef/hs37d5.fa
##contig=<...> (86 ##contig lines; the angle-bracketed contents were stripped by the issue formatting)
##TUMOR=<...> (25 ##TUMOR lines; contents stripped)
##NORMAL=<...> (25 ##NORMAL lines; contents stripped)
##FORMAT=<...> (7 ##FORMAT lines; contents stripped)
##Liftover=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

x x x x x 210 PASS . GT:PL:GQ DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0 DP=256;DP5=142,107,0,0,0;DP5all=145,111,0,0,0;ACGTNacgtnHQ=0,142,0,0,0,0,107,0,0,0;ACGTNacgtn=0,145,0,0,0,0,111,0,0,0;VAF=0.00;TSR=0;PBINOM=1.10542957505211e-75
x x x x x 225 PASS . GT:PL:GQ DP=330;VDB=7.020888e-02;RPB=-5.294533e-01;AF1=0.5;AC1=1;DP4=70,40,91,23;MQ=60;FQ=225;PV4=0.069,1,1,1;ACGTNacgtnPLUS=46,0,39,0,0,30,0,26,1,0;ACGTNacgtnMINUS=49,0,31,0,0,39,0,43,0,0 DP=313;DP5=142,145,0,0,0;DP5all=156,156,0,0,1;ACGTNacgtnHQ=0,0,142,0,0,0,0,145,0,0;ACGTNacgtn=0,0,156,0,0,0,0,156,1,0;VAF=0.00;TSR=0;PBINOM=4.02152936677193e-87

andrewelamb commented 6 years ago

So it looks like adding the PBINOM header and reordering the column headers didn't fix the issue. I'll rerun without --allow_malformed_records.

arostamianfar commented 6 years ago

Thanks for providing the example! I now know what's going on :)

The format of the VCF file is pretty strange and, as far as I can tell, is not compatible with the VCF spec. In particular, the values under TUMOR and NORMAL must conform to the layout defined by the FORMAT column (i.e. given that the format is defined as GT:PL:GQ, the value must be something like 0/0:1,2,3:200). However, the provided values look more like INFO fields (key=value pairs with = signs). The fields themselves are defined under ##NORMAL and ##TUMOR in the header, which are not standard VCF representations and are basically ignored by the parser. In short, we are unfortunately not able to parse a VCF file in this format.

The current solution is to rewrite the VCF file to conform to the spec, which essentially means rewriting the header and the data fields; a sketch of what that could look like follows.
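To illustrate (a hypothetical sketch, not the submitter's actual data): the key=value annotations would move into the INFO column, each key prefixed with its sample name so TUMOR and NORMAL stay distinct and declared via a standard ##INFO header line, while any genuine per-sample values would follow the GT:PL:GQ layout:

##INFO=<ID=TUMOR_DP,Number=1,Type=Integer,Description="Tumor read depth">
##INFO=<ID=NORMAL_DP,Number=1,Type=Integer,Description="Normal read depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
1 12345 . A G 210 PASS TUMOR_DP=245;NORMAL_DP=256 GT:PL:GQ 0/1:255,0,200:99 0/0:0,60,255:99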

Questions:

CC @deflaux and @mbookman in case they have seen this format before.

andrewelamb commented 6 years ago

So, to give some background, we're running a Kaggle-like challenge:

http://dreamchallenges.org/sagesynapse/

The participants don't have to provide their pipeline details, but some did; I can check whether this team was one of them. These are from the first round, where we didn't set any rules on the submitted VCFs, but we've since imposed requirements that have solved most of these issues.

andrewelamb commented 6 years ago

Here are the errors I'm getting when running without --allow_malformed_records:

[1] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01;RPB=3.592296e-01;AF1=0.5;AC1=1;DP4=98,35,35,10;MQ=60;FQ=213;PV4=1,1,1,1;ACGTNacgtnPLUS=0,40,0,18,0,0,28,0,12,0;ACGTNacgtnMINUS=0,58,0,17,0,0,46,0,16,0'"
[2] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=148;VDB=1.892723e-01;RPB=2.858800e-01;AF1=0.5;AC1=1;DP4=59,11,26,5;MQ=60;FQ=225;PV4=0.85,1,1,0.036;ACGTNacgtnPLUS=15,0,28,0,0,9,0,21,0,0;ACGTNacgtnMINUS=11,0,31,0,0,6,0,22,0,0'" [3] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=2456;VDB=1.682483e-01;AF1=1;AC1=2;DP4=0,0,1104,412;MQ=60;FQ=-282;ACGTNacgtnPLUS=0,0,0,0,0,0,0,0,0,0;ACGTNacgtnMINUS=1109,0,0,0,0,1305,0,0,0,0'"
[4] "ValueError: Invalid record in VCF file. Error: invalid literal for int() with base 10: 'DP=190;VDB=1.407824e-01;RPB=-9.604686e-01;AF1=0;AC1=0;DP4=97,23,8,2;MQ=60;FQ=-134;PV4=0.79,1,1,0.41;ACGTNacgtnPLUS=5,0,54,0,0,3,0,39,0,0;ACGTNacgtnMINUS=3,0,43,0,0,4,0,32,0,0'"

arostamianfar commented 6 years ago

I see. Yes, these failures are expected: the parser is trying to parse the "GT" field (which should be a list of integers) and is instead finding the long string starting with DP=.
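The failure mode can be reproduced in isolation; when parsing the GT value, the parser effectively ends up attempting something like this (a Python interpreter sketch):

>>> int('DP=245;VDB=4.073709e-01')
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: 'DP=245;VDB=4.073709e-01'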

Unfortunately, I can't see any way to load these VCF files to BigQuery without reformatting them to conform to the spec. And since it doesn't look like there is even a spec for this type of VCF, we can't change our parser to load it.

You may use the VCF validator tool to ensure the VCF files are valid: https://github.com/ebivariation/vcf-validator

arostamianfar commented 6 years ago

To add: if you decide to write a script to reformat the VCF files, you can use dsub to run pipelines in parallel if you have a large number of these files.
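If it helps as a starting point, here is a minimal, hypothetical sketch of such a reformatting script (untested against these files; it folds the INFO-style sample strings into the INFO column with sample-name prefixes and emits a sites-only VCF; the matching ##INFO definitions for the new keys would still need to be added to the header):

# Hypothetical sketch: fold INFO-style TUMOR/NORMAL sample strings into the
# INFO column and emit a sites-only VCF.
# Usage: python fold_vcf.py < input.vcf > output.vcf
import sys

def fold_sample_info(line):
    fields = line.rstrip("\n").split("\t")
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
    info, samples = fields[7], fields[9:11]
    extra = []
    for name, value in zip(("TUMOR", "NORMAL"), samples):
        # Prefix each key with its sample so TUMOR and NORMAL stay distinct.
        extra.append(";".join("%s_%s" % (name, kv) for kv in value.split(";")))
    parts = ([] if info == "." else [info]) + extra
    # Keep CHROM..FILTER, replace INFO, and drop FORMAT/sample columns.
    return "\t".join(fields[:7] + [";".join(parts)]) + "\n"

for raw in sys.stdin:
    if raw.startswith("#CHROM"):
        # Drop FORMAT/TUMOR/NORMAL from the column header line as well.
        sys.stdout.write("\t".join(raw.rstrip("\n").split("\t")[:8]) + "\n")
    elif raw.startswith("#"):
        sys.stdout.write(raw)
    else:
        sys.stdout.write(fold_sample_info(raw))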