projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

VCF Infinity/NaN values are not handled according to VCF spec #517

Closed dtzeng closed 1 year ago

dtzeng commented 2 years ago

According to the VCF spec (https://samtools.github.io/hts-specs/VCFv4.3.pdf), bottom of page 5:

Float (32-bit IEEE-754, formatted to match one of the regular expressions ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ or ^[-+]?(INF|INFINITY|NAN)$ case insensitively)

The footnote at the bottom of page 5 explicitly calls this out:

Note Java’s Double.valueOf is particular about capitalisation, so additional code is needed to parse all VCF infinite/NaN values.
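To make the footnote concrete, here is a minimal sketch that checks which spellings plain JVM parsing accepts. The object name `JavaDoubleParsingDemo` and the helper `accepted` are made up for illustration and are not part of Glow; the only real calls are Scala's `String.toDouble`, which delegates to `java.lang.Double.parseDouble`, and `scala.util.Try`.

```scala
import scala.util.Try

object JavaDoubleParsingDemo {
  // Spellings allowed by the spec's second regular expression,
  // including the ones used in the sample inputs of this issue.
  val tokens: Seq[String] =
    Seq("Infinity", "-Infinity", "NaN", "infinity", "inf", "INF", "nan", "-nan")

  // True when String.toDouble (which calls java.lang.Double.parseDouble)
  // accepts the token without throwing NumberFormatException.
  def accepted(token: String): Boolean = Try(token.toDouble).isSuccess

  def main(args: Array[String]): Unit =
    tokens.foreach { t =>
      val outcome = if (accepted(t)) "parsed" else "NumberFormatException"
      println(s"$t -> $outcome")
    }
}
```

Only the exact spellings `Infinity` and `NaN` (with an optional sign) get through; every other capitalisation or abbreviation is rejected, even though the spec allows it.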

Glow seems to rely on Double.valueOf alone for parsing, so not all spec-valid infinity/NaN spellings in INFO and genotype fields are parsed correctly.
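One possible shape for the "additional code" the spec footnote mentions is sketched below. It is not Glow's implementation or its eventual fix; `VcfFloat.parse` is a hypothetical helper that normalises the special spellings case-insensitively and falls back to ordinary JVM parsing for everything else.

```scala
import java.util.Locale

object VcfFloat {
  // Hypothetical helper (not part of Glow): parse a VCF Float token into a Double,
  // accepting INF, INFINITY and NAN in any capitalisation with an optional sign.
  def parse(token: String): Double = {
    // Split off an optional leading sign so the keyword comparison ignores it.
    val (negative, body) = token.headOption match {
      case Some('-') => (true, token.substring(1))
      case Some('+') => (false, token.substring(1))
      case _         => (false, token)
    }
    body.toUpperCase(Locale.ROOT) match {
      case "INF" | "INFINITY" =>
        if (negative) Double.NegativeInfinity else Double.PositiveInfinity
      case "NAN" =>
        Double.NaN // the sign is dropped; NaN is returned either way
      case _ =>
        // Ordinary numbers: defer to the JVM, which throws
        // NumberFormatException for anything it cannot parse.
        token.toDouble
    }
  }
}
```

With this sketch, `VcfFloat.parse("infinity")` and `VcfFloat.parse("inf")` both yield positive infinity, and `VcfFloat.parse("-nan")` yields NaN. The fallback is slightly more permissive than the spec's first regular expression (the JVM also accepts forms such as `1.` or a trailing `d`), so a strict reader would need to validate against that expression as well.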

Sample input 1 (INFO field failure):

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype posterior probabilities in the range 0 to 1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=TEST,Number=1,Type=Float,Description="test">
##contig=<ID=chr1,length=248956422>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    10230   .       AC      A       20.99   PASS    AC=1;TEST=infinity      GT:GP   0/1:NaN

Code:

import io.projectglow.Glow
val sess = Glow.register(spark)
val df = sess.read.format("vcf").option("flattenInfoFields", "true").option("validationStringency", "strict").load(<path to file contents above>)
df.show()

errors with

Caused by: java.lang.IllegalArgumentException: Could not parse INFO field TEST. Exception: For input string: "infinity"
  at io.projectglow.common.HasStringency.raiseValidationError(HasStringency.scala:25)
  at io.projectglow.common.HasStringency.raiseValidationError$(HasStringency.scala:23)
...
Caused by: java.lang.NumberFormatException: For input string: "infinity"
  at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
  at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
  at java.base/java.lang.Double.parseDouble(Double.java:543)
  at scala.collection.immutable.StringLike.toDouble(StringLike.scala:321)
  at scala.collection.immutable.StringLike.toDouble$(StringLike.scala:321)
  at scala.collection.immutable.StringOps.toDouble(StringOps.scala:33)
  at io.projectglow.vcf.LineCtx.parseDouble(VCFLineToInternalRowConverter.scala:407)

Sample input 2 (genotype field failure):

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=1,Type=Float,Description="Genotype posterior probabilities in the range 0 to 1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=TEST,Number=1,Type=Float,Description="test">
##contig=<ID=chr1,length=248956422>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    10230   .       AC      A       20.99   PASS    AC=1;TEST=inf   GT:GP   0/1:-nan

Code (for some reason, flattenInfoFields must be set to false for this to fail):

import io.projectglow.Glow
val sess = Glow.register(spark)
val df = sess.read.format("vcf").option("flattenInfoFields", "false").option("validationStringency", "strict").load(<path to file contents above>)
df.show()

errors with

Caused by: java.lang.IllegalArgumentException: Could not parse FORMAT field GP. Exception: For input string: "-nan"
  at io.projectglow.common.HasStringency.raiseValidationError(HasStringency.scala:25)
  at io.projectglow.common.HasStringency.raiseValidationError$(HasStringency.scala:23)
...
Caused by: java.lang.NumberFormatException: For input string: "-nan"
  at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
  at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
  at java.base/java.lang.Double.parseDouble(Double.java:543)
  at scala.collection.immutable.StringLike.toDouble(StringLike.scala:321)
  at scala.collection.immutable.StringLike.toDouble$(StringLike.scala:321)
  at scala.collection.immutable.StringOps.toDouble(StringOps.scala:33)
  at io.projectglow.vcf.VariantContextToInternalRowConverter.$anonfun$updateFormatField$4(VariantContextToInternalRowConverter.scala:447)