Float (32-bit IEEE-754, formatted to match one of the regular expressions ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ or ^[-+]?(INF|INFINITY|NAN)$ case insensitively)
The footnote at the bottom of page 5 explicitly calls this out:
"Note Java’s Double.valueOf is particular about capitalisation, so additional code is needed to parse all VCF infinite/NaN values."
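To make the capitalisation point concrete, here is a small sketch that checks which spellings the JVM parser accepts. The object and method names (`JavaParseCheck`, `accepted`) are made up for illustration and are not part of Glow. It relies on Scala's `String.toDouble`, which delegates to `java.lang.Double.parseDouble`.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not part of Glow.
object JavaParseCheck {
  // True if the JVM's built-in parser accepts the string as a double.
  def accepted(s: String): Boolean = Try(s.toDouble).isSuccess

  def main(args: Array[String]): Unit = {
    // Spellings the VCF regex allows for infinite/NaN values.
    val spellings = Seq("Infinity", "NaN", "infinity", "inf", "INF", "nan", "-nan")
    spellings.foreach { s =>
      println(s"$s accepted by toDouble: ${accepted(s)}")
    }
  }
}
```

Only the exact spellings `Infinity` and `NaN` (with an optional sign) are expected to pass. The other forms are valid VCF but raise `NumberFormatException`.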
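Glow appears to parse Float values only with Scala's toDouble (which calls Java's Double.parseDouble, as the stack traces below show), so the infinity/NaN spellings that VCF allows in both INFO and genotype (FORMAT) fields are not all parsed correctly.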
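One possible fix, as a minimal sketch only: normalise the special values case-insensitively before falling back to the standard parser. The names `VcfFloat` and `parseVcfDouble` are hypothetical and do not exist in Glow. The sketch assumes that a signed NaN such as "-nan" should simply map to NaN.

```scala
import java.util.Locale

// Hypothetical helper for illustration only; not Glow's actual implementation.
object VcfFloat {
  // Parses a VCF Float, accepting INF / INFINITY / NAN in any capitalisation
  // with an optional leading sign, as the VCF regular expressions allow.
  def parseVcfDouble(s: String): Double = {
    // Split off an optional leading sign so the keyword can be matched alone.
    val (sign, body) = s.headOption match {
      case Some('-') => (-1.0, s.substring(1))
      case Some('+') => (1.0, s.substring(1))
      case _         => (1.0, s)
    }
    body.toLowerCase(Locale.ROOT) match {
      case "inf" | "infinity" => sign * Double.PositiveInfinity
      case "nan"              => Double.NaN // assumption: the sign on NaN is ignored
      case _                  => s.toDouble // ordinary numbers use the standard parser
    }
  }
}
```

With this helper, `VcfFloat.parseVcfDouble("infinity")` and `VcfFloat.parseVcfDouble("-nan")` would return positive infinity and NaN respectively instead of throwing. Note that the fallback to `toDouble` is slightly more permissive than the VCF regular expression, so strict validation would need an extra check.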
Sample input 1 (INFO field failure):
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype posterior probabilities in the range 0 to 1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=TEST,Number=1,Type=Float,Description="test">
##contig=<ID=chr1,length=248956422>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 10230 . AC A 20.99 PASS AC=1;TEST=infinity GT:GP 0/1:NaN
Code:
import io.projectglow.Glow
val sess = Glow.register(spark)
val df = sess.read.format("vcf").option("flattenInfoFields", "true").option("validationStringency", "strict").load(<path to file contents above>)
df.show()
This fails with:
Caused by: java.lang.IllegalArgumentException: Could not parse INFO field TEST. Exception: For input string: "infinity"
at io.projectglow.common.HasStringency.raiseValidationError(HasStringency.scala:25)
at io.projectglow.common.HasStringency.raiseValidationError$(HasStringency.scala:23)
...
Caused by: java.lang.NumberFormatException: For input string: "infinity"
at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.base/java.lang.Double.parseDouble(Double.java:543)
at scala.collection.immutable.StringLike.toDouble(StringLike.scala:321)
at scala.collection.immutable.StringLike.toDouble$(StringLike.scala:321)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:33)
at io.projectglow.vcf.LineCtx.parseDouble(VCFLineToInternalRowConverter.scala:407)
Sample input 2 (genotype field failure):
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=1,Type=Float,Description="Genotype posterior probabilities in the range 0 to 1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=TEST,Number=1,Type=Float,Description="test">
##contig=<ID=chr1,length=248956422>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 10230 . AC A 20.99 PASS AC=1;TEST=inf GT:GP 0/1:-nan
Code (for some reason, flattenInfoFields must be set to false for this to fail):
import io.projectglow.Glow
val sess = Glow.register(spark)
val df = sess.read.format("vcf").option("flattenInfoFields", "false").option("validationStringency", "strict").load(<path to file contents above>)
df.show()
This fails with:
Caused by: java.lang.IllegalArgumentException: Could not parse FORMAT field GP. Exception: For input string: "-nan"
at io.projectglow.common.HasStringency.raiseValidationError(HasStringency.scala:25)
at io.projectglow.common.HasStringency.raiseValidationError$(HasStringency.scala:23)
...
Caused by: java.lang.NumberFormatException: For input string: "-nan"
at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.base/java.lang.Double.parseDouble(Double.java:543)
at scala.collection.immutable.StringLike.toDouble(StringLike.scala:321)
at scala.collection.immutable.StringLike.toDouble$(StringLike.scala:321)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:33)
at io.projectglow.vcf.VariantContextToInternalRowConverter.$anonfun$updateFormatField$4(VariantContextToInternalRowConverter.scala:447)
Reference: the Float definition quoted at the top is from the VCF spec (https://samtools.github.io/hts-specs/VCFv4.3.pdf), bottom of page 5.