projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

VCF files with spaces in the file name cannot be read #474

Closed arunbhat closed 5 months ago

arunbhat commented 2 years ago

This issue is similar to the issues reported in SPARK-21996 and SPARK-23148. Would this line need to be modified to val hPath = new Path(new URI(path))? Other places probably need the same change too; see the changes for the CSV and JSON datasources in this PR.

The simple code below fails. Note that reading a CSV file from the same path works.

path="file:///mnt/project/vcfs/v c f/example vcf/PRJEB20654 3 samples 1 partitions 1615 loci.ann.vcf"
df = spark.read.format("vcf").load(path)
df.show()

Py4JJavaError: An error occurred while calling o47.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, ip-172-31-173-125.ec2.internal, executor 0): java.io.FileNotFoundException: File file:/mnt/project/vcfs/v%20c%20f/example%20vcf/PRJEB20654%203%20samples%201%20partitions%201615%20loci.ann.vcf does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
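
For anyone comparing the two paths: the path in the FileNotFoundException looks like the percent-encoded form of the path passed to load(). A minimal sketch using only the Python standard library to confirm that (this only compares the strings; it is an inference from the error message and says nothing definitive about where Glow or Spark does the encoding):

```python
from urllib.parse import quote, unquote

# Path portion of the URI passed to spark.read.format("vcf").load(...)
original = "/mnt/project/vcfs/v c f/example vcf/PRJEB20654 3 samples 1 partitions 1615 loci.ann.vcf"

# Path portion reported in the FileNotFoundException
reported = (
    "/mnt/project/vcfs/v%20c%20f/example%20vcf/"
    "PRJEB20654%203%20samples%201%20partitions%201615%20loci.ann.vcf"
)

# quote() percent-encodes spaces as %20 and leaves "/" alone by default
encoded = quote(original)

print(encoded == reported)            # the error shows the encoded form
print(unquote(reported) == original)  # decoding it recovers the real file name
```

If both comparisons hold, the reader is presumably looking up the still-encoded string as a literal file name, which would fit the new Path(new URI(path)) suggestion above.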

williambrandler commented 2 years ago

Hey @arunbhat, ah right, yes, we assume paths contain no special characters such as whitespace.

Let me quickly try your suggested change and see what happens with the CircleCI checks: https://github.com/projectglow/glow/pull/475

Yes, I expect other places in the codebase will need adjusting, and unit tests will have to be written.

For now, is it possible to use underscores instead of spaces in paths?

arunbhat commented 2 years ago

> For now is it possible to use underscores instead of spaces for paths?

Thanks @williambrandler. Unfortunately, the path is not always under our control (our customers use it). And yes, for now we have suggested the workaround you proposed.

williambrandler commented 2 years ago

Good news: the change you suggested doesn't break any of the tests!

However, yes, it does look like there are a bunch of places in the codebase where we will need this fix.

I just wonder whether implementing the fix may cause more harm than good. Paths with spaces will inevitably cause problems when the same files are used with command-line tools on Linux.

Another option is to put a tip in the docs advising against special characters in paths.
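
If we go the docs route, the tip could come with a small pre-flight check that users run before calling load(). A rough sketch, assuming whitespace is the only character class we want to flag; check_vcf_path is a hypothetical helper, not part of Glow:

```python
import re

# Matches any whitespace character (space, tab, newline, ...)
_WHITESPACE = re.compile(r"\s")


def check_vcf_path(path):
    """Hypothetical helper: return the path unchanged, or raise if it contains whitespace."""
    if _WHITESPACE.search(path):
        raise ValueError(
            f"Path contains whitespace, which the VCF reader may not handle: {path!r}"
        )
    return path


# Usage: wrap the path before handing it to the reader, e.g.
# df = spark.read.format("vcf").load(check_vcf_path(path))
```

This fails fast with a readable message instead of a FileNotFoundException from inside a Spark task, but it does not help users who cannot rename their files.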

henrydavidge commented 5 months ago

Resolved