Closed by daudn 4 years ago
Looking at the error stack trace, this looks like a problem with the Google SDK for Google Storage. I don't think much can be done on the NF side.
java.lang.AssertionError: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:356)
at com.google.cloud.storage.contrib.nio.CloudStoragePseudoDirectoryAttributes.<init>(CloudStoragePseudoDirectoryAttributes.java:31)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.readAttributes(CloudStorageFileSystemProvider.java:831)
:
I've asked the Google folks for a comment; this is their reply:
Ah, this looks like an instance of Java's long-time dislike of underscore characters in hostnames (technically a violation of an RFC).
Some related reading: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8180809 https://stackoverflow.com/questions/28568188/java-net-uri-get-host-with-underscores
Also:
FWIW, this is why we recommend against using underscores in bucket names. See https://cloud.google.com/storage/docs/naming#requirements: "Also, for DNS compliance and future compatibility, you should not use underscores (_)"
The code is here, where it tries to create a new URI instance with a gs:// URI: https://github.com/googleapis/java-storage-nio/blob/master/google-cloud-nio/src/main/java/com/google/cloud/storage/contrib/nio/CloudStoragePath.java#L354
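To make the failure concrete, here is a minimal sketch using only `java.net.URI` from the JDK. It assumes, based on the stack trace and the linked source, that `CloudStoragePath.toUri` builds its URI from separate scheme, host and path components; the class name `UnderscoreHostDemo` and the helper `tryBuild` are made up for illustration. The multi-argument `URI` constructor insists on a valid server-based hostname, while the single-string constructor is more lenient and simply leaves the host unset.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: not part of Nextflow or the Google NIO library.
class UnderscoreHostDemo {

    // Builds gs://<bucket>/data/ from components, the way a component-based
    // URI constructor call would. Returns the URI string, or the error text.
    static String tryBuild(String bucket) {
        try {
            return new URI("gs", bucket, "/data/", null).toString();
        } catch (URISyntaxException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // A hyphenated bucket name is a valid hostname, so this succeeds.
        System.out.println(tryBuild("tgs-ext-archive"));

        // An underscore is rejected as an illegal hostname character.
        System.out.println(tryBuild("tgs_ext_archive"));

        // The single-string constructor does not throw, but it cannot
        // extract a host: getHost() is null, the name is only the authority.
        URI lenient = new URI("gs://tgs_ext_archive/data/");
        System.out.println("host=" + lenient.getHost());
        System.out.println("authority=" + lenient.getAuthority());
    }
}
```

The second call fails with the same "Illegal character in hostname" message seen in the stack trace above, which suggests the restriction lives in the JDK's URI parsing rather than in Nextflow itself.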
Worth noting that AWS S3 also does not allow _ in bucket names, quite likely for DNS compliance as well.
Closing this as a known issue.
@pditommaso I was able to use the library with an underscored GCS bucket name in the following example:
CloudStorageFileSystem fs = CloudStorageFileSystem.forBucket('anima_frank');
// testfile contains the following:
// id,other
// hello,world
tgs_root_chan = Channel.fromPath(fs.getPath('testfile'))
tgs_root_chan.splitCsv(header:true)
.map{ row-> tuple(row.id, row.other) }
.println{ it }
Hi @frankyn, thanks for commenting on this. I think that for a specific file object path, i.e. without a wildcard, that is not even needed: Channel.fromPath('gs://anime_frank/some/file') should work.
There's even a test for this: https://github.com/nextflow-io/nextflow/blob/a5f1671c040d5a65c939a37931398733f9e1aaf9/modules/nf-google/src/test/nextflow/file/FileHelperGsTest.groovy#L53
The problem arises when using a wildcard, e.g. gs://anime_frank/some/file*. In that case the Java FileTreeWalker API tries to resolve the bucket as a hostname, resulting in this error:
Mar-12 09:51:16.312 [PathVisitor-1] ERROR nextflow.Channel - java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
java.lang.AssertionError: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:356)
at com.google.cloud.storage.contrib.nio.CloudStoragePseudoDirectoryAttributes.<init>(CloudStoragePseudoDirectoryAttributes.java:31)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.readAttributes(CloudStorageFileSystemProvider.java:831)
at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
at java.nio.file.Files.walkFileTree(Files.java:2662)
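The distinction described above (plain object paths work, wildcard paths fail when the bucket name contains an underscore) can be sketched as a small pre-check. This is a hypothetical helper, not something Nextflow or the Google library provides: the class `GsGlobCheck` and all of its methods are invented for illustration, and the glob detection is a rough assumption about which characters trigger a directory walk.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical pre-check; names and logic are assumptions for illustration.
class GsGlobCheck {

    // The bucket is the text between "gs://" and the next "/".
    static String bucketOf(String gsPath) {
        if (!gsPath.startsWith("gs://")) {
            throw new IllegalArgumentException("not a gs:// path: " + gsPath);
        }
        String rest = gsPath.substring("gs://".length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // Rough glob detection: any of * ? [ { in the path.
    static boolean hasGlob(String gsPath) {
        return gsPath.matches(".*[*?\\[{].*");
    }

    // True if java.net.URI accepts the bucket name as a hostname.
    static boolean bucketIsValidHost(String bucket) {
        try {
            new URI("gs", bucket, "/", null);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Wildcard paths trigger a tree walk, which is where the URI is built.
    static boolean likelyToFail(String gsPath) {
        return hasGlob(gsPath) && !bucketIsValidHost(bucketOf(gsPath));
    }

    public static void main(String[] args) {
        System.out.println(likelyToFail("gs://anime_frank/some/file"));
        System.out.println(likelyToFail("gs://anime_frank/some/file*"));
        System.out.println(likelyToFail("gs://anime-frank/some/file*"));
    }
}
```

Under these assumptions only the second path is flagged: it combines a wildcard with a bucket name that `java.net.URI` refuses as a hostname.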
Bug report
Trying to download files from a GCP Storage bucket with underscores in its name doesn't work, and Nextflow throws an error:
java.net.URISyntaxException: Illegal character in hostname
where the illegal character is an underscore (_).
Steps to reproduce the problem
Program output (immediate)
Logs
Environment