nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Nextflow doesn't allow underscores in GCP bucket name #1527

Closed daudn closed 4 years ago

daudn commented 4 years ago

Bug report

Downloading files from a GCP Storage bucket whose name contains underscores fails. Nextflow throws the error java.net.URISyntaxException: Illegal character in hostname, where the illegal character is the underscore (_).

Steps to reproduce the problem

#!/usr/bin/env nextflow
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem

Path path = CloudStorageFileSystem.forBucket('tgs_ext_archive').getPath('file.txt')

String gcsString = "gs://" + path.bucket() + "/data" + path.toAbsolutePath();

tgs_root_chan = Channel.fromPath(gcsString)

process get_and_untar{
    machineType 'g1-small'
    container 'python:3'

    input:
    file mine from tgs_root_chan.collect()

    script:
    """
    ls
    """
}

Program output (immediate)

N E X T F L O W  ~  version 20.01.0
Launching `nextflow/make_untar.nf` [festering_edison] - revision: 268ef619fb
gs://tgs_ext_archive/data/TGSDEV150729*.tar.bz2
[-        ] process > get_and_untar -
java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/

Logs

  Version: 20.01.0 build 5264
  Created: 12-02-2020 10:14 UTC
  System: Linux 4.15.0-1055-gcp
  Runtime: Groovy 2.5.8 on OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08
  Encoding: UTF-8 (UTF-8)
  Process: 27985@tgs-controller [10.154.0.56]
  CPUs: 2 - Mem: 7.3 GB (6.2 GB) - Swap: 0 (0)
Mar-12 09:51:14.298 [main] DEBUG nextflow.Session - Work-dir: gs://nextflow-text-bucket/ [ext2/ext3]
Mar-12 09:51:14.299 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/daudn/tgs_workflow/nextflow/bin
Mar-12 09:51:14.388 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Mar-12 09:51:14.391 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Mar-12 09:51:14.588 [main] DEBUG nextflow.Session - Session start invoked
Mar-12 09:51:14.830 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Mar-12 09:51:14.887 [main] DEBUG nextflow.Session - Workflow process names [dsl1]: get_and_untar
Mar-12 09:51:14.945 [PathVisitor-1] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /data/; pattern: TGSDEV150729*.tar.bz2; options: [:]
Mar-12 09:51:15.315 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: google-lifesciences
Mar-12 09:51:15.316 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'google-lifesciences'
Mar-12 09:51:15.328 [main] DEBUG nextflow.executor.Executor - [warm up] executor > google-lifesciences
Mar-12 09:51:15.350 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'google-lifesciences' > capacity: 1000; pollInterval: 10s; dumpInterval: 5m
Mar-12 09:51:15.404 [main] DEBUG n.c.g.l.GoogleLifeSciencesExecutor - Google Life Science config=GoogleLifeSciencesConfig(project:bioinformatics-playground, zones:[], regions:[europe-west2], preemptible:false, remoteBinDir:null, location:europe-west2, disableBinDir:false, bootDiskSize:20 GB, sshDaemon:false, sshImage:gcr.io/cloud-genomics-pipelines/tools, debugMode:null, copyImage:google/cloud-sdk:alpine)
Mar-12 09:51:15.952 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > get_and_untar -- maxForks: 2; blocking: false
Mar-12 09:51:16.057 [main] DEBUG nextflow.script.ScriptRunner - > Await termination
Mar-12 09:51:16.057 [main] DEBUG nextflow.Session - Session await
Mar-12 09:51:16.312 [PathVisitor-1] ERROR nextflow.Channel - java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
java.lang.AssertionError: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
        at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:356)
        at com.google.cloud.storage.contrib.nio.CloudStoragePseudoDirectoryAttributes.<init>(CloudStoragePseudoDirectoryAttributes.java:31)
        at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.readAttributes(CloudStorageFileSystemProvider.java:831)
        at java.nio.file.Files.readAttributes(Files.java:1737)
        at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
        at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
        at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
        at java.nio.file.Files.walkFileTree(Files.java:2662)
        at nextflow.file.FileHelper.visitFiles(FileHelper.groovy:742)
        at nextflow.file.PathVisitor.pathImpl(PathVisitor.groovy:162)
        at nextflow.file.PathVisitor.applyGlobPattern0(PathVisitor.groovy:130)
        at nextflow.file.PathVisitor.apply(PathVisitor.groovy:68)
        at nextflow.file.PathVisitor$_applyAsync_closure1.doCall(PathVisitor.groovy:77)
        at nextflow.file.PathVisitor$_applyAsync_closure1.call(PathVisitor.groovy)
        at groovy.lang.Closure.run(Closure.java:486)
        at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
        at java.net.URI$Parser.fail(URI.java:2848)
        at java.net.URI$Parser.parseHostname(URI.java:3387)
        at java.net.URI$Parser.parseServer(URI.java:3236)
        at java.net.URI$Parser.parseAuthority(URI.java:3155)
        at java.net.URI$Parser.parseHierarchical(URI.java:3097)
        at java.net.URI$Parser.parse(URI.java:3053)
        at java.net.URI.<init>(URI.java:673)
        at java.net.URI.<init>(URI.java:774)
        at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:354)
        ... 20 common frames omitted
Mar-12 09:51:16.361 [main] DEBUG nextflow.Session - Session await > all process finished
Mar-12 09:51:16.418 [PathVisitor-1] DEBUG nextflow.Session - Session aborted -- Cause: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
Mar-12 09:51:16.448 [main] DEBUG nextflow.Session - Session await > all barriers passed
Mar-12 09:51:16.470 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=0; peakCpus=0; peakMemory=0; ]
Mar-12 09:51:16.634 [main] DEBUG nextflow.CacheDB - Closing CacheDB done
Mar-12 09:51:16.661 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Environment

pditommaso commented 4 years ago

Looking at the error stack trace, this looks like a problem with the Google SDK for Google Storage. I don't think much can be done on the NF side.

java.lang.AssertionError: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
        at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:356)
        at com.google.cloud.storage.contrib.nio.CloudStoragePseudoDirectoryAttributes.<init>(CloudStoragePseudoDirectoryAttributes.java:31)
        at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.readAttributes(CloudStorageFileSystemProvider.java:831)
:

pditommaso commented 4 years ago

I asked the Google folks for a comment; this is their reply:

Ah, this looks like an instance of Java's long-time dislike of underscore characters in hostnames (technically a violation of an RFC).

Some related reading:
https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8180809
https://stackoverflow.com/questions/28568188/java-net-uri-get-host-with-underscores

Also:

FWIW, this is why we recommend against using underscores in bucket names. See: https://cloud.google.com/storage/docs/naming#requirements ("Also, for DNS compliance and future compatibility, you should not use underscores (_)")

The code is here, where it tries to create a new URI instance with a gs:// URI: https://github.com/googleapis/java-storage-nio/blob/master/google-cloud-nio/src/main/java/com/google/cloud/storage/contrib/nio/CloudStoragePath.java#L354
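A minimal sketch of the behaviour described above, using only java.net.URI. It assumes the library builds the URI through the scheme/host/path constructor, as the URI.<init> frames in the stack trace suggest; the class and method names are hypothetical and exist only for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriUnderscoreDemo {

    // Returns true if java.net.URI accepts the bucket name as a host
    // when the URI is built from separate scheme/host/path components.
    static boolean acceptedAsHost(String bucket) {
        try {
            new URI("gs", bucket, "/data/", null);
            return true;
        } catch (URISyntaxException e) {
            // e.g. "Illegal character in hostname at index 8: gs://tgs_ext_archive/data/"
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(acceptedAsHost("tgs-ext-archive")); // true
        System.out.println(acceptedAsHost("tgs_ext_archive")); // false

        // The single-string constructor is more lenient: it does not throw,
        // but it treats the authority as registry-based and leaves the host unset.
        URI lenient = new URI("gs://tgs_ext_archive/data/");
        System.out.println(lenient.getHost());      // null
        System.out.println(lenient.getAuthority()); // tgs_ext_archive
    }
}
```

The component-based constructor requires a server-based authority, which is why it fails outright instead of silently returning a null host.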

It's worth noting that AWS S3 also does not allow _ in bucket names, quite likely for DNS compliance as well.
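For anyone who wants to catch this before launching a pipeline, here is a rough sketch of a pre-flight check on the bucket name. The helper and its pattern are hypothetical and deliberately simplified: they accept only lowercase letters, digits and dashes, 3 to 63 characters, and ignore dotted bucket names entirely, so this is not a full implementation of the GCS naming rules.

```java
import java.util.regex.Pattern;

final class BucketNameCheck {

    // Simplified assumption: lowercase letters, digits and dashes only,
    // starting and ending with a letter or digit, 3 to 63 characters long.
    private static final Pattern DNS_FRIENDLY =
            Pattern.compile("^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$");

    // Returns true if the bucket name avoids characters (such as '_')
    // that are not valid in a DNS hostname label.
    static boolean isDnsFriendly(String bucket) {
        return bucket != null && DNS_FRIENDLY.matcher(bucket).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDnsFriendly("tgs-ext-archive")); // true
        System.out.println(isDnsFriendly("tgs_ext_archive")); // false
    }
}
```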

Closing this as a known issue.

frankyn commented 4 years ago

@pditommaso I was able to use the library with an underscored GCS bucket name in the following example:

CloudStorageFileSystem fs = CloudStorageFileSystem.forBucket('anima_frank');
// testfile contains the following:
//               id,other
//               hello,world

tgs_root_chan = Channel.fromPath(fs.getPath('testfile'))
tgs_root_chan.splitCsv(header:true)
    .map{ row-> tuple(row.id, row.other) }
    .println{ it }

pditommaso commented 4 years ago

Hi @frankyn, thanks for commenting on this. I think for a specific file object path, i.e. one without a wildcard, that isn't even needed: Channel.fromPath('gs://anima_frank/some/file') should work.

There's even a test for this: https://github.com/nextflow-io/nextflow/blob/a5f1671c040d5a65c939a37931398733f9e1aaf9/modules/nf-google/src/test/nextflow/file/FileHelperGsTest.groovy#L53

The problem arises when using a wildcard, e.g. gs://anima_frank/some/file*. In that case the Java FileTreeWalker API will try to resolve the bucket as a host name, resulting in this error:

Mar-12 09:51:16.312 [PathVisitor-1] ERROR nextflow.Channel - java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
java.lang.AssertionError: java.net.URISyntaxException: Illegal character in hostname at index 8: gs://tgs_ext_archive/data/
        at com.google.cloud.storage.contrib.nio.CloudStoragePath.toUri(CloudStoragePath.java:356)
        at com.google.cloud.storage.contrib.nio.CloudStoragePseudoDirectoryAttributes.<init>(CloudStoragePseudoDirectoryAttributes.java:31)
        at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.readAttributes(CloudStorageFileSystemProvider.java:831)
        at java.nio.file.Files.readAttributes(Files.java:1737)
        at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
        at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
        at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
        at java.nio.file.Files.walkFileTree(Files.java:2662)