nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Endpoint URL error #4811

Open nttg8100 opened 6 months ago

nttg8100 commented 6 months ago

Bug report

Nextflow file main.nf

#!/usr/bin/env nextflow

params.values = Channel.from(1)

process echoValue {
    publishDir "${params.outdir}/echoValue/", mode: 'copy'

    input:
    val value

    output:
    path "*_echoValue.txt"

    script:
    """
    echo "Value: $value" > ${value}_echoValue.txt
    """
}

workflow {
    echoValue(params.values)
}

nextflow.config

aws {
    accessKey = "***"
    secretKey = "***"
    client {
        endpoint = 'https://s3-hcm-r1.s3cloud.vn'
        s3PathStyleAccess = true
    }
}

Command

nextflow run main.nf --outdir s3://project_1

Versions: Nextflow 23.10.1, nf-amazon 2.1.4

Expected behavior and actual behavior

I expected the file to be uploaded to the S3 bucket, just as it was with a MinIO S3 server that I tested successfully (a MinIO image with endpoint = "http://localhost:9000").

Testing with the AWS CLI confirmed that I have permission to put objects. The Nextflow error only showed that the request to the endpoint failed, with no additional information:

com.amazonaws.SdkClientException: Unable to execute HTTP request: s3.hcm.amazonaws.com

I think it fails because the nf-amazon 2.1.4 plugin, or the AWS SDK it depends on, fails to parse the endpoint.
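
To rule Nextflow in or out, it may help to reproduce the same configuration with a minimal standalone client. Below is a hedged sketch using the AWS SDK v1 builder API (the same SDK major version that nf-amazon 2.1.4 uses); the credentials, endpoint, and signing region are placeholders to substitute with real values:

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.Bucket;

public class EndpointCheck {
    public static void main(String[] args) {
        // Placeholder credentials and endpoint -- substitute your own.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESSKEY", "SECRETKEY")))
                // EndpointConfiguration couples the endpoint with a signing
                // region; the builder forbids combining it with withRegion().
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "https://s3-hcm-r1.s3cloud.vn", "us-east-1"))
                // Path-style addressing, matching s3PathStyleAccess = true.
                .withPathStyleAccessEnabled(true)
                .build();
        // If this succeeds but Nextflow's putObject later fails against an
        // amazonaws.com host, the endpoint is being rewritten downstream.
        for (Bucket b : s3.listBuckets()) {
            System.out.println("bucket " + b.getName());
        }
    }
}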

Steps to reproduce the problem

I cannot share the real endpoint URL and its credentials; the configuration follows the pattern shown above.

Program output

N E X T F L O W  ~  version 23.10.1
Launching `tmp/main.nf` [jovial_boltzmann] DSL2 - revision: 7325aaaf63
executor >  local (1)
[98/0715e2] process > echoValue (1) [  0%] 0 of 1
ERROR ~ Error executing process > 'echoValue (1)'

Caused by:
  s3.hcm.amazonaws.com

 -- Check '.nextflow.log' file for details

Additional context

Is there any documentation for recompiling the nf-amazon plugin? I tried to compile Nextflow after modifying the nf-amazon plugin, but the build produced a folder structure quite different from the nf-amazon@2.1.4 that Nextflow downloads. I cloned the Nextflow repo at tag v23.10.1, then ran:

make compile

rjb32 commented 3 months ago

Can be reproduced with a custom S3 endpoint hosted on Scaleway.

aws {
    accessKey = 'ACCESSKEY'
    secretKey = 'SECRETKEY'
    region = 'fr-par'
    client {
        endpoint = 'https://s3.fr-par.scw.cloud'
        protocol = 'https'
        s3PathStyleAccess = true
    }
}

rjb32 commented 3 months ago

Support for custom endpoints is very important for using private S3 implementations in Europe, for example when analyzing data from hospitals that are forbidden from using AWS services for GDPR and regulatory reasons.

rjb32 commented 3 months ago

The issue is that at some point something adds back the "amazonaws.com" suffix instead of using the custom S3 endpoint URI that was provided.
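
One way this can happen in SDK v1 (an assumption here, not confirmed from the nf-amazon code) is region-based endpoint resolution: when a client, or some request path inside the SDK, resolves its endpoint from the configured region instead of an explicit override, S3 hosts are synthesized as s3.<region>.amazonaws.com. A hedged sketch of the two configuration paths (region names and endpoint are illustrative):

import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class EndpointResolution {
    public static void main(String[] args) {
        // Path 1: region only -- the SDK derives the endpoint itself,
        // producing a host of the form s3.<region>.amazonaws.com.
        AmazonS3 regionDerived = AmazonS3ClientBuilder.standard()
                .withRegion("eu-west-3")
                .build();

        // Path 2: explicit endpoint plus signing region -- nothing is
        // derived; the builder even rejects withRegion() combined with
        // withEndpointConfiguration() to avoid exactly this ambiguity.
        AmazonS3 endpointPinned = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "https://s3.fr-par.scw.cloud", "fr-par"))
                .build();
    }
}

If something in nf-amazon or the SDK falls back to the region-derived form for certain requests, it would produce exactly the host seen in the trace below.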

./launch.sh -trace nextflow run ../../hello.nf -work-dir s3://turing/test

The stack trace is as follows:

com.amazonaws.SdkClientException: Unable to execute HTTP request: s3.fr-par.amazonaws.com
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1219)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1165)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5558)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
    at com.amazonaws.services.s3.AmazonS3Client.access$300(AmazonS3Client.java:423)
    at com.amazonaws.services.s3.AmazonS3Client$PutObjectStrategy.invokeServiceCall(AmazonS3Client.java:6639)
    at com.amazonaws.services.s3.AmazonS3Client.uploadObject(AmazonS3Client.java:1892)
    at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1852)
    at nextflow.cloud.aws.nio.S3Client.putObject(S3Client.java:209)
    at nextflow.cloud.aws.nio.S3FileSystemProvider.createDirectory(S3FileSystemProvider.java:492)
    at java.base/java.nio.file.Files.createDirectory(Files.java:700)
    at java.base/java.nio.file.Files.createAndCheckIsDirectory(Files.java:807)
    at java.base/java.nio.file.Files.createDirectories(Files.java:753)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.extension.FilesEx.mkdirs(FilesEx.groovy:493)
    at nextflow.Session.init(Session.groovy:406)
    at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:129)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
    at nextflow.cli.Launcher.run(Launcher.groovy:503)
    at nextflow.cli.Launcher.main(Launcher.groovy:657)
Caused by: java.net.UnknownHostException: s3.fr-par.amazonaws.com
    at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
    at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
    at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy27.connect(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    ... 25 common frames omitted

We can look at what's happening in nextflow.cloud.aws.nio.S3Client.putObject (S3Client.java:209), which sits at the boundary where the Nextflow package hands off to the AWS SDK.

    public PutObjectResult putObject(String bucket, String keyName, InputStream inputStream, ObjectMetadata metadata, List<Tag> tags, String contentType) {
        PutObjectRequest req = new PutObjectRequest(bucket, keyName, inputStream, metadata);
        if( cannedAcl != null ) {
            req.withCannedAcl(cannedAcl);
        }
        if( tags != null && tags.size()>0 ) {
            req.setTagging(new ObjectTagging(tags));
        }
        if( kmsKeyId != null ) {
            req.withSSEAwsKeyManagementParams( new SSEAwsKeyManagementParams(kmsKeyId) );
        }
        if( storageEncryption!=null ) {
            metadata.setSSEAlgorithm(storageEncryption.toString());
        }
        if( contentType!=null ) {
            metadata.setContentType(contentType);
        }
        if( log.isTraceEnabled() ) {
            log.trace("S3 PutObject request {}", req);
        }
        return client.putObject(req);
    }

The exception is raised at the last line, by the call to the AWS SDK's client.putObject(req). I ran a small experiment to determine whether the S3 client configuration is already wrong at this point, by inserting a few direct calls to the S3 SDK:

    public PutObjectResult putObject(String bucket, String keyName, InputStream inputStream, ObjectMetadata metadata, List<Tag> tags, String contentType) {
        PutObjectRequest req = new PutObjectRequest(bucket, keyName, inputStream, metadata);
        // ... same request set-up as above ...

        // Experiment: list the buckets visible to this client just before
        // the failing call.
        for (Bucket b : client.listBuckets()) {
            System.out.println("bucket " + b.getName());
        }

        return client.putObject(req);
    }

We can see that the accessible buckets are in fact correctly listed on standard output:

bucket lucl
bucket martina
bucket maxime
bucket turing

So we can conclude that the AWS S3 client can in fact reach the custom S3 endpoint and get correct answers, at least enough to list buckets. How strange! Let's now try to list objects inside a bucket, at the same point in the code.

for (S3ObjectSummary obj : client.listObjects("turing", "db").getObjectSummaries()) {
    System.out.println("object "+obj.getKey());
}

This time it fails, and we get the exception raised inside the AWS SDK from the listObjects call:

com.amazonaws.SdkClientException: Unable to execute HTTP request: s3.fr-par.amazonaws.com
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1219)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1165)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5558)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:950)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:915)
    at nextflow.cloud.aws.nio.S3Client.putObject(S3Client.java:210)
    at nextflow.cloud.aws.nio.S3FileSystemProvider.createDirectory(S3FileSystemProvider.java:492)
    at java.base/java.nio.file.Files.createDirectory(Files.java:700)
    at java.base/java.nio.file.Files.createAndCheckIsDirectory(Files.java:807)
    at java.base/java.nio.file.Files.createDirectories(Files.java:753)
    at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
    at nextflow.extension.FilesEx.mkdirs(FilesEx.groovy:493)
    at nextflow.Session.init(Session.groovy:406)
    at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:129)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
    at nextflow.cli.Launcher.run(Launcher.groovy:503)
    at nextflow.cli.Launcher.main(Launcher.groovy:657)
Caused by: java.net.UnknownHostException: s3.fr-par.amazonaws.com
    at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
    at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
    at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy27.connect(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    ... 23 common frames omitted

Conclusion

My guess at this point is that this is a bug inside the AWS SDK: the S3 client appears to be configured correctly, with the right endpoint URI, and it works for listing buckets and for any request that does not read or write objects inside a bucket. Some piece of the AWS SDK must be overwriting the endpoint URI for bucket-scoped requests; tellingly, the bad host s3.fr-par.amazonaws.com looks like the default S3 endpoint derived from the configured region fr-par.
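
A quick way to probe this hypothesis, without any network traffic, would be AmazonS3#getUrl, which builds the request URL for a bucket/key locally. A hedged sketch to drop in at the same point in putObject (the bucket and key are illustrative):

// getUrl() computes the URL the client would target for this object,
// without sending a request -- it exposes the resolved endpoint directly.
System.out.println(client.getUrl("turing", "db/some-key"));
// If this prints a host under amazonaws.com rather than s3.fr-par.scw.cloud,
// the endpoint override has already been discarded at the client level;
// if it prints the Scaleway host, the rewrite happens later, per request.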

Why not AWS SDK v2?

Looking at the Gradle dependencies, a question: is there a particular reason why Nextflow still uses AWS SDK 1.12.70, even though v1 is clearly marked as deprecated and AWS SDK v2 has been current for some time?
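
For comparison, here is a hedged sketch of the equivalent client in AWS SDK v2, where the endpoint override is a first-class builder option and arbitrary region identifiers are accepted (credentials and endpoint are placeholders):

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;

public class V2EndpointExample {
    public static void main(String[] args) {
        S3Client s3 = S3Client.builder()
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("ACCESSKEY", "SECRETKEY")))
                // Region.of() accepts non-AWS identifiers such as "fr-par".
                .region(Region.of("fr-par"))
                // endpointOverride applies to every request the client makes.
                .endpointOverride(URI.create("https://s3.fr-par.scw.cloud"))
                // Path-style addressing, as with s3PathStyleAccess in v1.
                .serviceConfiguration(S3Configuration.builder()
                        .pathStyleAccessEnabled(true)
                        .build())
                .build();
        s3.listBuckets().buckets()
                .forEach(b -> System.out.println("bucket " + b.name()));
    }
}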

bentsherman commented 3 months ago

@rjb32 thanks for the triage. SDK v2 is on our roadmap but we just haven't gotten to it yet. It is not a trivial change.

bentsherman commented 3 months ago

See #4741