mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

UnknownHostException #223

Closed · tomasz-dudziak closed this issue 3 years ago

tomasz-dudziak commented 3 years ago

I am trying to write Parquet to S3 (FlashBlade) using this library, but I am getting an UnknownHostException: the client tries to connect to my-bucket.my-company-s3-endpoint.com, i.e. it prepends the bucket name to the endpoint hostname (virtual-hosted-style addressing). This is surprising, as I would expect it to use my-company-s3-endpoint.com/my-bucket (path-style addressing) instead. What is wrong, and can the library be configured to use the correct URL? Or could some other dependency in my project be the cause?
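
The failing call is essentially just the following (simplified; Record stands in for my actual case class):

    import com.github.mjakubowski84.parquet4s.ParquetWriter

    case class Record(id: Int)

    ParquetWriter.writeAndClose("s3a://my-bucket/test.parquet", Seq(Record(1)))

It fails with the stack trace below.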

5734 [main] DEBUG com.amazonaws.http.conn.ClientConnectionManagerFactory  - 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy25.connect(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1343)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5445)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5392)
    at com.amazonaws.services.s3.AmazonS3Client.getAcl(AmazonS3Client.java:4051)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketAcl(AmazonS3Client.java:1274)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketAcl(AmazonS3Client.java:1264)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExistV2(AmazonS3Client.java:1402)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExistsV2$2(S3AFileSystem.java:575)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:110)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:286)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExistsV2(S3AFileSystem.java:574)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.doBucketProbing(S3AFileSystem.java:494)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:397)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3414)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:158)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3474)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3442)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:643)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:98)
    at com.github.mjakubowski84.parquet4s.DefaultParquetWriter.<init>(ParquetWriter.scala:139)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.$anonfun$writerFactory$1(ParquetWriter.scala:128)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.writeAndClose(ParquetWriter.scala:114)
    at com.mwam.datahub.util.S3$.testWrite(S3.scala:120)
    at com.mwam.datahub.util.S3$.delayedEndpoint$com$mwam$datahub$util$S3$1(S3.scala:51)
    at com.mwam.datahub.util.S3$delayedInit$body.apply(S3.scala:21)
    at scala.Function0.apply$mcV$sp(Function0.scala:34)
    at scala.Function0.apply$mcV$sp$(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App.$anonfun$main$1$adapted(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:389)
    at scala.App.main(App.scala:76)
    at scala.App.main$(App.scala:74)
    at com.mwam.datahub.util.S3$.main(S3.scala:21)
    at com.mwam.datahub.util.S3.main(S3.scala)
Caused by: java.net.UnknownHostException: my-bucket.my-company-s3-endpoint.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
    at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
    ... 61 more
marcinaylien commented 3 years ago

Connectivity to AWS (or any S3-compatible store) is not handled by Parquet4S; Parquet itself relies on hadoop-client for that. A link to the hadoop-aws documentation is in the README: https://github.com/mjakubowski84/parquet4s#aws-s3.
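
Note that this means hadoop-client and hadoop-aws (which provides the S3AFileSystem seen in your stack trace) must be added to your project as separate dependencies. A typical sbt setup, with versions as placeholders, looks roughly like this:

    libraryDependencies ++= Seq(
      "com.github.mjakubowski84" %% "parquet4s-core" % "<parquet4s-version>",
      "org.apache.hadoop" % "hadoop-client" % "<hadoop-version>",
      "org.apache.hadoop" % "hadoop-aws" % "<hadoop-version>"
    )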

tomasz-dudziak commented 3 years ago

You're right! Sorted it by passing a hadoopConf with the setting below to the writer options; it switches the S3A connector to path-style addressing:

    hadoopConf.set("fs.s3a.path.style.access", "true")
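
For anyone landing here later, a minimal end-to-end sketch (the record type, endpoint, and bucket are placeholders; the writeAndClose call matches the 1.x API visible in the stack trace above):

    import org.apache.hadoop.conf.Configuration
    import com.github.mjakubowski84.parquet4s.ParquetWriter

    case class Record(id: Int, text: String) // placeholder schema

    val hadoopConf = new Configuration()
    // Point the S3A connector at the custom endpoint instead of AWS...
    hadoopConf.set("fs.s3a.endpoint", "https://my-company-s3-endpoint.com")
    // ...and use path-style addressing (endpoint/bucket) rather than
    // virtual-hosted style (bucket.endpoint), which failed DNS resolution here.
    hadoopConf.set("fs.s3a.path.style.access", "true")

    ParquetWriter.writeAndClose(
      "s3a://my-bucket/test.parquet",
      Seq(Record(1, "a"), Record(2, "b")),
      options = ParquetWriter.Options(hadoopConf = hadoopConf)
    )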