mjakubowski84 / parquet4s

Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster.
https://mjakubowski84.github.io/parquet4s/
MIT License

Writing to S3 on a docker-image #359

Closed ymichels closed 1 month ago

ymichels commented 1 month ago

Hi, I tried using parquet4s to write a Parquet file to S3. Locally it worked as expected and wrote the file to S3. However, when I ran the same code in a Docker image, it produced the following error:

2024-10-13 18:59:27 Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3575)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3598)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:171)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3702)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3653)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:555)
    at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:508)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:319)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:396)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:166)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:147)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.createTmpFileForWrite(S3AFileSystem.java:1538)
    at org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory.create(S3ADataBlocks.java:823)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.createBlockIfNeeded(S3ABlockOutputStream.java:237)
    at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.<init>(S3ABlockOutputStream.java:219)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerCreateFile(S3AFileSystem.java:2065)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$create$5(S3AFileSystem.java:1960)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2707)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2726)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:1959)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1231)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1208)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:82)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:471)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:403)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:395)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:918)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:192)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$BuilderImpl.build(ParquetWriter.scala:170)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$BuilderImpl.build(ParquetWriter.scala:175)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$BuilderImpl.writeAndClose(ParquetWriter.scala:181)
    at Main$.main(Main.scala:29)
    at Main.main(Main.scala)

No matter how I configured the options, this error persists. Please respond; if you need more information, I'll try to provide it.
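For context, a `No FileSystem for scheme "file"` error that appears only in a fat/assembly jar is commonly caused by the `META-INF/services/org.apache.hadoop.fs.FileSystem` registration files being overwritten during jar merging, so Hadoop's ServiceLoader discovery finds nothing. A minimal sketch of the usual workaround, registering the filesystem implementations explicitly on the Hadoop `Configuration` and passing it to parquet4s (the `Data` class and bucket path are hypothetical; this assumes `parquet4s-core` and `hadoop-aws` are on the classpath):

```scala
import com.github.mjakubowski84.parquet4s.{ParquetWriter, Path}
import org.apache.hadoop.conf.Configuration

case class Data(id: Int, text: String)

object Main extends App {
  val conf = new Configuration()
  // Register filesystem implementations explicitly instead of relying on
  // ServiceLoader discovery, which can break inside an assembled jar.
  conf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
  conf.set("fs.s3a.impl", classOf[org.apache.hadoop.fs.s3a.S3AFileSystem].getName)

  ParquetWriter
    .of[Data]
    .options(ParquetWriter.Options(hadoopConf = conf))
    .writeAndClose(Path("s3a://my-bucket/data.parquet"), Seq(Data(1, "a")))
}
```

Alternatively, if the jar is built with sbt-assembly, configuring the merge strategy to concatenate (rather than pick-first) the `META-INF/services` files preserves the registrations without touching the code.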

mjakubowski84 commented 1 month ago

Please report user issues related to connecting to Hadoop and related storage providers to the corresponding support pages. Your issue is not related to Parquet4s but to the Hadoop connector, or to how you use it; I do not provide support for those.