qubole / streamx

kafka-connect-s3 : Ingest data from Kafka to Object Stores(s3)
Apache License 2.0

Folders as files with $ dollar sign in their name when using s3n #37

Closed: levin81 closed this 7 years ago

levin81 commented 7 years ago

When using the s3n protocol, many zero-byte "folder" files with $ in their names, like "all_$folder$", are generated alongside their respective folders.

Some of the folders themselves aren't even created; only these dollar-named files are, like "+tmp_$folder$", even though this is printed in the logs:

[2017-02-01 17:30:12,308] INFO OutputStream for key 'topics/+tmp/all/year=2017/month=02/day=01/hour=17/0d60b9d1-7dcc-468a-b0c4-682609280877_tmp.parquet' writing to tempfile '/tmp/hadoop-root/s3/output-1823693314448547785.tmp' (org.apache.hadoop.fs.s3native.NativeS3FileSystem)

No +tmp directory is created :-\ Only the 0-byte file.

Is there an elegant way to stop these files from being generated? With s3a this doesn't occur, but s3a seems buggy at the moment (I need to open another issue for that), so I've resorted to s3n.

Thanks

PraveenSeluka commented 7 years ago

I guess there is no easy way to fix this. From this page,

A note about directories. S3 of course has no "native" support for them. The idiom we choose then is: for any directory created by this class, we use an empty object "#{dirpath}_$folder$" as a marker. Further, to interoperate with other S3 tools, we also accept the following:

- an object "#{dirpath}/" denoting a directory marker
- if there exists any objects with the prefix "#{dirpath}/", then the directory is said to exist
- if both a file with the name of a directory and a marker for that directory exists, then the file masks the directory, and the directory is never returned.
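To make the quoted rules concrete, here is a minimal Python sketch that applies them to a plain list of object keys. It is only an illustration of the note above, not the s3n implementation: the function name `is_directory`, the in-memory key list, and the sample key `part-0.parquet` are all hypothetical, and treating the `"#{dirpath}/"` object as a marker for the masking rule is my reading of the text.

```python
FOLDER_SUFFIX = "_$folder$"  # marker suffix named in the quoted note


def is_directory(keys, dirpath):
    """Return True if `dirpath` counts as a directory under the quoted rules.

    `keys` is any iterable of object keys (a hypothetical in-memory listing,
    not a real S3 call).
    """
    keys = set(keys)
    dirpath = dirpath.rstrip("/")
    prefix = dirpath + "/"

    # Either marker form: "<dirpath>_$folder$" or an object named "<dirpath>/".
    has_marker = (dirpath + FOLDER_SUFFIX) in keys or prefix in keys
    # Any object stored under the "<dirpath>/" prefix implies the directory.
    has_children = any(k.startswith(prefix) and k != prefix for k in keys)
    # A plain object with the directory's own name.
    has_file = dirpath in keys

    # Assumption: a same-named file plus a marker masks the directory.
    if has_file and has_marker:
        return False
    return has_marker or has_children


if __name__ == "__main__":
    listing = [
        "topics/+tmp_$folder$",                # marker only, nothing under it yet
        "topics/all_$folder$",
        "topics/all/year=2017/part-0.parquet",  # hypothetical data object
    ]
    for path in ("topics/+tmp", "topics/all", "topics/missing"):
        print(path, is_directory(listing, path))
```

Under this reading, a lone "+tmp_$folder$" object is enough for s3n to treat "+tmp" as an existing directory, even while no object under the "+tmp/" prefix has been uploaded yet.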

levin81 commented 7 years ago

So it's fine that only a "+tmp_$folder$" file appears, with no folder itself?

PraveenSeluka commented 7 years ago

How did you check that there was no +tmp folder? There must be one, actually.

levin81 commented 7 years ago

I looked in the S3 directory. In the beginning there was only the "+tmp_$folder$" file, but after a while (a few hours) the directory was created. Never mind :-)

PraveenSeluka commented 7 years ago

This does not seem to be an issue, so I'm closing it. Reopen if you see it again.