techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

feather file has issue with compression #294

Status: Closed (behrica closed this issue 2 years ago)

behrica commented 2 years ago

Reading

https://github.com/scicloj/scicloj.ml-tutorials/blob/main/data/tweets_sentiment.feather?raw=true

fails with

Execution error at net.jpountz.lz4.LZ4FrameOutputStream$FLG/validate (LZ4FrameOutputStream.java:362).
Dependent block stream is unsupported (BLOCK_INDEPENDENCE must be set)

I followed the setup instructions for arrow support in TMD.

cnuernber commented 2 years ago

Thanks, will take a look. What compressed this file?

cnuernber commented 2 years ago

Tracking this upstream - https://github.com/lz4/lz4-java/issues/190

cnuernber commented 2 years ago

A temporary (hopefully) solution I am going to try is to use FFI bindings to call into the C library directly. This is going to be a tough one, as the only example of dependent frame compression I can find is the Go library.

@behrica - How did you produce this file?

cnuernber commented 2 years ago

The point of the question is: is this pathway going to be the standard one that everyone uses, or did you produce this file with some magic set of options that very few other people are going to use?

cnuernber commented 2 years ago

You will now also have to include JNA and ensure that liblz4 is installed on your system; how to install it is system-dependent. My recommendation is to avoid dependent block compression with lz4, so if that was a parameter, set it to false.
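For readers who want to see what "dependent block" means at the byte level, here is a minimal sketch of reading the flag that the lz4-java validation in the stack trace rejects. It assumes the standard LZ4 frame layout: a 4-byte magic number followed by an FLG byte whose bit 5 is the block-independence flag. The helper name `parse_lz4_frame_flags` and the sample headers are hypothetical and are not part of TMD or lz4-java. Locating the LZ4 frames inside a Feather file is not shown here.

```python
# Sketch only: decode the FLG byte of an LZ4 frame header.
# Assumes the standard LZ4 frame layout (magic number, then FLG, then BD).

LZ4_FRAME_MAGIC = bytes([0x04, 0x22, 0x4D, 0x18])  # 0x184D2204, little-endian


def parse_lz4_frame_flags(header: bytes) -> dict:
    """Return the flags stored in the FLG byte of an LZ4 frame header.

    Hypothetical helper for illustration; `header` must start at the
    beginning of an LZ4 frame.
    """
    if len(header) < 5:
        raise ValueError("need at least 5 bytes (magic number + FLG byte)")
    if header[:4] != LZ4_FRAME_MAGIC:
        raise ValueError("not an LZ4 frame: magic number mismatch")
    flg = header[4]
    return {
        "version": (flg >> 6) & 0b11,           # bits 7-6
        "block_independence": bool(flg & 0x20),  # bit 5
        "block_checksum": bool(flg & 0x10),      # bit 4
        "content_size": bool(flg & 0x08),        # bit 3
        "content_checksum": bool(flg & 0x04),    # bit 2
        "dict_id": bool(flg & 0x01),             # bit 0
    }


# Made-up sample headers: FLG 0x60 sets the independence bit, FLG 0x40 clears it.
independent = parse_lz4_frame_flags(bytes([0x04, 0x22, 0x4D, 0x18, 0x60, 0x40]))
dependent = parse_lz4_frame_flags(bytes([0x04, 0x22, 0x4D, 0x18, 0x40, 0x40]))
```

A frame whose `block_independence` flag is false is the kind that triggers the "BLOCK_INDEPENDENCE must be set" error reported above.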

behrica commented 2 years ago

> The point of the question is: is this pathway going to be the standard one that everyone uses, or did you produce this file with some magic set of options that very few other people are going to use?

Not that I remember.

I think I created it in the simplest possible way from R:

x <- readr::read_csv(...)
arrow::write_feather(x, ...)

This came up while I was working on the file from #292, and the above was my attempt to get the data into Clojure (via feather ...)

cnuernber commented 2 years ago

That was my fear - then these things will be all over the place.