queryverse / CSVFiles.jl

FileIO.jl integration for CSV files

Support compressed files #33

Open davidanthoff opened 6 years ago

davidanthoff commented 6 years ago

Here is one approach; it would be nice to support this in some easier way. It's not super clear how that would interact with the FileIO story, though...

xgdgsc commented 5 years ago

This is really needed for data analysis.

alejandromerchan commented 5 years ago

This is something sorely needed, but it looks like the work needed on TranscodingStreams.jl isn't happening anytime soon, unfortunately. I eventually found some workarounds for some zipped CSV files I needed, but this would make things easier, no doubt.

davidanthoff commented 5 years ago

It turns out that we had support for this for a long time, I just wasn't aware of it :) Try this:

using FileIO, DataFrames  # DataFrames provides the DataFrame sink; CSVFiles must be installed
load(File(format"CSV", "foo.csv.gz")) |> DataFrame

So really all that is needed is some documentation about it.

alejandromerchan commented 5 years ago

This doesn't seem to work for .zip files.

File(format"CSV", filepath) 

works and produces an object

File{DataFormat{:CSV}}

But when passing this object to the load function, it fails.

harryscholes commented 5 years ago

Can we do the opposite and save a DataFrame to a gzipped CSV file? If so, I'll add some docs to close this issue.

davidanthoff commented 5 years ago

Not at the moment. I think there are two levels of support we could add: 1) add support for savestreaming from FileIO.jl, so that it might work with some external compression library; 2) add support in the existing file saving function to compress on save if the filename has a compressed file extension.
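As a rough illustration of option 2, here is a minimal sketch of choosing a compressor from the file extension. It assumes CodecZlib.jl is installed; `open_maybe_compressed` is a hypothetical helper name, not something CSVFiles.jl or FileIO.jl provides, and it only handles the `.gz` case.

```julia
using CodecZlib  # provides GzipCompressorStream and GzipDecompressor

# Hypothetical helper (not part of CSVFiles.jl): open `path` for writing and
# wrap the file handle in a gzip compressor when the name ends in ".gz".
# The function `f` receives the stream it should write to.
function open_maybe_compressed(f, path::AbstractString)
    io = open(path, "w")
    stream = endswith(lowercase(path), ".gz") ? GzipCompressorStream(io) : io
    try
        return f(stream)
    finally
        # Closing the compressor stream finishes the gzip data and also
        # closes the underlying file; for a plain file it just closes it.
        close(stream)
    end
end

# Usage: the caller writes plain CSV text and never touches the compressor.
path = joinpath(mktempdir(), "out.csv.gz")
open_maybe_compressed(path) do io
    write(io, "a,b\n1,2\n")
end

# Read it back by decompressing the raw bytes.
text = String(transcode(GzipDecompressor, read(path)))
```

The caller's writing code stays the same for both cases; only the wrapper decides whether compression happens.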

harryscholes commented 5 years ago

The following works:

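# FileIO is loaded explicitly because it exports Stream
using CSVFiles, DataFrames, FileIO, GZip

df = DataFrame(a = 1:10)

# Write the CSV through a gzip stream
GZip.open("df.gz", "w") do io
    save(Stream(format"CSV", io), df)
end

# Read the compressed file back
load(File(format"CSV", "df.gz"))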

I don't mind taking a look into this and tidying it up. There are quite a few gzip packages out there. Is there a preferred one to use? Maybe GZip because it's in the JuliaIO organization?

davidanthoff commented 5 years ago

Cool! And ah, we already have support for saving to streams, I had forgotten about that!

I guess the question is whether we should add support so that when one uses File(format"CSV", ...) with a .gz file extension, it compresses automatically... I think in some sense we are probably abusing the FileIO design a bit with that, but on the other hand, it would be handy and probably not much harm done?

I think in terms of packages, if https://github.com/bicycle1885/CodecZlib.jl works it would probably be best, because we already use it for the read case in TextParse.jl.
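For comparison, the GZip-based example above could probably be written with CodecZlib along these lines. This is an untested sketch, assuming that saving to a `Stream(format"CSV", io)` accepts any writable `IO` (it did for GZip's stream in the example above) and that CSVFiles, DataFrames, FileIO and CodecZlib are all installed; the file name `df.csv.gz` is just an illustration.

```julia
using CSVFiles, DataFrames, FileIO, CodecZlib

df = DataFrame(a = 1:10)

# Wrap an ordinary file handle in CodecZlib's gzip compressor.
stream = GzipCompressorStream(open("df.csv.gz", "w"))
try
    # Same stream-based save call as in the GZip example.
    save(Stream(format"CSV", stream), df)
finally
    # Finishes the gzip data and closes the underlying file as well.
    close(stream)
end

# Reading back uses the existing .gz support mentioned earlier in the thread.
df2 = load(File(format"CSV", "df.csv.gz")) |> DataFrame
```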

harryscholes commented 5 years ago

I think it would be very handy! I've just implemented a modified version of the _save function that uses CodecZlib's GzipCompressorStream if the .gz file extension is used (#49). This should fit users' mental models of how reading and writing ought to work.

davidanthoff commented 5 years ago

We now have read and write support for gz compressed files. I'm leaving this open, in case we want to add support for more formats.