Open davidanthoff opened 6 years ago
This is really needed for data analysis.
This is something sorely needed, but unfortunately it looks like the work needed on TranscodingStreams.jl isn't happening anytime soon. I eventually found some workarounds for some zipped CSV files I needed, but this would no doubt make things easier.
It turns out that we have had support for this for a long time; I just wasn't aware of it :) Try this:
using FileIO, DataFrames
load(File(format"CSV", "foo.csv.gz")) |> DataFrame
So really all that is needed is some documentation about it.
This doesn't seem to work for .zip files.
File(format"CSV", filepath) works and produces an object of type File{DataFormat{:CSV}}, but passing this object to the load function fails.
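Until .zip is handled directly, one possible workaround (not necessarily the one mentioned earlier in this thread) is to pull the CSV entry out of the archive yourself. The sketch below assumes ZipFile.jl is installed; read_csv_from_zip is a hypothetical helper name, not something CSVFiles or FileIO provides.

```julia
using ZipFile  # assumption: ZipFile.jl is installed; it is not a CSVFiles dependency

# Hypothetical helper: return the text of the first entry in a .zip archive
# whose name ends in ".csv".
function read_csv_from_zip(zippath::AbstractString)
    reader = ZipFile.Reader(zippath)
    try
        for entry in reader.files
            if endswith(lowercase(entry.name), ".csv")
                return read(entry, String)
            end
        end
        error("no .csv entry found in $zippath")
    finally
        close(reader)
    end
end

# Build a small archive so the sketch can be run end to end.
zippath = joinpath(mktempdir(), "foo.zip")
w = ZipFile.Writer(zippath)
f = ZipFile.addfile(w, "foo.csv")
write(f, "a,b\n1,2\n")
close(w)

csvtext = read_csv_from_zip(zippath)
```

The extracted text could then probably be handed to load(Stream(format"CSV", IOBuffer(csvtext))), assuming loading from a stream works in your CSVFiles version; I have not checked that against every release.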
Can we do the opposite and save a DataFrame to a gzipped CSV file? If so, I'll add some docs to close this issue.
Not at the moment. I think there are two levels of support we could add:
1) Add support for savestreaming from FileIO.jl, and then it might work with some external compression library
2) Add support to the existing file saving function to compress on save if the filename has a compressed file ending.
The following works:
using CSVFiles, DataFrames, GZip
df = DataFrame(a = 1:10)
GZip.open("df.gz", "w") do io
save(Stream(format"CSV", io), df)
end
load(File(format"CSV", "df.gz"))
I don't mind taking a look into this and tidying it up. There are quite a few gzip packages out there. Is there a preferred one to use? Maybe GZip because it's in the JuliaIO organization?
Cool! And ah, we already have support for saving to streams, I had forgotten about that!
I guess the question is whether we should add support so that when one uses File(format"CSV", ...) with a .gz file extension, it compresses automatically... I think in some sense we are probably abusing the FileIO design a bit with that, but on the other hand, it would be handy and probably not much harm done?
I think in terms of packages, if https://github.com/bicycle1885/CodecZlib.jl works it would probably be best, because we already use it for the read case in TextParse.jl.
I think it would be very handy! I've just implemented a modified version of the _save function that uses CodecZlib's GzipCompressorStream if the .gz file extension is used (#49). This should fit users' mental models of how reading and writing ought to work.
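For readers who want to see the idea in isolation, here is a minimal sketch of extension-based compression on write. It is not the code from #49: with_maybe_gzip is a hypothetical helper name, and the only library calls it relies on are CodecZlib's GzipCompressorStream and GzipDecompressorStream.

```julia
using CodecZlib  # provides GzipCompressorStream / GzipDecompressorStream

# Hypothetical helper (not part of CSVFiles.jl): open `path` for writing and
# hand the caller a stream that gzip-compresses when the name ends in ".gz".
function with_maybe_gzip(f, path::AbstractString)
    io = open(path, "w")
    stream = endswith(path, ".gz") ? GzipCompressorStream(io) : io
    try
        return f(stream)
    finally
        # Closing the compressor stream writes the gzip trailer and also
        # closes the underlying file.
        close(stream)
    end
end

# Write a tiny CSV by hand so the sketch needs nothing beyond CodecZlib.
path = joinpath(mktempdir(), "df.csv.gz")
with_maybe_gzip(path) do io
    write(io, "a\n1\n2\n3\n")
end

# Read it back through the matching decompressor stream.
text = open(path) do io
    read(GzipDecompressorStream(io), String)
end
```

With CSVFiles, the body of the do block could instead be save(Stream(format"CSV", io), df), as in the GZip example earlier in this thread.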
We now have read and write support for gz compressed files. I'm leaving this open, in case we want to add support for more formats.
Here is one approach; it would be nice to support this in some easier way. It's not super clear how that would interact with the FileIO story, though...