Closed — rossjones closed this issue 11 years ago
This would break use of messytables on e.g. App Engine, where it is used for jsondataproxy (at least I think it would). One of the main reasons for having streaming was to support that.
What are your thoughts? What are others' thoughts on the tradeoff?
jsondataproxy downloads the entire file before processing, doesn't it? If it doesn't, it is probably still doing so for xls files, and zip files will be broken. As will anything that is an office doc encoded as CDF (could be .pub, .doc, etc.).
CDF: http://jsonpdataproxy.appspot.com/?url=http://www.hm-treasury.gov.uk/d/hmt_businessplan_qds_april12.xls reports "Workbook is encrypted" (it isn't; it's just CDF). magic.from_buffer doesn't read enough of the header to work out it is actually XLS and instead thinks it is corrupted CDF.
ZIP: http://jsonpdataproxy.appspot.com/?url=http://www.connectingforhealth.nhs.uk/systemsandservices/data/ods/genmedpracs/datafiles/epraccur.zip isn't currently handled (but it could be).
I know that these seem like fixable bugs, but the only way to fix them is to download the entire content (to pass the full file to magic, and to be able to seek into the zip). Just seems that if you're going to download them anyway ...
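To illustrate why a short buffer can't disambiguate these formats: every CDF/OLE2 container (.xls, .doc, .pub, ...) starts with the same eight magic bytes, so a prefix check can only say "this is CDF", not which subtype it is or whether a workbook stream is really encrypted. A minimal stdlib-only sketch (`looks_like_cdf` is a hypothetical helper, not part of messytables or libmagic):

```python
# All CDF/OLE2 containers (xls, doc, pub, ...) share this 8-byte signature.
CDF_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def looks_like_cdf(header: bytes) -> bool:
    """True if `header` starts with the OLE2/CDF magic bytes.

    This is all a short prefix can tell you. Deciding *which* CDF
    subtype a file is (and whether its workbook stream is actually
    encrypted) means seeking into the container's directory, i.e.
    having the whole file on hand.
    """
    return header.startswith(CDF_SIGNATURE)

# An xls file and a doc file look identical at this depth:
assert looks_like_cdf(CDF_SIGNATURE + b"\x00" * 504)
# A zip archive does not:
assert not looks_like_cdf(b"PK\x03\x04")
```

This is the ambiguity magic.from_buffer runs into with a small buffer, and why passing a real file path (so libmagic can read deeper) gives a more accurate answer.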
No, jsondataproxy does not download the whole file (at least for CSV) - it uses the stream approach (note for xls it of course does download the whole file ...)
Allowing a complete download as an option --- or allowing streaming as an option --- would be very helpful. I use messytables to load data into the CKAN datastore, so I need to download the whole file anyway.
Support for streaming was an initial design goal of messytables, and I think it is a huge advantage. If we don't want it any more, I suggest we rewrite large parts of the library anyway; that would mean major simplifications, and messytables would become much simpler. Despite that, I think streaming support is an awesome feature that makes many things much faster. With streaming support, I could make the dataproxy answer in just a few seconds instead of tens of seconds.
You are probably right that many applications do not need streaming support, but some definitely benefit from it.
I think there's a big difference between "should only work with local files" and "should stream"
I don't care about streaming, but I do care about it working with remote files - e.g. things grabbed from a webservice. Now, granted, I often just say:
import io
import requests

data = requests.get("http://target.site/data.csv").content
filehandle = io.BytesIO(data)  # .content is bytes, so BytesIO rather than StringIO
do_some_stuff_with(filehandle)
but I wouldn't want internals to assume that there's always a file on the disk that has a filename and is tolerably cheap to just keep reading all the time.
Problem with that approach is that it means we often end up with big chunks of data in RAM. And whilst it attempts to 'stream' everything, in a lot of cases (non-CSV) it ends up doing exactly that. I guess I'd just prefer that, if it knew it couldn't stream something, there was an alternative to me keeping 20MB xls files in RAM. Doesn't seem a lot for a single user, but for dozens/hundreds ...
It sounds to me like it should either take a filehandle (and then stream, buffering if necessary) or a filename (and then know it can seek without buffering).
Can you inspect a filehandle and find out if it refers to local data? (either already in memory, or already on local disk)
You can check if it is seekable, and BufferedFile is provided for that purpose. The problem is that the BufferedFile seek() implementation is broken, and the way BufferedFile is used means it all just gets read into a StringIO. It would be tempting, for non-seekable FDs, to just download the data to disk and then carry on with the new file FD. Which is sort of what I was after.
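The "spool non-seekable FDs to disk" idea can be sketched with the stdlib alone. `ensure_seekable` is a hypothetical helper (not messytables API): seekable inputs pass through untouched, while anything else (a socket, a pipe) is copied to an anonymous temp file so parsers that need seek() (zip, xls) and libmagic can work without holding the payload in RAM.

```python
import io
import shutil
import tempfile

def ensure_seekable(fileobj):
    """Return a seekable file object for `fileobj`.

    If the input already supports seeking it is returned unchanged;
    otherwise its contents are spooled to a temporary file on disk,
    which is rewound and returned in its place.
    """
    if getattr(fileobj, "seekable", lambda: False)():
        return fileobj
    tmp = tempfile.TemporaryFile()
    shutil.copyfileobj(fileobj, tmp)
    tmp.seek(0)
    return tmp

# A stand-in for a socket: readable but not seekable.
class UnseekableStream(io.RawIOBase):
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def readable(self):
        return True
    def readinto(self, b):
        return self._buf.readinto(b)
    def seekable(self):
        return False

fh = ensure_seekable(UnseekableStream(b"col1,col2\n1,2\n"))
fh.seek(0)          # now legal, even though the source wasn't seekable
print(fh.read())    # b'col1,col2\n1,2\n'
```

The nice property is that an already-open local file (or a BytesIO) costs nothing extra, while a socket costs one disk write instead of an unbounded in-memory buffer.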
I'm a bit clearer on expectations for how it should work now, so I think it's probably safe to close this ticket. My main concern is covered by #58.
Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.
The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls), or of the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).
Downloading the content to temporary storage isn't an onerous task, and if the interface were modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
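A stdlib-only sketch of that "filenames instead of file objects" idea (`fetch_to_tempfile` is a hypothetical name, not anything messytables provides): given a URL, download the body to a named temporary file and hand back the path, so the parser works against a real, seekable file on disk rather than a socket.

```python
import shutil
import tempfile
import urllib.request

def fetch_to_tempfile(url):
    """Download `url` to a named temporary file and return its path.

    The caller gets a real filename it can pass to seek-hungry parsers
    (zip, xls) or to magic.from_file, and is responsible for deleting
    the file when done.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".download")
    with urllib.request.urlopen(url) as resp:
        shutil.copyfileobj(resp, tmp)  # streams in chunks; nothing held in RAM
    tmp.close()
    return tmp.name
```

Usage would be something like `path = fetch_to_tempfile("http://target.site/data.csv")`, after which the existing filename-based code path takes over; disk usage is bounded by one file per request rather than one in-memory buffer per request.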