okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

Zip files cannot be loaded over a socket #58

Open rossjones opened 11 years ago

rossjones commented 11 years ago

Loading a remote zipped file breaks messytables (primarily because it can't seek on the file-like object)

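    import urllib  # Python 2; on Python 3 use urllib.request.urlopen
    from messytables import ZIPTableSet

    # The HTTP response object cannot seek, so ZIPTableSet breaks here
    fh = urllib.urlopen('http://data.gov.uk/data/resource_cache/6d/6d8a0f2d-db23-40ea-8b40-eb20eb75b07f/wastedata-200809.zip')
    table_set = ZIPTableSet(fh)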

It should be possible to wrap the fobj for ZIPTableSet in a seekable stream, but BufferedFile's seek method doesn't take enough arguments (seek takes two, pos and whence=0), so the check for whether to load more data will need to take whence into account.
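To illustrate what taking whence into account could look like, here is a minimal sketch of a buffering wrapper. `SeekableStream` is a hypothetical name, not the existing BufferedFile, and it assumes the wrapped object only offers `read()`. Note that a seek relative to the end forces the whole stream to be read, because the end is unknown until EOF.

```python
import io


class SeekableStream(object):
    """Hypothetical wrapper that buffers a forward-only stream in memory."""

    def __init__(self, fp):
        self._fp = fp              # underlying stream; only read() is used
        self._buf = io.BytesIO()   # everything read from fp so far
        self._pos = 0              # logical position seen by the caller

    def _fill(self, upto=None):
        # Read from the source until `upto` bytes are buffered (None = all).
        self._buf.seek(0, io.SEEK_END)
        while upto is None or self._buf.tell() < upto:
            want = -1 if upto is None else upto - self._buf.tell()
            chunk = self._fp.read(want)
            if not chunk:
                break
            self._buf.write(chunk)

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = pos
        elif whence == io.SEEK_CUR:
            target = self._pos + pos
        elif whence == io.SEEK_END:
            self._fill()                     # the end is unknown until EOF
            target = self._buf.tell() + pos  # _fill leaves the buffer at its end
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if target < 0:
            raise ValueError("negative seek position")
        self._fill(target)
        self._pos = target
        return target

    def read(self, size=-1):
        if size is None or size < 0:
            self._fill()
            size = -1
        else:
            self._fill(self._pos + size)
        self._buf.seek(self._pos)
        data = self._buf.read(size)
        self._pos += len(data)
        return data
```

Since ZIP readers typically start by seeking to the end of the file to find the central directory, a wrapper like this would probably end up buffering the entire download anyway.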

JoshData commented 11 years ago

Here's how I solved that: https://github.com/tauberer/messytables/commit/9873ee63f2a5d034a50068b36faf708edbc5902d

rossjones commented 11 years ago

I think BufferedFile does now have the first parameter and implementing whence is doable, but I think your approach (you're going to download it all anyway, so put it in StringIO) might be better than pretending BufferedFile works. I think I'd still prefer not to keep all that stuff in memory though.
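For reference, the in-memory idea can be sketched roughly as follows. This is not the code from the linked commit; `open_zip_in_memory` is a hypothetical helper, and it uses `io.BytesIO` (the byte-oriented equivalent of StringIO) together with the standard `zipfile` module.

```python
import io
import zipfile


def open_zip_in_memory(fobj):
    """Read all of `fobj` into memory and open it as a ZIP archive."""
    data = fobj.read()  # pulls the entire stream into memory
    return zipfile.ZipFile(io.BytesIO(data))
```

The whole archive is held in memory for as long as the returned object lives, which is the cost being weighed here.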

domoritz commented 11 years ago

@tauberer Could you make a pull request with your fixes?

JoshData commented 11 years ago

I'd rather see implemented the suggestion in #59 to create a way to have messytables cache a file locally on disk, and then the ZIP table can just require that streams be cached locally. Buffering in memory is a kludge since it can easily lead to out of memory issues.
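A rough sketch of what caching on disk might look like, assuming nothing about how #59 would actually be designed. `cache_locally` is a hypothetical helper built only on the standard library.

```python
import shutil
import tempfile


def cache_locally(fobj, chunk_size=64 * 1024):
    """Copy a forward-only stream into an anonymous temporary file on disk."""
    tmp = tempfile.TemporaryFile()  # binary mode, removed when closed
    shutil.copyfileobj(fobj, tmp, chunk_size)  # copies in chunks, not all at once
    tmp.seek(0)  # rewind so the caller reads from the start
    return tmp
```

The returned file object supports seek and tell, so it could be handed to a ZIP reader while only one chunk at a time sits in memory.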

rossjones commented 11 years ago

Totally agree with @tauberer's suggestion. Is this something you'd consider merging?

domoritz commented 11 years ago

The way I see it, streaming only really makes sense if the data is in a tabular text format. We could say that streaming is only supported for the CSV type because that is where it makes most sense. Having a way for each type to declare whether it requires the file to be stored locally would be even better. So, yes, I'd merge @tauberer's suggestion.
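One way the per-type declaration could look, purely as a sketch: the class names and the `requires_local_file` attribute below are hypothetical and are not part of the messytables API.

```python
import shutil
import tempfile


class TableSetSketch(object):
    """Hypothetical base class; not the messytables API."""

    requires_local_file = False  # streaming is acceptable by default

    def __init__(self, fileobj):
        self.fileobj = fileobj

    @classmethod
    def from_stream(cls, fileobj):
        # Types that need random access get a copy on disk first.
        if cls.requires_local_file:
            tmp = tempfile.TemporaryFile()
            shutil.copyfileobj(fileobj, tmp)
            tmp.seek(0)
            fileobj = tmp
        return cls(fileobj)


class CSVSketch(TableSetSketch):
    pass  # text format: can be read straight off the stream


class ZIPSketch(TableSetSketch):
    requires_local_file = True  # ZIP reading needs seek()
```

With this shape, `CSVSketch.from_stream(fh)` would keep the original stream, while `ZIPSketch.from_stream(fh)` would receive a seekable temporary file.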

@tauberer You should have access to this repo so you should be able to create a branch that we can all work on.

JoshData commented 11 years ago

I'd be glad to create a branch, but there are some architectural questions to decide first, and I think those are more appropriate for #59's thread. Also, I'm on vacation now and am not really funded for this sort of work, so I probably can't help much at the moment.

JoshData commented 11 years ago

@rossjones : I was agreeing with your suggestion. If you agree with that we'll have infinite recursion and the universe may implode. :)