scrapinghub / exporters

Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
BSD 3-Clause "New" or "Revised" License
40 stars 10 forks source link

Stream based file reader #316

Closed dvdbng closed 8 years ago

dvdbng commented 8 years ago

This introduces the concept of a "Stream" based reader, where the underlying storage backend operates in files/bytes (like S3) as opposed to structured objects (like hubstorage).

Having binary/stream based readers allows us to have pluggable decompressors and deserializers, instead of hardcoding them in the readers.

This reuses the same get_read_streams method that the stream bypass uses

Also adds a CSV desserializer.

Improvements planned:

eliasdorneles commented 8 years ago

I like where this is going, looks like a good way of addressing #298 :+1:

eliasdorneles commented 8 years ago

So, I've been testing this and it's working pretty fine, I'm quite happy with it -- awesome work, @Youwotma ! :)

I have only a couple comments, but they're really small things:

1) It's assuming ZLibDecompressor by default, which is good because of backwards compatibility, but nothing is being shown in the log to indicate that. It would be nice to have some message in the logs to help users trying to use it with uncompressed files -- currently one just gets a stacktrace with:

ERROR:export-pipeline: -- EXPORTMANAGER -- Error -3 while decompressing: incorrect header check

Logging No decompressor configured, using default ZLibDecompressor is a good start. Catching that exception and appending a message like (use NoDecompressor if you're using uncompressed input) is even better

2) The same applies for deserializers, which the assumed default is JSON lines (albeit probably a lesser problem, since).