Support different read formats

scrapinghub / exporters

Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations

BSD 3-Clause "New" or "Revised" License

Support different read formats #298

Closed bbotella closed 8 years ago

bbotella commented 8 years ago

I think adding support for different formats in file-based readers is a must. That covers both file formats (CSV, XML, ...) and compression formats (zip, gz, tar, ...).

elacuesta commented 8 years ago

If I understand correctly, the current implementation only handles gzipped jsonlines files. What about:

deprecate the FSReader name in favor of something like JsonLinesReader
create JsonReader, CSVReader, XMLReader, etc.

I'm not sure about compression. Should it be detected automatically? Or indicated in options? Like "options": {"compression": "gzip"}
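
Both options could coexist. Below is a minimal sketch, assuming a hypothetical `open_stream` helper and a `compression` option key like the one above (neither is part of the exporters API): an explicit option wins, and otherwise the gzip magic bytes are checked.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_stream(raw, options=None):
    """Return a binary stream over the decompressed contents of `raw`.

    `raw` is the file content as bytes; `options` is a hypothetical
    reader options dict, e.g. {"compression": "gzip"}.
    """
    options = options or {}
    compression = options.get("compression")
    if compression is None:
        # No explicit option: sniff the gzip magic number.
        compression = "gzip" if raw[:2] == GZIP_MAGIC else "none"
    if compression == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    if compression == "none":
        return io.BytesIO(raw)
    raise ValueError("unsupported compression: %r" % compression)
```

Sniffing only covers formats with a recognizable signature, so the explicit option would still be needed as an override.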

@eliasdorneles Thoughts?

eliasdorneles commented 8 years ago

Hm, the thing is, other file-based readers (e.g. S3Reader) are also assuming the input is JSON lines.

I believe a better approach would be to extract out of FSReader and S3Reader the bits that understand the files to be JSON lines into a new abstraction (e.g. JsonLinesImporter, JsonImporter, CSVImporter, ...) and use those in all file-based readers (currently only FSReader and S3Reader, in the future we'd have SftpReader, DropboxReader, etc).

Essentially, the idea would be to do for the readers the same as we did for the writers: the file-based writers support writing to different formats (XML, CSV, JSON) through a formatter.
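
A rough sketch of what that separation might look like, assuming hypothetical decoder functions and a `DECODERS` registry (these names are illustrative and not taken from the exporters codebase). Each decoder only sees a text stream, so it does not care whether the bytes came from the filesystem or S3.

```python
import csv
import io
import json


def decode_jsonlines(stream):
    """Yield one dict per non-empty line of a JSON lines text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def decode_csv(stream):
    """Yield one dict per CSV row, using the first row as the header."""
    for row in csv.DictReader(stream):
        yield dict(row)


# Hypothetical registry: format name -> decoder function.
DECODERS = {"jl": decode_jsonlines, "csv": decode_csv}


def read_items(stream, file_format="jl"):
    """Decode items from a text stream, independent of where it came from."""
    return DECODERS[file_format](stream)
```

For example, `list(read_items(io.StringIO("a,b\n1,2\n"), "csv"))` gives one dict per row, with all values as strings.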

bbotella commented 8 years ago

Yup. Totally agreed with @eliasdorneles. The point is to make "file format" and "file compression" independent of where the file is read from, just like we do with writers at this point. I like the idea of having a FileBasedReader that handles this. In the future, we could even try "automatic format detectors".
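
As a first approximation, an automatic detector could simply go by file name. The sketch below assumes a hypothetical `guess_format` helper and hypothetical extension tables; it is only an illustration of the idea, not something that exists in exporters.

```python
import os

# Hypothetical extension tables; extend as more formats are supported.
FORMATS = {".jl": "jl", ".jsonl": "jl", ".json": "json", ".csv": "csv", ".xml": "xml"}
COMPRESSIONS = {".gz": "gzip", ".zip": "zip"}


def guess_format(filename):
    """Return (file_format, compression) guessed from the file name.

    Either element is None when the extension is not recognized.
    """
    root, ext = os.path.splitext(filename.lower())
    compression = COMPRESSIONS.get(ext)
    if compression is not None:
        # Strip the compression suffix and look at the inner extension.
        root, ext = os.path.splitext(root)
    return FORMATS.get(ext), compression
```

A name-based guess is cheap but easy to fool, so explicit options should probably still take precedence when given.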

eliasdorneles commented 8 years ago

Added in #316