htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

File handler refactor #33

Closed bmschmidt closed 4 years ago

bmschmidt commented 4 years ago

This is a very preliminary pull request, but I didn't want to keep working on a refactor like this without getting a thumbs up first.

To document some approaches I'm taking here:

  1. The keys used as function arguments are:

    • format (rather than 'parser'): json.
    • id_resolver: one of ['pairtree', 'local', 'http', 'https', 'path'].
    • compression: one of ['bz2', 'snappy', 'gz'].
    • dir: the pairtree root or local directory to search in.
    • endpoint: only for http queries. Conceptually, this and 'dir' are the same thing, but I'm not sure how to reconcile them.
  2. Those are all passed around mostly as kwargs.

  3. The classes are called FileHandler, JsonFileHandler, ParquetFileHandler. They're in a separate file called 'parsers.py' now, which also includes the id resolution code.

    • The base class has been a little unstupified; some things, like building ids or checking that path and id aren't both set, don't need to be repeated in the subclasses. So the subclasses call super rather than just overriding.
    • Those classes have 'write' methods. Since writing should really only happen when you don't have a file already, I've added a quiet 'mode' flag to the kwargs in the handlers so you can set up for writing without having local data; and then you write by passing a 'volume' object which actually has data. This is a little messy.
  4. There's a yaml file providing global configuration. By default, it grabs over http, and then saves to a local non-pairtree structure with bz2 compression for subsequent hits. That's a big change, of course! But the real point is to allow clean abstraction of research code from id storage techniques. It will also be used to populate kwargs defaults, although I'm not totally sure how.
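
To make points 1 through 4 concrete, here is a minimal sketch of how the pieces could fit together. It is not the code in this pull request: the `DEFAULTS` dict stands in for whatever the yaml file would provide, the `options` attribute and the `write` signature are assumptions, and a plain dict stands in for the 'volume' object.

```python
# Hypothetical sketch; names and behaviour are assumptions, not the PR's code.
import bz2
import json

# Stand-in for values that the global yaml file might supply.
DEFAULTS = {
    "format": "json",
    "id_resolver": "http",
    "compression": "bz2",
    "dir": None,
    "endpoint": None,
}


class FileHandler:
    """Base class: shared argument checks live here, not in subclasses."""

    def __init__(self, id=None, path=None, mode="read", **kwargs):
        if id is not None and path is not None:
            raise ValueError("Pass either 'id' or 'path', not both")
        # In write mode there may be no local data yet, so skip this check.
        if mode == "read" and id is None and path is None:
            raise ValueError("Reading requires an 'id' or a 'path'")
        unknown = set(kwargs) - set(DEFAULTS)
        if unknown:
            raise TypeError(f"Unknown options: {sorted(unknown)}")
        self.id = id
        self.path = path
        self.mode = mode
        # Explicit kwargs override the configured defaults.
        self.options = {**DEFAULTS, **kwargs}


class JsonFileHandler(FileHandler):
    def __init__(self, **kwargs):
        kwargs.setdefault("format", "json")
        super().__init__(**kwargs)  # reuse the base-class checks

    def write(self, volume, path):
        """Write a volume (here just a dict) as bz2-compressed JSON."""
        if self.options["compression"] != "bz2":
            raise NotImplementedError("This sketch only writes bz2")
        with bz2.open(path, "wt", encoding="utf-8") as f:
            json.dump(volume, f)
```

With this shape, `JsonFileHandler(id="some.id", compression="gz")` keeps every default except compression, and `JsonFileHandler(mode="write")` can be constructed without an id or a path before `write` is called with actual data.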

I'd be open to saying that there should be no defaults, but that if someone is hitting http a lot we start sending warnings including a code snippet that dumps them a sensible yml file.
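
A rough sketch of that warning idea, using only the standard `warnings` module. The `HttpHitCounter` class, the threshold, and the keys and directory in the suggested yml are all hypothetical; they only mirror the option names listed above.

```python
# Hypothetical sketch of "warn after many http hits"; not code from this PR.
import warnings

# Example config to suggest; the directory is a made-up placeholder.
SUGGESTED_CONFIG = """\
id_resolver: local
dir: ~/htrc-features
compression: bz2
format: json
"""


class HttpHitCounter:
    """Counts http fetches and warns once when a threshold is reached."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.hits = 0
        self.warned = False

    def record(self):
        self.hits += 1
        if self.hits >= self.threshold and not self.warned:
            self.warned = True
            warnings.warn(
                f"{self.hits} volumes fetched over http. Consider saving "
                "a config file like this to cache them locally:\n"
                + SUGGESTED_CONFIG,
                UserWarning,
                stacklevel=2,
            )
```

The http resolver would call `record()` once per fetch; the warning fires a single time so that heavy users see the snippet without being flooded.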