The classes are called FileHandler, JsonFileHandler, ParquetFileHandler. They're in a separate file called 'parsers.py' now, which also includes the id resolution code.
The base class has been cleaned up a little: some things, like building ids or checking that path and id aren't both set, don't need to be repeated in the subclasses. So the subclasses call super rather than just overriding the base methods.
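To make the super pattern concrete, here is a minimal sketch. It is not the actual 'parsers.py' code: the constructor signature, the `kwargs` attribute, and 'parquet' as a format value are assumptions for illustration, and id building is left out.

```python
class FileHandler:
    """Sketch of a base handler holding the checks every subclass shares."""

    def __init__(self, id=None, path=None, **kwargs):
        # Shared check: path and id must not both be set.
        if id is not None and path is not None:
            raise ValueError("Set 'id' or 'path', not both.")
        self.id = id
        self.path = path
        self.kwargs = kwargs


class JsonFileHandler(FileHandler):
    def __init__(self, id=None, path=None, **kwargs):
        # Only the format-specific part lives here; the rest goes to super.
        kwargs.setdefault("format", "json")
        super().__init__(id=id, path=path, **kwargs)


class ParquetFileHandler(FileHandler):
    def __init__(self, id=None, path=None, **kwargs):
        # 'parquet' as a format value is an assumption.
        kwargs.setdefault("format", "parquet")
        super().__init__(id=id, path=path, **kwargs)
```

The subclasses never repeat the id/path check, so a fix to it in the base class reaches every handler.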
Those classes have 'write' methods. Since writing should really only happen when you don't already have a file, I've added a quiet 'mode' flag to the kwargs in the handlers so you can set one up for writing without having local data. You then write by passing a 'volume' object which actually has data. This is a little messy.
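Roughly, the write flow could look like the sketch below. The `Volume` dataclass, the 'read'/'write' mode values, and the file naming are all assumptions made up for this example; only the idea of a 'mode' flag plus a 'write' method that takes a volume comes from the description above.

```python
import bz2
import json
import os
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical stand-in for a volume object that actually has data."""
    id: str
    data: dict = field(default_factory=dict)


class JsonFileHandler:
    def __init__(self, id, dir, mode="read"):
        self.id = id
        self.mode = mode
        # Assumed naming scheme: <dir>/<id>.json.bz2
        self.path = os.path.join(dir, id + ".json.bz2")
        # In read mode the file must exist; in write mode it need not.
        if mode == "read" and not os.path.exists(self.path):
            raise FileNotFoundError(self.path)

    def write(self, volume):
        if self.mode != "write":
            raise ValueError("Handler was not set up with mode='write'.")
        # Store the volume's data as bz2-compressed JSON.
        with bz2.open(self.path, "wt", encoding="utf-8") as f:
            json.dump(volume.data, f)
        return self.path
```

Usage would be something like `JsonFileHandler("vol1", some_dir, mode="write").write(volume)`, after which a plain read-mode handler for the same id can be constructed.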
There's a yaml file providing global configuration. By default, volumes are fetched over http and then saved to a local non-pairtree structure with bz2 compression for subsequent hits. That's a big change, of course! But the real point is to allow clean abstraction of research code from id storage techniques.
The yaml file will also be used to populate kwargs defaults, although I'm not totally sure how.
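One possible way to populate kwargs defaults from the configuration is sketched below. This is only a guess at the mechanism: a plain dict stands in for the parsed yaml file, the function name `with_defaults` is hypothetical, and 'json' as the default format is an assumption. The http and bz2 defaults mirror the behaviour described above.

```python
# Stand-in for the parsed yaml configuration (assumed keys and values).
CONFIG = {
    "id_resolver": "http",
    "compression": "bz2",
    "format": "json",
}


def with_defaults(config, **kwargs):
    """Return the config defaults overridden by explicitly passed kwargs."""
    merged = dict(config)   # copy, so the global config is never mutated
    merged.update(kwargs)   # explicit arguments win over configured defaults
    return merged
```

With this shape, a call such as `with_defaults(CONFIG, compression="gz")` keeps the configured resolver and format while switching compression for that one call.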
I'd be open to saying that there should be no defaults, but that if someone is hitting http a lot we start sending warnings including a code snippet that dumps them a sensible yml file.
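The warning idea could be sketched like this. The threshold, the class name `HttpHitCounter`, and the contents of the suggested yml (including the cache directory) are all hypothetical; the only part taken from the proposal is warning heavy http users with a snippet they can save as a config file.

```python
import warnings

# Hypothetical yml snippet to suggest; the directory is a placeholder.
SUGGESTED_YML = """\
id_resolver: local
dir: ./feature-cache
compression: bz2
"""


class HttpHitCounter:
    """Counts http fetches and warns once when a threshold is reached."""

    def __init__(self, threshold=20):
        self.threshold = threshold
        self.hits = 0

    def record(self):
        self.hits += 1
        # Warn exactly once, on the hit that reaches the threshold.
        if self.hits == self.threshold:
            warnings.warn(
                "Many volumes fetched over http. Consider saving this "
                "as a yml config file:\n" + SUGGESTED_YML,
                UserWarning,
                stacklevel=2,
            )
```

Calling `record()` on every http fetch keeps the check cheap, and warning only at the threshold avoids flooding the user's output.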
This is a very preliminary pull request, but I didn't want to keep working on a refactor like this without getting any thumbs up.
To document some approaches I'm taking here:
The keys being used in functions are:
format (rather than 'parser'): json
id_resolver: ['pairtree', 'local', 'http', 'https', 'path'].
compression: ['bz2', 'snappy', 'gz'].
dir: the pairtree root or local directory to search in.
endpoint: Only for http queries. Really, this and 'dir' are conceptually the same thing. But I'm not sure how to deal with that.
Those are all handled heavily by kwargs.
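As an illustration of how these keys might be checked when they arrive as kwargs, here is a small sketch. The function name `check_kwargs` is hypothetical, and treating 'https' as an http query for the purposes of 'endpoint' is an assumption; the allowed values are the ones listed above.

```python
ID_RESOLVERS = ("pairtree", "local", "http", "https", "path")
COMPRESSIONS = ("bz2", "snappy", "gz")


def check_kwargs(**kwargs):
    """Validate the handler keys; every key is treated as optional."""
    resolver = kwargs.get("id_resolver")
    if resolver is not None and resolver not in ID_RESOLVERS:
        raise ValueError("Unknown id_resolver: %r" % (resolver,))

    compression = kwargs.get("compression")
    if compression is not None and compression not in COMPRESSIONS:
        raise ValueError("Unknown compression: %r" % (compression,))

    # 'endpoint' only makes sense for http queries (https assumed included).
    if kwargs.get("endpoint") is not None and resolver not in ("http", "https"):
        raise ValueError("'endpoint' only applies to http queries.")

    return kwargs
```

Doing the check in one place means each handler can pass its kwargs through unchanged and still fail early on a typo such as `compression="gzip"`.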