Closed bmschmidt closed 4 years ago
Ben, thank you, I truly appreciate the comments. You've provided a lot to think about, and I may still have questions as I parse the suggestions.
You're right about class names. I try to follow the Google Style Guide, so it should be CapCase instead of camelCase.
What do you mean by "One thing I could see being useful in certain cases is to add 'cache' arguments to the parsers"? You mean global preferences?
I should note that path and id have been in place for years, since adding the HTTP-loading protocol - with the refactor, I just tried to keep the API consistent. I'm not if that was clear when you suggested reversing their order.
I'd like to build on your suggestion with a proposal. It sounds like you want the convenience of automatically resolved pairtree filenames - what if your logic went into a separate class, rather than adding the assumptions about pairtrees and filenames into Volume/FeatureReader? That could make things easier for you, me, and Jaimie, without affecting beginner or classroom users.
Adding the dynamically loaded HTTP-protocol a few years ago may have confounded things, because it doesn't have a path (and the URI is hardcoded to HT). Still, I like the clarity of pointing directly to a local file, rather than splitting that up into a mixture of id, path, and protocol.
I understand the convenience of an internally-parsed pairtree, so let's definitely figure out an implementation. I do want to stay sympathetic to the syntax for a beginner, where a folder of EF files, not necessarily named by HTID, should be loadable without additional syntax. I'd prefer looking for ways to not complicate the beginner syntax for the benefit of advanced users - ideally those changes could only be visible if you are the target advanced user. It also seems appropriate to avoid ID-based naming assumptions and pairtree logic from the parsing and analysis classes. This is why I like a couple of other approaches. What do you think of the following approaches:
Option A - a separate resolver class that gives you, me, and Jaimie easy automated filename expansion:
ptr = PairTreeResolver(root='path/to/root', extension='.json.bz2') # Extension is .json.bz2 by default
vol1 = Volume(ptr.id2path('htid'))
vol2 = Volume(ptr.id2path('htid2'))
# or
fr = FeatureReader(ptr.ids2paths(['htid', 'htid2'])
Option B - adding your advanced syntax to the Volume class as a method. If a Volume is initialized without options, the 'advanced' interface is allowed:
Volume().load(id = '...', path = '...', protocol='...')
Option C - adding each hardcoded option separated:
Volume().from_pairtree(id = '...', path = '...')
Volume().from_http(id = '...', path = '...')
Option ?? - wrapping the current approach:
pt = lambda x: 'pairtree/root/' + utils.id2rsync(x)
vol = Volume(pt(htid))
There are still two unresolved inconsistencies: 1) the HTTP interface is still tied to HT. This can be abstracted out if necessary, since 'path=...' takes HTTP or local file paths alike. 2) the parquet format points to a family of files rather than one, which is an ill-fit with the default json approach.
Option D - Doing it all in args a bit more simply:
Another approach to keeping your suggestions while maintaining the base interface that I'd like is your suggestion of using pairtree_root=
instead of a protocol
and custom_filename
. This would allow keeping the current "one of id
or path
" approach, and not change the meaning of path from what it's been. A tweak that I like treats it as an id_parser
(or maybe id_style?
):
# HTTP Option
Volume(id='htid', id_parser='http') # Default, which is just DL_URL.format(id)
# Custom Function Option
pt = lambda id: 'pairtree/root/' + utils.id2rsync(xid)
Volume(id='htid', id_parser=pt)
# Pairtree option, with global root setting
Volume.PAIRTREE_ROOT = 'pairtree/root') # Do just once
vol = Volume(id='htid', id_parser='pairtree')
# Pairtree option, with additional kwargs
vol Volume(id='htid', id_parser='pairtree', pairtree_root='pairtree/root')
Option E: Subclassing the parserClass. This is intended flexibility to the parser approach, and some of those rules should exist there for special cases.
class PairtreeJsonVolumeParser(JsonVolumeParser):
...
Likewise, if you wanted a JsonVolumeParser that drops the file extension from the file path, that can be similarly done.
What do you think makes sense?
Long post to say option D looks good; I'll post a separate response about caching.
Agreed that path
and ID
need a stable API. I thought that before now users would never call Volume constructor directly, instead relying on a FeatureReader
to do manage it, and only subsequently manipulating the Volume object. If that's wrong, request removed.
Just because I want to use this in teaching/workshops, let me explain how I'm conceiving beginning/advanced. For sure, you, me, and Jamie can do things however we want, and I don't mind monkeypatching or whatever. But what imagine a complete beginning doing, based on those I've had in classes, is wanting to import a complete list of IDs and not worry about where they're stored. So in the R package, I just tell users to set a directory location at the beginning of the script and subsequently cache everything to there. I message a warning if they try to download without caching, because that's lousy practice. And I try to default everyone into using local pairtree structures. I'm open to the argument that maybe local pairtree structures shouldn't be used by beginniners, but I like that they'll never break, and essentially boil down to using the filesystem as a DB. Having a bunch of files not named by their identifiers--that's what strikes me as something beginners wouldn't do. Likewise, I think you and I differ in that I imagine people will store their parquet files or self-parsed metadata in the root pairtree, and you imagine them using a new pairtree or folder. My thinking is partly b/c I'm hitting inode allocation limits on number of files in the NYU HPC system, which I don't think is too unusual a situation. Anyhow.
Likewise, I think using the FeatureReader class as a generator is probably going to become less common after this refactor, which is why I would love sensible defaults here.
I think Option D above is great. It addresses all of my usability concerns for the most common cases, with the exception of the path
argument not being a truth path for parquet. (Let's bracket that for a bit, but the current version needs at least idiot-proofing.) And I think that it probably provides the best resolution of anything for working with JStor-portico endpoints, which will be a nightmare. Essentially you're saying each call should use either a path (in which case parser can probably be automatically inferred) or an id and and id_parser (defaulting to http resolution to HTRC servers.) Sounds good to me.
But in that case, I wonder whether the parser class should be a little more shielded from end-users... see below.
Mulling over the weekend, I realized that some of what I want really is to write a new Parser class that follows this logic:
Another logic, aimed at creating a repo for work to be redistributed, might be:
A logic massiveTexts has developed is:
We're not going to be writing all of these. But I think you should bundle one or two. Opinionatedly, I think that something like the first of these should actually be the default for most new users. And we should think about what the defaults are principally in terms of reproducible research and bandwidth.
Happy to do this, but I wonder if these things are really "Parsers." What they are, is strategies that deploy a bunch of different parsers in sequence.
Maybe here's a way of thinking of thinking about it. Currently, the strategy for reading a parquet file is handled in ParquetParser (or whatever). The logic for writing is handled in the base Volume class. Wouldn't it make more sense for those to be bundled together into a single class called ParquetFileHandler(id_parser='pairtree') (or whatever)?
I'm inclined to think that users should maybe specify one of these strategies (and not the underlying parser. But I don't have this line of thought fully thought out yet, sorry.
Sorry, I closed this by accident.
One important question that will help me understand is: how will this API change for JStor feature counts? Is JStor going to be a new parser format, or able to use the existing ones?
JStor will have the same format as the new EF. Rather than having a lot of if/else version logic, the current format support would be preserved as a legacy parser, and the default json parser will be updated to support the new HTRC and Portico work.
Two more issues:
I'll close this issue and replace it with set of task-oriented issues.
I'm putting some comments on the parquet branch of the massivetexts branch here to share a little feedback in the appropriate place.
This is great, and I hadn't appreciate the fullness of the refactor. It's a major improvement that will set this up well for the Ithaka stage.
Paths, IDs, and loading strategies.
I want to think about the loading API here, because I think there are a lot of different formats people use and are likely to use. I'm especially thinking about my own teaching, and the desire to free students from lots of code assembling pairtree paths.
I love the abstraction you've created here.
But there's kind of an issue. Godawful ArcGIS conventions aside, '../data/parquet/mdp.39015028036104', isn't really a path. For instance, depending on IDE, tab-completion may get quite confused when trying to locate it. And while this works well in the case that you know precisely the file location you want, I rarely work that way; usually I store everything in a pairtree.
Plus there's an asymmetry: it makes me want to be able to do this method as well.
Parser functions.
It strikes me that there are four different elements here, really, for both the parser functions.
assume_filenames
method for parquetVolumeParser. I don't think that method makes a huge amount of sense: see below.)parquetVolumeParser
.Proposed changes
Could we think about these as the arguments instead?
id
means a HathiTrust or JSTOR generated id. Default None.path
means a local place to look in. Default None. Depending on format, may be a file (.json.bz2
or a directory (../data/parquet
) that encloses the file. (See pairtree below).).protocol
(ugh--better name?) defines how to locate the file given id and path.file
is the default; just follow the path.pairtree
means generate a pairtree location from a given root file on disk. I think I'm more wedded to this than you. But this issue by @JaimieMurdock makes me think it should go here. In this case, path cannot be None. (Or--hear me out--Volume
could have apairtree_root = None
attribute that individual users could modify on the fly to include a pairtree_root attribute that would be used by all further pairtree calls, just like they can currently edit jsonVolumeParser.DL_URL. Setting a pairtree_root might even automatically switch the default protocol to pairtree.)http
means pull from htrc. In this case, path must be none. (Or maybe path could be an http endpoint. But I think that's likely to be complicated.)rsync
could be useful.parser
:json
orparquet
at the moment. One thing I could see being useful in certain cases is to add 'cache' arguments to the parsers that have access to the protocol, path, and id elements here.custom_filename
: All the above assumes the id maps neatly to filename. Maybe people want to do something fancy; if so, this could be the escape hatch to allow it. But TBH, I'd be happy to enforce on people that they have to use preformatted ids.Handling default arguments.
My preference would be to enforced named arguments to this function, rather than implicitly allowing either 'path' or 'id' to be first. If not, I think ID should be first. But we could also allow a whole maze of fallback parsing strategies for user ease-of-use. Since the direct-access-by
Volume
method is new, I don't think we should worry about breaking current code inside MassiveTexts.id = None, path = None
: errorid = htid, path = None
: error unless method is explicitly http. (Could also load from local dir, switch method to http, various other things.)id = dirname/htid.json.bz2, path = None
: Error. Someone's passing a path as a dirname in unnamed args. Could check with regexes or split it out. But I think best just to raise an error.id = None, path = dirname
. Try to load 'dirname' using the json method. Will probably break.id = htid, path = "/local/htrc-ef-root/
: load from/local/htrc-ef-root/htid.json.bz2
, or some other strategy determined by the parser.Examples
Some of the following
CamelCase caps.
Since they are classes, shouldn't the parsers be titled
JsonVolumeParser
instead ofjsonVolumeParser
, etc? (You may know more than me here, just asking).