htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Feedback on new parsing strategy #32

Closed bmschmidt closed 4 years ago

bmschmidt commented 4 years ago

I'm putting some comments on the parquet branch of the massivetexts fork here, to share a little feedback in the appropriate place.

This is great, and I hadn't appreciated the fullness of the refactor. It's a major improvement that will set this up well for the Ithaka stage.

Paths, IDs, and loading strategies.

I want to think about the loading API here, because I think there are a lot of different formats people use and are likely to use. I'm especially thinking about my own teaching, and the desire to free students from lots of code assembling pairtree paths.

I love the abstraction you've created here.

But there's kind of an issue. Godawful ArcGIS conventions aside, '../data/parquet/mdp.39015028036104', isn't really a path. For instance, depending on IDE, tab-completion may get quite confused when trying to locate it. And while this works well in the case that you know precisely the file location you want, I rarely work that way; usually I store everything in a pairtree.

Plus there's an asymmetry: it makes me want to be able to do this method as well.

Parser functions.

It strikes me that there are really four different elements here for both of the parser functions.

  1. There's a path to the file. This shows where the file is located.
  2. There's an ID. In many cases, it can be inferred from the path. In others, maybe not. Currently this is kind of an option, used implicitly to trigger web loading on the jsonVolumeParser class.
  3. There's a strategy for loading.
    1. Find it on the Hathi Trust server using hardcoded URL generation (how does this change when JStor starts offering these?).
    2. Find the exact file on the filesystem. (Or files, using the unimplemented assume_filenames method for parquetVolumeParser. I don't think that method makes a huge amount of sense: see below.)
    3. Follow the filename conventions in parquetVolumeParser.
  4. There's a format (parquet, .json.bz2) that defines the parser, plus maybe some subarguments to that format.
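The four elements above could be sketched as explicit keyword arguments. This is a hypothetical sketch, not the actual htrc-feature-reader API; `load_volume` and the example URL are made up for illustration:

```python
def load_volume(id=None, path=None, strategy="http", format="json"):
    """Hypothetical loader showing the four elements as separate arguments."""
    if id is None and path is None:
        raise ValueError("need at least one of id or path")
    if path is None and strategy == "http":
        # Hardcoded URL generation, a stand-in for the real HT endpoint.
        path = "https://data.example.org/ef/{}.{}".format(id, format)
    return {"id": id, "path": path, "strategy": strategy, "format": format}
```

Separating the four pieces this way makes each of the loading strategies above a different combination of `strategy` and `format`, rather than a different class.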

Proposed changes

Could we think about these as the arguments instead?

Handling default arguments.

My preference would be to enforce named arguments to this function, rather than implicitly allowing either 'path' or 'id' to be first. If not, I think ID should be first. But we could also allow a whole maze of fallback parsing strategies for user ease-of-use. Since the direct-access-by-Volume method is new, I don't think we should worry about breaking current code inside MassiveTexts.

  1. id = None, path = None: error
  2. id = htid, path = None: error unless method is explicitly http. (Could also load from local dir, switch method to http, various other things.)
  3. id = dirname/htid.json.bz2, path = None: Error. Someone's passing a path as a dirname in unnamed args. Could check with regexes or split it out. But I think best just to raise an error.
  4. id = None, path = dirname. Try to load 'dirname' using the json method. Will probably break.
  5. id = htid, path = "/local/htrc-ef-root/": load from /local/htrc-ef-root/htid.json.bz2, or some other strategy determined by the parser.
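The five fallback rules above can be written out as a short dispatch function. This is a sketch of the proposed behavior only; the function name and return values are hypothetical:

```python
import os

def resolve(id=None, path=None, method=None):
    """Hypothetical sketch of the five fallback cases listed above."""
    if id is None and path is None:
        raise ValueError("one of id or path is required")        # case 1
    if id is not None and "/" in id:
        raise ValueError("looks like a path was passed as id")   # case 3
    if id is not None and path is None:
        if method != "http":                                     # case 2
            raise ValueError("a bare id requires method='http'")
        return ("http", id)
    if id is None:                                               # case 4
        return ("json", path)
    # case 5: join the root directory and id per the parser's convention
    return ("local", os.path.join(path, id + ".json.bz2"))
```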

Examples

Some of the following

CamelCase caps.

Since they are classes, shouldn't the parsers be titled JsonVolumeParser instead of jsonVolumeParser, etc? (You may know more than me here, just asking).

organisciak commented 4 years ago

Ben, thank you, I truly appreciate the comments. You've provided a lot to think about, and I may still have questions as I parse the suggestions.

Building on loading strategy

I'd like to build on your suggestion with a proposal. It sounds like you want the convenience of automatically resolved pairtree filenames - what if your logic went into a separate class, rather than adding the assumptions about pairtrees and filenames into Volume/FeatureReader? That could make things easier for you, me, and Jaimie, without affecting beginner or classroom users.

Adding the dynamically loaded HTTP-protocol a few years ago may have confounded things, because it doesn't have a path (and the URI is hardcoded to HT). Still, I like the clarity of pointing directly to a local file, rather than splitting that up into a mixture of id, path, and protocol.

I understand the convenience of an internally-parsed pairtree, so let's definitely figure out an implementation. I do want to stay sympathetic to the syntax for a beginner, where a folder of EF files, not necessarily named by HTID, should be loadable without additional syntax. I'd prefer looking for ways not to complicate the beginner syntax for the benefit of advanced users - ideally those changes would only be visible if you are the target advanced user. It also seems appropriate to keep ID-based naming assumptions and pairtree logic out of the parsing and analysis classes. This is why I like a couple of other approaches. What do you think of the following:

Option A - a separate resolver class that gives you, me, and Jaimie easy automated filename expansion:

ptr = PairTreeResolver(root='path/to/root', extension='.json.bz2') # Extension is .json.bz2 by default
vol1 = Volume(ptr.id2path('htid'))
vol2 = Volume(ptr.id2path('htid2'))
# or
fr = FeatureReader(ptr.ids2paths(['htid', 'htid2']))
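A minimal sketch of what that resolver class might look like. The class and method names follow Option A above but are otherwise hypothetical, and the path scheme here is a deliberately simplified pairtree (fixed 2-character segments, no character sanitization), unlike a real pairtree implementation:

```python
class PairTreeResolver:
    """Toy resolver mapping HTIDs to file paths (Option A sketch)."""

    def __init__(self, root, extension='.json.bz2'):
        self.root = root
        self.extension = extension

    def id2path(self, htid):
        # Simplified pairtree: 'abcd' -> '<root>/ab/cd/abcd.json.bz2'
        segments = [htid[i:i + 2] for i in range(0, len(htid), 2)]
        return "/".join([self.root] + segments + [htid + self.extension])

    def ids2paths(self, htids):
        return [self.id2path(h) for h in htids]
```

The appeal of this design is that Volume and FeatureReader never learn anything about pairtrees; they only ever see plain paths.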

Option B - adding your advanced syntax to the Volume class as a method. If a Volume is initialized without options, the 'advanced' interface is allowed:

Volume().load(id = '...', path = '...', protocol='...')

Option C - adding each hardcoded option separated:

Volume().from_pairtree(id = '...', path = '...')
Volume().from_http(id = '...', path = '...')

Option ?? - wrapping the current approach:

pt = lambda x: 'pairtree/root/' + utils.id2rsync(x)
vol = Volume(pt(htid))

There are still two unresolved inconsistencies: 1) the HTTP interface is still tied to HT. This can be abstracted out if necessary, since 'path=...' takes HTTP or local file paths alike. 2) the parquet format points to a family of files rather than one, which is an ill-fit with the default json approach.

Option D - Doing it all in args a bit more simply:

Another approach that keeps your suggestions while maintaining the base interface I'd like: use pairtree_root= instead of a protocol and custom_filename. This would keep the current "one of id or path" approach, and wouldn't change the meaning of path from what it's been. A tweak that I like treats it as an id_parser (or maybe id_style?):

# HTTP Option
Volume(id='htid', id_parser='http') # Default, which is just DL_URL.format(id)

# Custom Function Option
pt = lambda id: 'pairtree/root/' + utils.id2rsync(id)
Volume(id='htid', id_parser=pt)

# Pairtree option, with global root setting
Volume.PAIRTREE_ROOT = 'pairtree/root' # Do just once
vol = Volume(id='htid', id_parser='pairtree')

# Pairtree option, with additional kwargs
vol = Volume(id='htid', id_parser='pairtree', pairtree_root='pairtree/root')
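The Option D dispatch could look roughly like this inside Volume. Everything here is a sketch under stated assumptions: the URL, the module-level default, and the segment-free pairtree path are placeholders, not the library's real behavior:

```python
DL_URL = "https://data.example.org/ef/{}"  # stand-in, not the real HT URL

PAIRTREE_ROOT = None  # module-level default, mirroring Volume.PAIRTREE_ROOT

def resolve_id(htid, id_parser="http", pairtree_root=None):
    """Hypothetical sketch of how (id, id_parser) could become a path."""
    if callable(id_parser):
        return id_parser(htid)                 # custom function option
    if id_parser == "http":
        return DL_URL.format(htid)             # default
    if id_parser == "pairtree":
        root = pairtree_root or PAIRTREE_ROOT
        if root is None:
            raise ValueError("no pairtree root set")
        # Simplified: a real pairtree would split the id into segments.
        return "{}/{}.json.bz2".format(root, htid)
    raise ValueError("unknown id_parser: {!r}".format(id_parser))
```

Accepting either a string name or a callable for id_parser keeps the common cases to one keyword while leaving an escape hatch for arbitrary naming schemes.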

Option E: Subclassing the parser class. This is the flexibility intended by the parser approach, and some of those rules should live there for special cases.

class PairtreeJsonVolumeParser(JsonVolumeParser):
    ...

Likewise, if you wanted a JsonVolumeParser that drops the file extension from the file path, that can be similarly done.


What do you think makes sense?

bmschmidt commented 4 years ago

Long post to say option D looks good; I'll post a separate response about caching.

  1. Agreed that path and ID need a stable API. I thought that before now users would never call the Volume constructor directly, instead relying on a FeatureReader to manage it, and only subsequently manipulating the Volume object. If that's wrong, consider the request withdrawn.

  2. Just because I want to use this in teaching/workshops, let me explain how I'm conceiving of beginning/advanced. For sure, you, me, and Jaimie can do things however we want, and I don't mind monkeypatching or whatever. But what I imagine a complete beginner doing, based on those I've had in classes, is wanting to import a complete list of IDs and not worry about where they're stored. So in the R package, I just tell users to set a directory location at the beginning of the script and subsequently cache everything there. I message a warning if they try to download without caching, because that's lousy practice. And I try to default everyone into using local pairtree structures. I'm open to the argument that maybe local pairtree structures shouldn't be used by beginners, but I like that they'll never break, and they essentially boil down to using the filesystem as a DB. Having a bunch of files not named by their identifiers--that's what strikes me as something beginners wouldn't do. Likewise, I think you and I differ in that I imagine people will store their parquet files or self-parsed metadata in the root pairtree, and you imagine them using a new pairtree or folder. My thinking is partly b/c I'm hitting inode allocation limits on the number of files in the NYU HPC system, which I don't think is too unusual a situation. Anyhow.

  3. Likewise, I think using the FeatureReader class as a generator is probably going to become less common after this refactor, which is why I would love sensible defaults here.

  4. I think Option D above is great. It addresses all of my usability concerns for the most common cases, with the exception of the path argument not being a true path for parquet. (Let's bracket that for a bit, but the current version needs at least idiot-proofing.) And I think that it probably provides the best resolution of anything for working with JStor-Portico endpoints, which will be a nightmare. Essentially you're saying each call should use either a path (in which case the parser can probably be automatically inferred) or an id and an id_parser (defaulting to http resolution against HTRC servers). Sounds good to me.

But in that case, I wonder whether the parser class should be a little more shielded from end-users... see below.

bmschmidt commented 4 years ago

Cache hit strategies

Mulling over the weekend, I realized that some of what I want really is to write a new Parser class that follows this logic:

  1. First, load a parquet if it exists.
  2. Otherwise, load a json.bz2 or json file if it exists, and create the parquet files for me. (And maybe delete the json.bz2 to save space, if I ask for it.)
  3. If not, pull the file directly from an endpoint.
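The three-step cache-hit chain above can be sketched in a few lines. All names are hypothetical; `fetch_remote` stands in for an HTTP download, and the json-to-parquet conversion is stubbed out:

```python
import os

def load_with_cache(htid, root, fetch_remote, make_parquet=True):
    """Sketch of the chain: parquet first, then json.bz2, then the endpoint."""
    parquet = os.path.join(root, htid + ".parquet")
    jsonbz2 = os.path.join(root, htid + ".json.bz2")
    if os.path.exists(parquet):
        return ("parquet", parquet)          # 1. cached parquet wins
    if os.path.exists(jsonbz2):
        if make_parquet:
            # A real implementation would convert json.bz2 -> parquet here
            # (and optionally delete the json.bz2 to save space).
            pass
        return ("json.bz2", jsonbz2)         # 2. fall back to local json
    return ("http", fetch_remote(htid))      # 3. last resort: the endpoint
```

The other two logics described below are the same shape with the steps reordered, which is part of why they feel like strategies rather than parsers.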

Another logic, aimed at creating a repo for work to be redistributed, might be:

  1. Load from a local directory with 1000 .json.bz2 files in it.
  2. Check a local pairtree of all Hathi files, load, and copy to that local directory.
  3. Fail, because we're pretty sure that local tree is complete and don't want to waste bandwidth calling online.

A logic massiveTexts has developed is:

  1. Load from a local parquet file that includes chunks
  2. Write to it if it doesn't exist.

We're not going to be writing all of these. But I think you should bundle one or two. Opinionatedly, I think that something like the first of these should actually be the default for most new users. And we should think about what the defaults are principally in terms of reproducible research and bandwidth.

Happy to do this, but I wonder if these things are really "Parsers." What they really are is strategies that deploy a number of different parsers in sequence.

Maybe here's a way of thinking about it. Currently, the strategy for reading a parquet file is handled in ParquetParser (or whatever). The logic for writing is handled in the base Volume class. Wouldn't it make more sense for those to be bundled together into a single class called ParquetFileHandler(id_parser='pairtree') (or whatever)?
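A minimal sketch of what bundling read and write into one handler could look like. The class and method names are hypothetical, and an in-memory dict stands in for the family of parquet files on disk:

```python
class ParquetFileHandler:
    """Toy handler that owns both read and write logic for one format."""

    def __init__(self, id_parser="pairtree"):
        self.id_parser = id_parser
        self._store = {}  # stands in for parquet files on disk

    def read(self, htid):
        if htid not in self._store:
            raise FileNotFoundError(htid)
        return self._store[htid]

    def write(self, htid, tokenlist):
        # A real version would write the parquet file family here.
        self._store[htid] = tokenlist
```

With read and write co-located, a caching strategy class could then compose several handlers in sequence without Volume knowing about any format.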

I'm inclined to think that users should maybe specify one of these strategies (and not the underlying parser). But I don't have this line of thought fully worked out yet, sorry.

bmschmidt commented 4 years ago

Sorry, I closed this by accident.

One important question that will help me understand is: how will this API change for JStor feature counts? Is JStor going to be a new parser format, or able to use the existing ones?

organisciak commented 4 years ago

JStor will have the same format as the new EF. Rather than having a lot of if/else version logic, the current format support would be preserved as a legacy parser, and the default json parser will be updated to support the new HTRC and Portico work.

organisciak commented 4 years ago

Two more issues:

I'll close this issue and replace it with set of task-oriented issues.