sul-dlss-deprecated / rialto-etl

ETL tools for RIALTO, Stanford Libraries' research intelligence project
https://library.stanford.edu/projects/rialto
Apache License 2.0

generate fuller (no enhancements) transform map for CAP organizations #5

Closed cmharlow closed 6 years ago

cmharlow commented 6 years ago

blocked by #1 and by #4

mjgiarlo commented 6 years ago

I'm still at a bit of a loss as to how to get Traject to parse objects of the shape I have (a tree, where every object may have 0 or more children):

{
  "name": "whatever",
  "children": [
    {
      "name": "foo",
      "children": [ ... ]
    }
  ]
}

Unclear to me what seam is provided in Traject that lets me say "descend into children and treat this list as additional records."

Options

  1. Figure this out (assuming there's a good answer in Traject/TrajectPlus already). Pester folks like @cbeer, @billdueber, and @jrochkind for more information.
  2. Pre-process the Stanford orgs JSON to look like something more "Traject-friendly."
  3. Wrap the Traject::Indexer in the StanfordOrganizations transformer with knowledge of this structure, and call Traject once per level of the hierarchy.
  4. Write a custom reader based on TrajectPlus::JsonReader.
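Option 2, for instance, could be sketched as a small pre-processing pass that flattens the tree into standalone records, one per org, ready to be written out as newline-delimited JSON (field names here are placeholders, not RIALTO's actual shape):

```ruby
require 'json'

# Recursively flatten a tree of org hashes into an array of
# standalone records, one per node, with 'children' stripped out.
def flatten_tree(node, acc = [])
  acc << node.reject { |k, _| k == 'children' }
  Array(node['children']).each { |child| flatten_tree(child, acc) }
  acc
end

tree = {
  'name' => 'whatever',
  'children' => [
    { 'name' => 'foo', 'children' => [{ 'name' => 'bar' }] }
  ]
}

# Each flattened record becomes one JSON line, which stock
# newline-delimited JSON readers handle without custom code.
ndjson = flatten_tree(tree).map(&:to_json).join("\n")
puts ndjson
```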

billdueber commented 6 years ago

One option would be to define a reader class that recursively walks through each object. A reader has a simple interface (and probably shouldn't have any interface other than to be an iterator -- I've been thinking along those lines). But for now, you could do something like this.

require 'json'
class MyJsonHierarchyReader

  # @param [#each] input_stream Probably an open file, but any iterator
  #   will do, so long as #each yields one JSON string per record
  def initialize(input_stream, settings)
    # ... whatever you need to do. Let's pretend the input is a
    # newline-delimited JSON file, since you didn't specify anything.
    # (Traject also passes in its settings hash; unused here.)
    @json_lines = input_stream
  end

  def each(&blk)
    @json_lines.each do |jline|
      h = JSON.parse jline
      recursive_yield(h, blk)
    end
  end

  # Do a depth-first crawl
  def recursive_yield(h, blk)
    doc = build_a_doc(h)
    blk.call doc
    Array(h['children']).each do |child|
      recursive_yield(child, blk)
    end
  end

  def build_a_doc(h)
    {"my_id" => h['id'] }
  end

end

I made some sample data at this gist

Of course, if your children need to know something about the parents, you'll have to inject that data as you crawl down the tree, but that shouldn't be hard.
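That parent-injection idea could be sketched as a variant of the recursive walk above, passing the parent's name down each level (the `parent_name` field and the simplified doc shape are illustrative, not RIALTO's actual mapping):

```ruby
require 'json'

# Variant of recursive_yield that carries the parent's name down the
# tree so each emitted doc can record its ancestry. Any parent data
# (ids, full ancestor paths, etc.) could be threaded through the same way.
def recursive_yield(h, blk, parent_name = nil)
  doc = { 'my_id' => h['name'], 'parent_name' => parent_name }
  blk.call doc
  Array(h['children']).each do |child|
    recursive_yield(child, blk, h['name'])
  end
end

tree = JSON.parse('{"name":"whatever","children":[{"name":"foo"}]}')
docs = []
recursive_yield(tree, ->(d) { docs << d })
puts docs.inspect
```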

If you wanna post a real record or two, it might help.

billdueber commented 6 years ago

And here's a working traject setup.

jrochkind commented 6 years ago

That's your input? And you aren't using it at all with traject yet, but would like to?

Yep, you need a custom reader, and bill's outline seems about right, if that does what you want. If your input files will be at all large, I'd try to parse them in a streaming fashion (can you do that with JSON?), and if you yield them streaming, traject will keep everything streaming and avoid loading everything into memory at once.
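With newline-delimited JSON, streaming falls out naturally, since the file can be read one line at a time and only one record ever needs to be in memory; a sketch (streaming a single giant JSON document would instead need a SAX-style parser such as the yajl-ruby gem):

```ruby
require 'json'
require 'stringio'

# Lazily yield one parsed record per line of newline-delimited JSON.
# IO#each_line reads line by line, so memory use stays constant
# regardless of file size.
def each_record(io)
  return enum_for(:each_record, io) unless block_given?
  io.each_line do |line|
    yield JSON.parse(line)
  end
end

# StringIO stands in for an open file here.
io = StringIO.new(%({"name":"a"}\n{"name":"b"}\n))
each_record(io) { |rec| puts rec['name'] }
```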

If you need to access multiple levels at once... it might get trickier, and also involve a custom writer that knows it'll be writing more than one record (maybe just taking an array of output and delegating each one to an existing writer).
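A sketch of that array-delegating writer idea (class names here are invented for illustration; a real Traject writer receives Traject::Indexer::Context objects and implements #put and #close):

```ruby
# Hypothetical writer that accepts an array of records and hands
# each one to a wrapped inner writer, so upstream code can emit
# several output records per input record.
class ArrayDelegatingWriter
  def initialize(inner_writer)
    @inner = inner_writer
  end

  # Delegate each record in the array to the wrapped writer.
  def put(records)
    Array(records).each { |rec| @inner.put(rec) }
  end

  def close
    @inner.close if @inner.respond_to?(:close)
  end
end

# Trivial inner writer that just collects what it's given.
class CollectingWriter
  attr_reader :records

  def initialize
    @records = []
  end

  def put(rec)
    @records << rec
  end
end

inner = CollectingWriter.new
writer = ArrayDelegatingWriter.new(inner)
writer.put([{ 'id' => 1 }, { 'id' => 2 }])
writer.close
puts inner.records.length
```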

I hadn't known TrajectPlus was a thing, just seeing it now for the first time. Looks like some pretty good stuff. Would it make sense to link to it in Traject README? And/or merge some of it into traject, as optional classes/modules to use to get the behavior?

cbeer commented 6 years ago

I hadn't known TrajectPlus was a thing, just seeing it now for the first time. Looks like some pretty good stuff. Would it make sense to link to it in Traject README? And/or merge some of it into traject, as optional classes/modules to use to get the behavior?

TrajectPlus is relatively new and is a convenient place to experiment with some new (-to-us) patterns, and we'll definitely continue to push the useful ideas upstream (like traject#145). I suspect some of the macros (compose + conditional) are close, and some would still benefit from refinement or generalization.

cmharlow commented 6 years ago

@mjgiarlo i liked option 3 as a 'just get the proof of concept done to test ourselves' move. But with bill's recommendations above, it may be better to go with option 4 (with the JSON Reader extension maybe starting out specific to the CAP output, like a CapJsonReader), until we get a bit further with this and other data sources and can look towards generalization.

cmharlow commented 6 years ago

@billdueber we're not sure we can share the data from the sources yet (see #7), but thanks for the rec - we had spitballed something like this, but it's great to see your input.

jrochkind commented 6 years ago

Awesome, thanks @cbeer that makes sense. It looks like for most of them you don't need to monkey-patch existing traject, they are just new readers/writers, or new macros (which could be provided as options you include or extend into your indexer, rather than on by default), which makes them pretty easy to include in the main distro.

I agree it makes sense to try em out separately until they are proven, but I'd def be interested in incorporating em upstream. In the meantime, maybe a link in the traject README to the experimental/in-development options in TrajectPlus might make sense.

Eventually, if we have solid support for XML and other non-MARC features, I've been thinking for a while of a refactor of traject where the marc-specific stuff is not in the base class, so you don't have it if you don't need it. That would probably require a major-version bump, and some thought about how to still make the command line super convenient to use (perhaps sub-commands like traject marc, traject xml, traject json that turn on different default modules.)

mjgiarlo commented 6 years ago

Wow, thanks all! I wrote up the options above and then about ten minutes later, I wound up going down the road @billdueber suggested, though to start I'm just using TrajectPlus::JsonReader.class_eval { } to test out the approach. This validates the approach, methinks, and I'll plan to break this into a class of its own within Rialto::Etl.

jrochkind commented 6 years ago

It's definitely totally fine to write a special-purpose project-specific reader, nothing wrong with that, it's an intended seam, and should be forwards compatible for a long time.

cmharlow commented 6 years ago

just recording here a convo from slack - @mjgiarlo i really like that approach here for modularity of RIALTO source loaders.

mjgiarlo commented 6 years ago

I've got this working, thanks to y'all. Pushing my branch momentarily. Not closing this issue until we also incorporate the work in #1.

cmharlow commented 6 years ago

i'm going to look now, add stuff from #1, get this wrapped up.

mjgiarlo commented 6 years ago

@jrochkind @billdueber @cmh2166 Here's what I wound up with, btw: https://github.com/sul-dlss-labs/rialto-etl/commit/da12a15b3a2ecf6ff9be71b9637ec3b70cc9ddc6#diff-db585871001355a5ca6af51c1c774eba

jrochkind commented 6 years ago

@mjgiarlo I don't entirely understand what's going on there, but I like that it's not very many lines of code and no monkey-patching or anything, so that suggests it's looking good to me! :)

I don't totally understand how you use traject to write to RDF, which is what seems to be going on here? In general, I'd like to understand y'alls use cases better to see if I can make core traject support them better in a 3.0 release. @mjgiarlo would you have time to chat synchronously/voice some time about your use cases?

cmharlow commented 6 years ago

@mjgiarlo Great. will try to review my work on the fuller MAP and tests from before break. Need to catch up on the RIALTO work status here sometime this weekend.

mjgiarlo commented 6 years ago

Since the blocking issues are closed and I've done some work to update the mapping, I'm going to close this issue. Let's open new issues for specific mapping implementation changes as we go. Thanks, all.