Proposal: adopt Traject as interface between Marc and Solr

fjorba commented 3 years ago

An important part of what Muscat does is to provide values to Solr so that it can build its indices. In app/controllers/catalog_controller.rb, Muscat extracts the appropriate values for each Marc tag and subfield and sends them to Solr to be indexed.

There is a mature and stable gem that does exactly this job: Traject (https://github.com/traject/traject). Traject knows Marc, the semantics and particularities of each tag and subfield, and sends the appropriate value to Solr to be indexed. Traject is commonly used by the Blacklight community to build the huge catalogs of American university libraries, so it has been proven and polished by transforming millions of records. It provides higher-level macros, customisation at will, Marc21 conversion tables, etc.

Traject can work as a batch reader-and-feeder, reading Marc21 dump records and sending the output directly to Solr, or programmatically (https://github.com/traject/traject/blob/master/doc/programmatic_use.md). So, adopting Traject would allow Muscat to work with Marc fields at a higher (semantic) level, knowing that the lower-level details and exceptions are handled by it.

PS: Jonathan Rochkind, the main Traject author, gave an interesting presentation at this year's Code4lib conference, where he presents some of the ideas behind Traject's stability: https://bibwild.wordpress.com/2021/03/23/code-that-lasts-sustainable-and-usable-open-source-code/

fjorba commented 3 years ago

Really, this proposal had been in my TODO queue for several months, and my plan was to postpone it to some time in the future, but while defending my pull request #1106 I felt that the presence of Traject could help it get adopted.

xhero commented 3 years ago

What advantage would it have over the current indexer in Muscat? Also, the current system regularly indexes millions of records.

fjorba commented 3 years ago

I started to think seriously about using Traject when reading https://github.com/rism-digital/muscat/blob/master/app/controllers/catalog_controller.rb (with its comments about how hackish some of the solutions are) and trying to understand which changes or additions I would need in order to index secondary literature records for my field of use. I have been working with Marc fields and subfields for the last 30 years and know most of them by heart (I'm a librarian myself), but the thought of having to rewrite it all (again!) for yet another piece of software put me off. And while reading, learning, and familiarising myself with Muscat, Ruby, Rails, Solr and Blacklight, I kept coming across Traject again and again.

So, as I understand it, because Traject knows Marc, its default values produce indices that make sense, and they can easily be modified. Using their examples:

    to_field "id", extract_marc("001")

    to_field "title_t", extract_marc("245aps:130")

    # Can limit to certain indicators with || chars.
    # "*" is a wildcard in indicator spec.  So this is
    # 856 with first indicator '0', subfield u.
    to_field "email_addresses", extract_marc("856|0*|u")

    # Can list tag twice with different field combinations
    # to extract separately
    to_field "isbn", extract_marc("245a:245abcde")

    # For MARC Control ('fixed') fields, you can optionally
    # use square brackets to take a byte offset.
    to_field "language_code", extract_marc("008[35-37]")
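
To make the spec notation in these examples more concrete, here is a small gem-free sketch that splits a spec string into tag, indicators, subfields and byte range. It only illustrates the notation and is not Traject's own parser: `Spec` and `parse_spec` are names invented for this sketch, and it handles just the forms quoted above.

```ruby
# Illustrative only: a simplified reader for extract_marc-style spec strings.
# Spec and parse_spec are hypothetical names, not part of Traject.
Spec = Struct.new(:tag, :indicators, :subfields, :bytes)

SPEC_PATTERN = /\A(\d{3})(?:\|([\d*]{2})\|)?(?:\[(\d+)(?:-(\d+))?\])?([a-z0-9]*)\z/

def parse_spec(string)
  # Colons separate independent tag specifications.
  string.split(":").map do |part|
    match = SPEC_PATTERN.match(part)
    raise ArgumentError, "unrecognised spec: #{part}" unless match

    tag, indicators, from, to, subfields = match.captures
    # A bracketed offset such as [35-37] becomes an inclusive byte range.
    bytes = from && (from.to_i..(to || from).to_i)
    Spec.new(tag, indicators, subfields.chars, bytes)
  end
end

parse_spec("245aps:130").each { |spec| p spec }
p parse_spec("856|0*|u").first
p parse_spec("008[35-37]").first
```

A tag with no subfields listed (like 130 here) comes back with an empty subfield list, which the examples above read as "all subfields".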

So, certainly my first goal was selfish, as I wanted to avoid the work of creating generic indices for a domain-specific piece of software (Muscat). But from accumulated experience I know that I will have to handle punctuation details, non-Latin alphabets, and new identifiers that appear now and then (DOI, Orcid, Medline, ...), each with its own particularities (is a DOI case-sensitive?). Marc is Marc, stable and well known, and Traject fills this niche: which subfields of the 100 field really need to be indexed to search for an author (or composer)? Do I have to index 536 $g? Do I have to provide, yet again, a list of languages (no: https://github.com/traject/traject/blob/master/lib/translation_maps/marc_languages.yaml) or even of musical instruments (nope! https://github.com/traject/traject/blob/master/lib/translation_maps/marc_instruments.yaml)?
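
As a rough picture of what a language table like marc_languages.yaml saves, the sketch below cuts the language code out of an 008 field and looks it up. The three-entry hash is only a stand-in for the full YAML file, and `LANGUAGES` and `language_label` are hypothetical names for this illustration, not Traject's API.

```ruby
# Stand-in for a translation map: MARC language code => display label.
# Only three entries here; the real YAML file covers the full code list.
LANGUAGES = { "cat" => "Catalan", "eng" => "English", "ger" => "German" }.freeze

# The language code sits in bytes 35-37 of the 008 control field.
def language_label(field_008)
  code = field_008[35..37]
  LANGUAGES.fetch(code, code) # fall back to the raw code when unmapped
end

field_008 = " " * 40 # blank 40-character 008, for illustration only
field_008[35, 3] = "cat"
puts language_label(field_008)
```

The point is that the table is data someone else already maintains, so a new deployment does not need to retype it.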

Yes, for musical sources Muscat may have the indices it needs, but reading the catalog_controller there are some FIXMEs that need attention. I think that adopting Traject would simplify the catalog_controller and would provide additional fields (such as a sortable version, as in https://github.com/traject/traject/blob/c2d75d94e3bf2b4bb7b6328d94fe834e57947e66/lib/traject/macros/marc21_semantics.rb#L51). For example, I think that switching from author to composer would amount to replacing something like a to_field("author") with a to_field("composer"), as the tags and subfields are the same. (In our DDD, we use a similar trick to create a professor index for some academic collections.)
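
A gem-free way to see why the author/composer switch is cheap: if extraction rules are data keyed by index name, renaming the index does not touch the tags or subfields. Everything in this sketch (the record layout, `extract`, `index_record`) is invented for illustration and is not Traject's API.

```ruby
# Toy record: an array of [tag, {subfield code => value}] pairs.
record = [
  ["100", { "a" => "Bach, Johann Sebastian", "d" => "1685-1750" }],
  ["245", { "a" => "Mass in B minor" }]
]

# Collect the requested subfields of every field with the given tag.
def extract(record, tag, codes)
  record.select { |field_tag, _| field_tag == tag }
        .map { |_, subfields| subfields.values_at(*codes).compact.join(" ") }
end

# Apply a rule table of index name => [tag, subfield codes].
def index_record(record, rules)
  rules.transform_values { |tag, codes| extract(record, tag, codes) }
end

library_rules = { "author"   => ["100", %w[a d]] }
music_rules   = { "composer" => ["100", %w[a d]] } # same tag and subfields

p index_record(record, library_rules)
p index_record(record, music_rules)
```

Both rule tables yield the same values; only the name of the index differs.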

And Traject would allow Muscat to jump onto a comfortable bandwagon, with most of the same Blacklight colleagues already there.

Again, I'm not yet competent enough in Rails to provide a quick demonstration, and we don't need a resolution now. But I wanted to open the issue so that you can read my proposal and think about it. I can create indices for secondary literature by hand (again), but I know that, thinking in the longer term, there are better solutions, and my bet is that Traject is the right tool here.

fjorba commented 2 years ago

I'm closing it as it does not match current Muscat developments.

Proposal: adopt Traject as interface between Marc and Solr #1110