inveniosoftware / dojson

Simple pythonic JSON to JSON converter.
https://dojson.readthedocs.io

RFC: handling of indicators in Marc21 #19

Open aw-bib opened 9 years ago

aw-bib commented 9 years ago

If I understand it correctly, the Marc21 format allows for arbitrary indicators, addressing the use cases mentioned by @fjorba. This seems a definite improvement compared to Invenio 1.x.

For illustrative purposes I'll just use Marc 100 below; however, Marc has several other fields where these issues apply as well.

If I understood the schema correctly, however, each Marc field gets mapped to an internal JSON name. So, e.g., a field like 100__ gets mapped to main_entry_personal_name. Similarly, 1000_, 1001_ and 1003_ get mapped to main_entry_personal_name as well, so all personal author names end up in the same JSON field. Again, this nicely addresses the use case of @fjorba, as finally all authors get indexed and displayed regardless of the indicators, since after ingestion only main_entry_personal_name is used.

In discussions with @martinkoehler we now wondered about dissemination and probably also indexing issues arising from this.

Say I ingest Marc21 records that use 1001_. In Marc-language this means that $a stores Surname, Forename. So the indicator 1_ adds semantics, introducing the concepts of surname and forename and defining how they should be extracted.

100 1_ $aAdams, Henry

Now I ingest from another Marc source, and I get 1000_. Here the 0 signifies that the name in $a is a forename. The canonical example at LoC being

100 0_ $aJohn $cthe Baptist, Saint

Sidenote to @martinkoehler: from the examples for 0_ it is clear that this does not refer to a storage like Henry Adams as compared to 1_ Adams, Henry, but that it is indeed meant for name entities that consist of a forename only, like e.g. popes, saints or artists' names.

In this discussion we also came to the point that it would be possible in principle to treat 1_ programmatically as "split the name at the comma into the concepts of forename and surname and store them in two JSON fields". We were not clear if this is intended. It could address the dissemination issue mentioned below (a sketch follows the 3_ example below).

Another case is 1003_:

100 3_ $aFarquhar family

Where you do not have a concept of forename / surname but the concept of a family name. (Note: I'd have to check if RDA would not drop the family in the above; as it is clearly expressed in 3_ already, it can be a leftover from ISBD in the AACR. At least I'd prefer to drop it.)
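
To make the "split programmatically" idea concrete, here is a minimal plain-Python sketch of indicator-aware parsing of $a; the function name and the returned keys are made up for illustration:

def parse_personal_name(ind1, value):
    """Interpret Marc 100 $a according to the first indicator."""
    if ind1 == '1' and ',' in value:
        # 1_ means "Surname, Forename": split at the first comma.
        surname, forename = (part.strip() for part in value.split(',', 1))
        return {'surname': surname, 'forename': forename}
    if ind1 == '0':
        # 0_ means a forename-only entity (popes, saints, artists' names).
        return {'forename': value}
    if ind1 == '3':
        # 3_ means a family name.
        return {'family_name': value}
    # Fallback: keep the string as-is when the indicator gives no hint.
    return {'personal_name': value}

E.g. parse_personal_name('1', 'Adams, Henry') yields {'surname': 'Adams', 'forename': 'Henry'}.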

For indexing, one can argue that in the word index for names it might be no issue to treat them all alike. Regardless of whether you search for Adams, Henry or Henry Adams, the word index will take care of it, probably treating the , as an insignificant character. It might come up in the phrase index, however, at least if one has a mixed storage (say one 100__$aHenry Adams).

Some thoughts on this?

The second point, and actually the main concern, arises from reexporting to Marc. If 1001_ is ingested one would expect to get 1001_ back, right? If I understand it correctly, right now one would get 100__ instead. Given the semantics introduced by the indicators, ignoring them would effectively lose information.

In the current system this would not happen, at least not if one stores the ingestion format as is. And as all processes work on the ingested format, updates to the records would be processed properly and thus keep the format.

Any thoughts on this yet?

jirikuncar commented 9 years ago

@aw-bib it's possible to create specialized parsers based on the indicator values:

Example for 1001_:

from dojson import utils
from dojson.contrib.marc21 import marc21


@marc21.over('main_entry_personal_name', '^1001.')
@utils.filter_values
def main_entry_personal_name(self, key, value):
    """Main Entry-Personal Name."""
    # split "Surname, Forename" at the first comma
    surname, _, forename = value.get('a', '').partition(', ')  # NOTE issue with repeatable subfield
    return {
        'surname': surname,
        'forename': forename,
    }
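
To cope with the repeatable subfield flagged in the NOTE above, one could wrap $a with dojson's utils.force_list; a hedged variant that only splits the first occurrence (handling the remaining occurrences is left out of this sketch):

@marc21.over('main_entry_personal_name', '^1001.')
@utils.filter_values
def main_entry_personal_name(self, key, value):
    """Main Entry-Personal Name, tolerating a repeated $a."""
    # force_list wraps a single value in a tuple and leaves sequences alone.
    names = utils.force_list(value.get('a')) or ()
    surname, _, forename = names[0].partition(', ') if names else ('', '', '')
    return {
        'surname': surname or None,
        'forename': forename or None,  # filter_values drops the None values
    }
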
fjorba commented 9 years ago

Thank you, @aw-bib, for raising those issues.

My opinion is that, as our databases are fed from sources of varying quality, processing should be tolerant and intelligent, creating useful indexes out of messy data, but respectful of the records themselves, because some of them are manually curated and of high quality.

Thus, answering your second question first: yes, we need to know, @jirikuncar, if the records keep the indicators when exported or displayed.

And about your first question, @aw-bib, I don't have an opinion. Strictly speaking, those non-normalized authors, or names written in direct order, should go to the 720 tag (http://www.loc.gov/marc/bibliographic/bd720.html), but probably not everybody follows this convention. So your suggestion about treating the , as an insignificant character depending on the indicators, as it seems to be possible according to @jirikuncar, opens some opportunities. But again, having messy data, automated guessing based on the values (like the presence or absence of the comma) could probably give correct answers.

aw-bib commented 9 years ago

@jirikuncar

it's possible to create specialized parsers based on the indicator values:

Understood that. @martinkoehler has some concerns here: if one instance does this and another does not, we get Marc in two distinct flavours, but also with two different internal data models. So records may hardly be interchangeable. And if you later find that you should do it that way for whatever reason, you touch the internal data representation. This sounds like a migration.

So: should something like this go into the default right away? I.e., is the internal data model something that should know the concepts of surname / forename, or should it live on the concept of some string related to a name? (We may find similar things for a bunch of fields; 100 is just a random example.)

If we go for a more detailed parsing, we keep the semantics of incoming data. How should dissemination deal with it then down the road? Does it have to learn to reassemble 1_?
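
One possibility, as a hedged sketch: a reverse registry in the style of dojson's to_marc21 contrib could reassemble the indicator on dissemination. The '$ind1'/'$ind2' keys follow that contrib's convention; treat the whole rule as an assumption, not an existing guarantee of the package:

from dojson.contrib.to_marc21 import to_marc21


@to_marc21.over('100', '^main_entry_personal_name$')
def reverse_main_entry_personal_name(self, key, value):
    """Reassemble 100 1_ from the surname/forename concepts."""
    return {
        'a': '%s, %s' % (value.get('surname'), value.get('forename')),
        '$ind1': '1',  # "surname first" semantics map back to indicator 1
        '$ind2': '_',
    }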

@fjorba I think the current implementation would treat your messy data issue correctly. It just dumps the $a to personal_name and hands it to Elasticsearch to deal with the words in this text with whatever magic ES does to strings. My understanding is that ES is ignorant of the meaning of those strings.

tiborsimko commented 9 years ago

1. WRT search, this is instance dependent. E.g. there is a canonical example in INSPIRE where there are two persons, "DENIS, Bernard" and "BERNARD, Denis", that we don't want to mismatch when people search for authors. This is currently achieved by the special author name value tokeniser. In Elasticsearch, this will be achieved by customisable analysers. IOW, each installation would be able to choose how they want to match names.

(Note also that the "JSON representation of a record in the record store" may differ from the "JSON representation of the same record in Elasticsearch". E.g. one may want to enrich the record's JSON with search-related information for discovery, which is useful for second-order queries, or for expanding linked authority information.)
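
To illustrate the customisable analysers mentioned above, here is a minimal sketch of an Elasticsearch index body, written as a Python dict; the field and analyzer names are made up. The same name is indexed twice: word-by-word for the word index, and as one lowercased token for exact phrase matching, so "DENIS, Bernard" and "BERNARD, Denis" stay distinguishable:

author_index_body = {
    'settings': {
        'analysis': {
            'analyzer': {
                # keyword tokenizer keeps the whole name as a single token
                'name_exact': {
                    'type': 'custom',
                    'tokenizer': 'keyword',
                    'filter': ['lowercase'],
                },
            },
        },
    },
    'mappings': {
        'properties': {
            'personal_name': {
                'type': 'text',
                'analyzer': 'standard',  # word-by-word matching
                'fields': {
                    'exact': {'type': 'text', 'analyzer': 'name_exact'},
                },
            },
        },
    },
}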

2. WRT exchange of data, this touches our earlier musings about what a "master" format is in a given Invenio instance. (Or perhaps let's call it the "editable" format.)

Here is a previous graph on this topic:

[image: invenio_master_format_discussion]

So, in which layer does your installation live? Interested in editing only MARC? Or only JSON? Or both MARC and JSON interchangeably? Considerations like these may help in designing the best data model for any given Invenio instance.

(CC @Kennethhole that also faces similar issues)

aw-bib commented 9 years ago

So it is more like this (I hadn't gotten this clear until now):

One special case might occur if the internal JSON is "by chance" (this chance can probably happen in some cases by design) identical to the master format. Then all this applies, but doJSON is basically mapping 1:1 and does nothing.

Doesn't that imply also that something like webdeposit needs the reverse mapping of its fields back to the master format to get the storage representation? Say, for an initial submit it would need to produce the master format, as the record does not yet exist. When modifying, it would need to store data back to the master format, otherwise I'd export an outdated record.

Deposit could most likely simplify things, however. To stay in the example: if deposit has two fields for forename and surname, it could reassemble them to 1001_ by a = "%s, %s" % (surname, forename) etc. Usually it is expected that the end user interface drops some richness of beasts like Marc, e.g. by normalizing in the first place. So deposit's functions would live on

record = {
    'forename': 'Forename',
    'surname': 'Surname'
}

and then some magic function transforms this to storage, where doJSON is applied again.
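
To stay with the example, such a "magic function" could be as small as this hedged sketch (the function name and the target JSON shape are illustrative):

def deposit_to_storage(record):
    """Transform deposit fields back into the MARC-in-JSON storage shape."""
    return {
        'main_entry_personal_name': {
            # reassemble "Surname, Forename" as a 100 1_ would store it
            'personal_name': '%s, %s' % (record['surname'], record['forename']),
        },
    }

deposit_to_storage({'forename': 'Forename', 'surname': 'Surname'})
# -> {'main_entry_personal_name': {'personal_name': 'Surname, Forename'}}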

tiborsimko commented 9 years ago

@aw-bib Mostly yes, but not all statements are always true. It depends on the installation. Let me address just one example of yours. (1) The new deposit module creates JSON internally, so if a site decides to store it like that, then the "master format" for deposited records would be JSON, strictly speaking. (There was no MARC created by the submitter, no MARC coming up on the input end.) In this case, if one would like to export the deposited record as MARC, then MARC will be a "slave format" generated out of the internal JSON storage, strictly speaking. One could do it exactly as you mentioned. In this case, one has to pay close attention to MARC <-> JSON convertibility when deciding upon a data model.

(2) However, an installation may also choose to plug a MARC generator into the deposit chain, so that the "master format" of deposited records would be MARC, so to speak. (One could do it by modifying the deposit workflow and adding a last step that would create MARC out of the "intermediate deposit JSON"; kind of like the old BibConvert was generating MARC from submitted form fields.) Or one could even imagine a deposit textarea where people could type MARC directly, or people using MARCEdit on Windows and sending MARC by the upload API... in all these cases, the "master" format would be MARC, because this is what the human actors would "see" and "edit". An installation would be free to implement it one way or another, depending on the pros/cons for their concrete use case.

aw-bib commented 9 years ago

Mostly yes, but not all statements are always true.

I understood, and just did not make it explicit, that indeed the internal JSON can coincide with the master format. Probably something most devs would prefer.

MARC will be a "slave format" generated out of the internal JSON storage, strictly speaking. One could do it exactly as you mentioned. In this case, one has to pay close attention to MARC <-> JSON convertibility when deciding upon a data model.

Of course. So my feeling is that if you e.g. have to handle a library's catalogue, or some instance where you get a bunch of external data in Marc (e.g. for ebook packages and whatnot) and need to exchange with Marc in both directions, one would be better off treating Marc as the master format. OTOH, if say some archive is interchanging data with other archives, one is probably better off with EAD or something along those lines.

In other words: one could probably (surely ;) describe a vase in Marc, but if the whole environment speaks another language, it just adds complexity to handle all those translations. These are also our main concerns (as a library) if the master format differs from Marc.

One could do it by modifying deposit workflow and adding last step that would create MARC out of "intermediate deposit JSON"; kind of like the old BibConvert was generating MARC from submitted form fields.

As far as I understand it, this would be the way to go if you allow webdeposit while keeping Marc as a master format. This is how the join2 setup works. We get a lot (in some cases almost all) of the "cataloguing" via websubmit from the campus, and we are now moving towards adding the catalogue data, or in some instances even have this already.

Point is, of course, I don't want to stick to Marc just to nag developers, right? In fact I'm pretty sure that Marc will be replaced in the foreseeable future by some shiny new format that doesn't want to print catalogue cards or microfiches anymore. Be it bibframe or even something completely new. Just the time scales of "foreseeable" between librarians (thinking in years, decades and centuries) and developers (thinking in days, weeks and probably months for long term projects) do not coincide. So in my world I'll have to live "for ages" (as a developer might see it) with Marc, while as a librarian I'm at the brink of conversion to something completely new. And we'll have to do this without any extra resources. So we have to keep an eye on the budget (i.e. dev capacities).

E.g. my simple bean counting tells me that join2 currently has 9 devs (not FTE; it's more like "people who can write code if need be and thus have access to dev git"; most of them only with very limited resources, so we are at about 3-4 FTE at best). But we have 16 librarians (here we are at FTE), counting only those with a github account who contribute tickets from the trenches, plus some more who pass the ticketing on to their colleagues.

fjorba commented 9 years ago

Thanks, @tiborsimko, for making it clear that Marc21 can be the master format for an Invenio instance.

One thing that I'm afraid those enthusiasts for the-new-bib-format-that-yesterday-somebody-talked-about forget is that there is huge work in implementing mappings. The syntactic one-to-one equivalence is already large, but mechanical. The real mismatch happens at the semantic level. Even with well-established formats like Marc21 and Dublin Core, there is no clear equivalence between 1XX, 7XX, Creator and Contributor fields. Much less if there are subfields (like affiliation, but there are many more of them). See http://www.loc.gov/marc/marc2dc.html.

Marc21 has one big advantage over all of them: it has been used, polished and explained for decades, from experience, bottom up. It provides quite good solutions for real world problems, including specialised cataloguing. JSON is (a very nice) syntax, but which JSON semantics are we talking about? My (admittedly) random searches don't bring me a clear answer. JSON-LD? BibJSON? Marc in JSON syntax? Who will define the fields and subfields, and make the mapping? You, @jirikuncar? Shouldn't it be done by librarians?

It is true, as @aw-bib states, that Marc21 will probably be replaced by something new, and there is work going on with bibframe. But, as I understand it, the real discussion happens at the semantic and grouping level, not at the syntax level, which can be anything flexible enough (and JSON is great in that respect).

tiborsimko commented 9 years ago

My (admittedly) random searches don't bring me a clear answer. JSON-LD? BibJSON? Marc in JSON syntax? Who will define the fields and subfields, and make the mapping? You, @jirikuncar? Shouldn't it be done by librarians?

The complete mapping of MARC21 to JSON is provided by this very DoJSON package. What we did is, basically, take the LoC MARC standard and use those exact field/subfield names as they appear in the standard, together with indicator values and repeatable/non-repeatable flags, in order to create our JSON. Hence the result should be very library friendly :) and very neutral as to the naming of entities. (Certainly much more so than the previous CERN-customised MARC schema that Invenio v1.x comes with by default.) See also NYPL's https://github.com/thisismattmiller/marc-json-schema upon which our work was roughly based.
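
For reference, basic usage is essentially a two-liner, roughly as in the DoJSON docs; the sample MARCXML below is made up for illustration:

from dojson.contrib.marc21 import marc21
from dojson.contrib.marc21.utils import create_record

marcxml = """<record>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Adams, Henry</subfield>
  </datafield>
</record>"""

# parse the MARCXML blob and convert it to the LoC-named JSON fields
json_record = marc21.do(create_record(marcxml))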

fjorba commented 9 years ago

Ok, now I see! So when you, Invenio developers, talk about JSON, do you just mean Marc21 in JSON syntax in a 1:1 equivalence (and with long verbose English expressions substituting for the terse Marc21 tag numbers and subfields)?

(I'm glad to see that even specialised tags, like music incipit fields and subfields do appear in your package, like https://github.com/inveniosoftware/dojson/blob/master/dojson/contrib/marc21/fields/bd01x09x.py#L335)

tiborsimko commented 9 years ago

So when you, Invenio developers, talk about JSON, do you just mean Marc21 in JSON syntax in a 1:1 equivalence

Generally speaking, we talk about the JSON representation of a record in broad terms, but always representing the record's "master format" faithfully. (1) E.g. if an installation chooses EAD as their one and only master format, then they can write an EAD-to-JSON mapping and they are done; they don't have to pass through MARC21 conversion anywhere anymore in order to profit from some features of the system, as was often the case with Invenio v1.x. (2) Hence if an installation chooses to use MARC21, then we indeed essentially talk about a complete representation of "MARC-in-JSON". Many Invenio 1 users using MARC and no other master formats would be in this situation. Here one can say that DoJSON offers a complete representation of MARC in JSON. But it is only one of the options, hence contrib.marc21. More will follow.
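
As a toy illustration of (1): dojson's core Overdo registry is format-agnostic, so an EAD mapping is just another set of rules. The EAD element name below is a placeholder, not a worked-out EAD schema:

from dojson import Overdo

ead = Overdo()


@ead.over('title', '^titleproper$')
def title(self, key, value):
    """Map an EAD titleproper element to a JSON title."""
    return value

ead.do({'titleproper': 'Records of the Example Archive'})
# -> {'title': 'Records of the Example Archive'}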

I'm glad to see that even specialised tags

Yes, that was precisely the goal, to have as complete MARC21 support as possible out of the box.

fjorba commented 9 years ago

Great, thanks again for your patience and long explanation.