inveniosoftware / dojson

Simple pythonic JSON to JSON converter.
https://dojson.readthedocs.io

marc21: 100 $a R vs NR #23

Closed tiborsimko closed 9 years ago

tiborsimko commented 9 years ago

We seem to have a problem with repeatable (=R) vs non-repeatable (=NR) subfields.

Consider the following MARC21 record:

  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Donges, Jonathan F</subfield>
  </datafield>

It is converted into:

 'main_entry_personal_name': {'personal_name': ['Donges, Jonathan F']},

Note that personal_name is a list (so treated as R), while the MARC21 standard says that 100 $a - Personal name is NR, hence it should not be a list.
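A rule distinguishing the two cases could look roughly like the sketch below. This is only a sketch: Overdo, utils.filter_values and utils.force_list are existing dojson helpers, but the tag regex and the $e handling are illustrative, not the actual contrib.marc21 rule.

    from dojson import Overdo, utils

    marc21_sketch = Overdo()

    @marc21_sketch.over('main_entry_personal_name', '^100..')
    @utils.filter_values
    def main_entry_personal_name(self, key, value):
        """Main Entry - Personal Name (illustrative sketch only)."""
        return {
            'personal_name': value.get('a'),                   # $a is NR -> plain string
            'relator_term': utils.force_list(value.get('e')),  # $e is R  -> list
        }

    # marc21_sketch.do({'100__': {'a': 'Donges, Jonathan F'}}) should then yield
    # {'main_entry_personal_name': {'personal_name': 'Donges, Jonathan F'}}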

fjorba commented 9 years ago

Do you really think that the internal implementation syntax should make a distinction between repeatable and non-repeatable fields? Why don't you just consider them all repeatable, and thus lists? The internal representation would be radically simplified. And if somebody is interested in doing a strict check on whether fields are repeatable or not, it can be done externally.

It reminds me of the indicators issue. It seems to me, with my librarian attitude, that you, the programmers, take advantage of some possibilities that the implementation gives (call it JSON or Python, it doesn't matter) and implement them just because it can be done.

Please remember Postel's principle: Be conservative in what you do, be liberal in what you accept from others.

If a librarian decides (or makes a mistake) to repeat an unrepeatable field, please let her do it. This field or subfield may be moved somewhere else later, but for now, this value may be useful in this (strictly incorrect) non-repeatable field or subfield.

Would you agree, @aw-bib?

aw-bib commented 9 years ago

I tend to agree with @fjorba. If all subfields are treated as repeatable, nothing breaks, as just one value is still valid.

There is, however, one point to be kept in mind. If one has deposit-based submissions (or a bibedit that enforces the rule), such a wrongly repeated field might die silently in the deposit functions. Thus, probably a checker for such issues could be helpful.

@martinkoehler any points here?

martinkoehler commented 9 years ago

I agree with @fjorba and @aw-bib, especially since I could imagine that an NR field might be changed to R. If this is fixed in the data format, that would be a disadvantage.

Another argument in favor of treating all subfields as R in the data model is the MARCXML representation. As far as I can see, the schema http://www.loc.gov/standards/marcxml/xml/spy/spy.html does not specify R or NR. This is a decision for (software) layers above. IMHO the interface that accepts data to store should check the validity (which includes R vs NR, ...), and all manipulations should/could be done through such an interface.

BTW: Such a design would also make it easy to implement rules for 9xx fields. Since these are user defined, LoC does not define repeatability. However, if an installation decides e.g. to store something NR (for them) in a subfield in this range, while someone else uses it as R, I think it would be better if this did not result in a "new", incompatible internal format.

tiborsimko commented 9 years ago

If a librarian decides (or makes a mistake) to repeat an unrepeatable field, please let her do it. This field or subfield may be moved somewhere else later, but for now, this value may be useful in this (strictly incorrect) non-repeatable field or subfield.

Thesis: I fully agree that it is advantageous in many use cases to be very liberal and to accept any field and any subfield as repeatable. This is roughly what we did for the CERN Open Data project. It was very handy especially for custom fields, about which it was not known in advance whether they would be repeatable or not, and which were changing a lot until we settled on the data model. Basically, we treated almost all fields and subfields as repeatable, exactly as you suggest.

Antithesis: allowing everything is disadvantageous, however, because it does not prevent mistakes from being made when a schema is known. Imagine a cataloguer who would like to insert two 100 fields; with a very liberal schema there is nothing to do: this is allowed. With a more strict schema, the editor would recognise this as not matching the schema and would therefore raise a warning back to the cataloguer, not allowing this mistake to happen in the first place.

Synthesis: each site is free to choose the data model that fits their use case best. E.g. one Invenio site may want to stick to the strict MARC21 standard as it is described on those pages, setting up a stricter contrib.marc21 schema, preventing them from making mistakes at the price of having to enter appropriate information from the get-go. E.g. another Invenio site can opt for a permissive contrib.marc21liberal (say) schema that would not check much of anything: it would allow all indicators without paying attention to their meaning, allow all fields and subfields to be repeatable, etc. In this way the cataloguing can be very loose, at the price of not being able to catch mistakes such as missing mandatory fields in the "heart" of the system.

Each site has a choice of which schema to use, either more strict or more liberal, so what we are probably discussing here is what would constitute a good default. To me, if we decide to support MARC21 standard as it is described there, then we should behave as close as possible to that text, hence distinguishing repeatable vs non-repeatable status. In this way, we profit from the JSON Schema capabilities to the full. However, nothing prevents us from having another, more liberal schema, as we ourselves did for the CERN Open Data portal.

[...] Thus, probably a checker for such issues could be helpful.

This is a good example of postponing the checking work (that a stricter JSON schema would do for you "for free") until later. If you take a stricter schema, then the checker is already inherently embedded in the record upload step, so all the uploads are valid, all the clients can be warned, etc. If you take a liberal schema, then developers must write special checking programs (such as the old BibCheck) that go over the records later. I would argue that we save more FTEs with the former approach than with the latter, because we can take advantage of already-existing JSON-Schema-based checking tools, such as tv4. We don't have to invent our own language for common checks, and we don't have to write much custom code ourselves (as with BibCheck).
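To make this concrete, here is a purely illustrative sketch: the schema fragment is hypothetical, and the Python jsonschema library stands in for tv4, which plays the same role on the JavaScript side. A strict schema encodes "100 $a is NR" as a plain string, so a repeated value is refused at upload time:

    from jsonschema import ValidationError, validate

    # Hypothetical fragment of a strict contrib.marc21-style schema:
    # 100 $a is NR, so personal_name must be a single string.
    strict_schema = {
        "type": "object",
        "properties": {
            "main_entry_personal_name": {
                "type": "object",
                "properties": {
                    "personal_name": {"type": "string"}
                }
            }
        }
    }

    good = {"main_entry_personal_name": {"personal_name": "Donges, Jonathan F"}}
    bad = {"main_entry_personal_name": {"personal_name": ["Donges, Jonathan F",
                                                          "Second, Author"]}}

    validate(good, strict_schema)      # accepted silently
    try:
        validate(bad, strict_schema)   # a list where a string is required
    except ValidationError as error:
        print("upload refused:", error.message)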

Another argument in favor of treating all subfields as R in the data model is the MARCXML representation. As far as I can see, the schema http://www.loc.gov/standards/marcxml/xml/spy/spy.html does not specify R or NR.

This XSD seems akin to the very liberal schema I mentioned above, which would allow records with several 100 fields, without any 245 field, etc. Is it really advantageous to be so permissive and to leave all the checks for later? In my eyes, this does not profit from all the goodies that a richer JSON Schema offers. See e.g. the live JSON Editor example. With richer data model schemata, a site can customise the JSON schema for a particular record type: e.g. if the site introduces a new collection describing their music or their satellite libraries or whatever, they only have to define a rich schema and they get a very rich, check-friendly inputting interface for this new collection out of the box.

BTW: Such a design would also make it easy to implement rules for 9xx fields

The local fields are not targeted in this package, as every site can define them as they want. An Invenio site will be able to define several JSON schemata describing their local collections and local fields and then "compose" their complete data model in this way, kind of like joining the basic MARC standard with their local MARC standards. (Note that each record can be of a different record type, matching a different JSON schema.)

However, if an installation decides e.g. to store something NR (for them) in a subfield in this range, while someone else uses it as R, I think it would be better if this did not result in a "new", incompatible internal format.

In order to promote collaboration, we have started invenio-jsonschemas registry where we'll encourage people to publish their schemas and reuse them. E.g. we can have there hep-metadata-common-0.7.json describing basic commonly-shared metadata fields in HEP community, that some site can combine with hep-analysis-final-states-0.3.json describing high-level information about final state particles in a HEP analysis, cds-photo-1.4.json for multimedia collections, desy-keywords-1.4.json describing keywords, etc. Just an example for illustration.

aw-bib commented 9 years ago

allowing everything is disadvantageous, however, because it does not prevent mistakes from being made when a schema is known. Imagine a cataloguer who would like to insert two 100 fields; with a very liberal schema there is nothing to do: this is allowed.

I perfectly agree on this point.

It is generally disadvantageous in the case of interactive cataloguing. I also agree that we in fact had to write a lot of (way too much) code to get websubmit at join2 to a stage where we can almost rule out that the user keys in something wrong. I'd love to drop all this code. Of course every unwritten line of code is a good line. (It can't be wrong :) I understand that a strict schema could be very helpful here.

I also understand that a strict schema would be enforced in deposit, the upcoming bibedit and friends, so I see the advantage in all those use cases. I'd also buy and sign that we really want to have it there, even for our projects, just to make sure we know what data we hold. So, yes, in general a strict schema is something we're looking for, especially for deposit-ish workflows where non-librarians add data to the system.

With a more strict schema, the editor would recognise this as not matching the schema and would therefore raise a warning back to the cataloguer,

What will happen with a strict schema if I do not have interactive cataloguing?

E.g. when you get foreign data, say from some publisher or from some other Invenio installation?

Probably here we come to the point of concern. If the schema is not too strict, you can add data that does not follow all the rules and still get meaningful results, as long as it follows most of them. Usually our data is digested by humans, not machines. Say, you'd probably rule out two 100s, while you'd allow two $a subfields in 100 where the cataloguer just used the second $a for a pseudonym, for example.

With a strict schema I understand that you have to clean all the data before ingestion is possible. This can become quite a task, depending on what you get, and may even involve a lot of manual work. To stay with the author example, say your source catalogued pseudonyms in a second $a like this:

1001_ $a Loriot $a Bülow, Bernhard Victor Christoph-Carl von

Here, from a MARC point of view as well as an IT point of view, the data is definitely wrong. I understand that it would get refused by a strict schema, and of course we'd most likely not accept this from a deposit. Most likely one wants to end up at

 1001_ $a Loriot $0 (DE-588)118729101 $2 gnd

where the full form of the name is resolved via the authority record, which you'd need to ingest first. It could, however, well be that the second $a should survive. I'm just making the point that I cannot really clean this automatically, as it requires knowledge about the preferred form of the name. For the user, however, the wrong data would still be meaningful and would even map at least a part of the authority link, while for the cataloguer it might help clean up the mess. So, in a way, two $a subfields indexed to the same index are wrong, but not as wrong as two 100 fields going to the first-author index. (I think this is the point of @fjorba.)

So, in a way, a strict schema can be a disadvantage if you get data that is perfect in almost all respects, but just fails to validate against some less important field. (Side note, thinking about a library's catalogue: if it's about licensed content which you'll lose after n months, you wouldn't want to clean up all the data just to throw it away later on. At CDS the PDA records could also be of this kind. No need to clean them up, as most of the books are never bought.)

It is a disadvantage of MARC that it's not hewn in stone. And it is a disadvantage of librarians that they use words like "should", "can", "would be nice" in their standards. ("Standard" should only exist in the singular anyway, in a perfect world.) That's one reason why we nearly never get perfect data to start with.

To me, if we decide to support MARC21 standard as it is described there, then we should behave as close as possible to that text

A general +1.

This will, however, be quite a task. Maybe this is another point of concern here. If it is possible and feasible to do, then I agree it's time to do it here.

BTW a strict schema would address a problem of @martinkoehler's. It would in the end result in more reusable code for display, indexing etc. than a loose one, simply because it makes the rules clear. This is a general concern that arises if one has more than one master format.

I'm not sure if it is possible to end up having both worlds: a strict and a looser schema. Then I could (if need be) add data via the loose schema (say for the PDA collection), validate it against the strict one later (once the book is bought and gets real cataloguing) and clean the data by and by, thus effectively using the schema as the checker I mentioned.

@tiborsimko I'm not sure that I got you correctly on this point. I understand that there were some discussions about more than one master format within one installation. Most likely this is not feasible and would not have too many use cases if the two formats differ widely. For the above it could solve some practical problems.

fjorba commented 9 years ago

I agree with @aw-bib. I was also especially concerned about batch loading from less-than-perfect sources, be it OAI, Excel spreadsheets, homegrown databases or whatever. Error messages may get lost in a forest of log messages that not everybody has access to. A syntactically incorrect record that is already in the database is accessible to a much larger group of people.

In our workflows, it is far better to have the records, so they can be seen and searched even if there are errors, than having to wait for weeks to get perfect records. But again, this correctness is only from a syntax point of view; no program will check whether the record is correctly catalogued. The example @aw-bib has given about pseudonyms is a good one, and a typical case when data comes from DC or simpler schemas. Such a record can be found, it makes sense, it is useful, and it can be fixed later (this is another case of the Release early, release often motto). And it may not be obvious how to fix all the cases programmatically.

I think that you are putting too much importance on this syntactic correctness aspect. Records are useful because of their semantic meaning, not because they pass all syntax restrictions.

After reading all this discussion, I think that I was leaning towards our current implementation: our current system accepts all tags, all indicators, all subfields, repeated or not, but only those that are listed in the index tables, in the display parameters or in the conversion rules get indexed, displayed or converted to foreign formats. So errors don't hurt, don't get propagated and don't affect anybody else; they are just our problem, our TODO list. For us, this is very practical. Can this be described in a contrib.marc21liberal schema, @tiborsimko?

Maybe a compromise would be to implement those restrictions as warnings, not errors, and append those messages to some (configurable) local tag.

martinkoehler commented 9 years ago

Having had a closer look at the dojson code, I think I understand a little bit more. The definition of the internal format contains the semantic rules (which, as @tiborsimko writes above, enforces these rules at a deep level).

While I perfectly agree that every instance should enforce as many of these rules as possible, I have some points which I hope can be clarified:

First: if I look at http://www.loc.gov/marc/bibbas99.html as an example (the complete list of changes is at http://www.loc.gov/marc/status.html), you can see that there have regularly been "Changes in repeatability". It seems to me that this implies that if I want to keep "up to date" with the LoC MARC scheme, every time there is a change I have to change my internal storage format, which - correct me if I'm wrong - leads to a complete data migration.

Moreover, because of the way my programs access the data structure, if a field changes from R to NR my programs do not work anymore, since the data model now returns e.g. a string instead of a list. That means that, in some abstract view, the API changes with the format definition. My gut feeling is that the more stable an API, the better. So I would prefer something where the API is as far as possible independent of such definitions, e.g. lists in this example, whereas the checking is done via the API, e.g. if I try to save a list with more than one entry in the NR case, the API should/could return an error (depending on the rigidity).

I had this kind of scenario in mind when talking about a more liberal storage model.
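A minimal sketch of such an access pattern (force_list is an existing dojson helper; the wrapper function and the record layout are just illustrations):

    from dojson.utils import force_list

    def personal_names(record):
        """Return the 100 $a value(s) as a list, whether stored as NR or R."""
        value = record.get('main_entry_personal_name', {}).get('personal_name')
        return list(force_list(value)) if value is not None else []

    # Both shapes yield ['Donges, Jonathan F']:
    personal_names({'main_entry_personal_name': {'personal_name': 'Donges, Jonathan F'}})
    personal_names({'main_entry_personal_name': {'personal_name': ['Donges, Jonathan F']}})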

With respect to the registry: IMHO this is very important. It would probably not only contain the schemas but also all the "helper code" as well, since this code is schema-dependent. I do not know how much "helper code" exists and how much of the Invenio "core" code will turn out to be schema-dependent as well.

aw-bib commented 9 years ago

Maybe a compromise would be to implement those restrictions as warnings, not errors, and append those messages to some (configurable) local tag.

This sounds like a reasonable approach. Even tagging records for some sort of revision list (aka a holding pen collection) sounds reasonable.

Still, I think, for manual curation/ingestion, be it in bibedit or deposit, we really want a strict schema, right? So the curator, once she has the time to clean up the mess we got in the first place, should get a helping hand from the schema.

E.g. while I was playing with the JSON editor demo, I tried to add a snake as a pet. (Of course I didn't fiddle with the form, but used the JSON right away.) I got an error, as snakes are not among the defined pets; they would have to be either other or defined as a valid option. This sounds quite helpful to me, thinking about vocabulary-controlled inputs and in the direction of LOD. (Most likely, however, the cataloguer should not be able to edit the schema right away, as is possible in the demo.)

@martinkoehler's point

checking is done via the API

seems to be addressed somewhat in inveniosoftware/invenio#3345, especially taking @kaplun's comment into account. As mentioned there, I think one is searching for the sweet spot between both of these.

tiborsimko commented 9 years ago

Replying to @aw-bib's https://github.com/inveniosoftware/dojson/issues/23#issuecomment-120062493:

What will happen with a strict schema if I do not have interactive cataloguing?

The record upload would be refused and the client (e.g. an interactive editor or a programmatic workflow) would be informed about this exception. Then it depends on how the exception is handled by the client: e.g. an interactive editor can show the error, an automated upload process can refuse the record entirely, or it can populate a holding pen or some other staging area where "unclean" records sit. It depends on how people write their automated workflows and how they handle the exception there.

Side note, thinking about a library's catalogue: if it's about licensed content which you'll lose after n months, you wouldn't want to clean up all the data just to throw it away later on. At CDS the PDA records could also be of this kind. No need to clean them up, as most of the books are never bought.

In this case I'd just create a different schema for PDA records and use that. Please recall that a site can use many different schemata in production; each collection of records can conform to a different schema. As outlined above, a site can use cds-photo-1.4.json for photos, cds-book-0.8.json for books, etc. In this way a site can support both MARC "master" records for the library and EAD "master" records for historical archives in the same installation. The same "schema denormalisation" process doesn't have to stop at the level of the MARC or EAD standards; one has all the advantages of going deeper and specifying one "MARC sub-schema" for books, one "MARC sub-schema" for videos, one "MARC sub-schema" for dirty PDA records, etc. (The advantage being out-of-the-box validation for each "differently maintained" collection.)

I'm not sure if it is possible to end up having both worlds: a strict and a looser schema. Then I could (if need be) add data via the loose schema (say for the PDA collection), validate it against the strict one later (once the book is bought and gets real cataloguing) and clean the data by and by, thus effectively using the schema as the checker I mentioned.

Yes, exactly! The PDA records would live in a separate collection that follows a liberal schema, and when a cataloguer decides to move some PDA record to the regular Books collection that follows a stricter schema, he/she would have to resolve the conflicts before he/she is allowed to do so by the system.

tiborsimko commented 9 years ago

Replying to @fjorba's https://github.com/inveniosoftware/dojson/issues/23#issuecomment-120234040:

I think that you are putting too much importance on this syntactic correctness aspect.

What I care about is to offer tools so that every Invenio installation can easily configure their data model in a way that matches their particular use case, workflows and cataloguing habits. If you want to accept any MARC with any repeatable fields and subfields, there is absolutely no problem -- you can use a liberal schema like the XSD one that @martinkoehler mentioned. (Recall that this is what we ourselves do for the CERN Open Data use case, as I highlighted above.) If another site wants to use a strict MARC21 schema conforming to the MARC21 LoC guide, again there is absolutely no problem -- Invenio aims at supporting this use case as well. (And this issue was born because we don't fully support it yet regarding R vs NR subfield status.) If you want, you could say that I care about strict conformance to the MARC21 LoC guide only within the context of the second use case, but absolutely not within the context of the first use case :smile:

tiborsimko commented 9 years ago

Replying to @martinkoehler's https://github.com/inveniosoftware/dojson/issues/23#issuecomment-120237223

that there have regularly been "Changes in repeatability". It seems to me that this implies that if I want to keep "up to date" with the LoC MARC scheme, every time there is a change I have to change my internal storage format, which - correct me if I'm wrong - leads to a complete data migration.

Yes, one would have to migrate the data if one switches from a "MARC21 Update 19" schema to a "MARC21 Update 20" schema where R vs NR status for some subfields changed. In practice, are libraries switching often between these various MARC updates?

Moreover, because of the way my programs access the data structure, if a field changes from R to NR my programs do not work anymore, since the data model now returns e.g. a string instead of a list.

Yes, this was precisely why we selected a more liberal schema for the CERN Open Data use case.

tiborsimko commented 9 years ago

Replying to @aw-bib's https://github.com/inveniosoftware/dojson/issues/23#issuecomment-120238316

checking is done via the API [...] seems to be addressed somewhat in inveniosoftware/invenio#3345

Yes, more complex checks are the domain of the "BibCheck"-style checker module. The schema-based checks are always run, use generally available libraries, and aim at getting the basic semantics right. The checker-based checks are usually run a posteriori, use our own custom code, and aim at getting deeper semantics right (e.g. complex inter-field dependencies and whatnot).
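To make the split concrete, a toy sketch: the schema layer checks shape (R vs NR, types), while a checker-style rule, run a posteriori with custom code, verifies something that needs a lookup. Both the field layout and the rule are hypothetical examples, not existing checker code.

    def check_authority_links(record, known_authority_ids):
        """Report 100 $0 identifiers that do not resolve in the local authority store."""
        issues = []
        entry = record.get('main_entry_personal_name', {})
        authority_id = entry.get('authority_record_control_number')  # MARC 100 $0
        if authority_id is not None and authority_id not in known_authority_ids:
            issues.append('unresolved authority link: %s' % authority_id)
        return issues

    record = {'main_entry_personal_name': {
        'personal_name': 'Loriot',
        'authority_record_control_number': '(DE-588)118729101'}}
    check_authority_links(record, {'(DE-588)118729101'})  # -> [] (link resolves)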

kaplun commented 9 years ago

Replying to @tiborsimko's https://github.com/inveniosoftware/dojson/issues/23#issuecomment-120293127

Yes, one would have to migrate the data if one switches from a "MARC21 Update 19" schema to a "MARC21 Update 20" schema where R vs NR status for some subfields changed.

In fact I believe upgrading records to respect updates in schema is going to be a recurrent task. For that we will need a place and a pattern to create "record upgrade recipes" (and it should be possible to replay these upgrades also when recovering a record from a previous, outdated version). I guess this is the subject of a different RFC, though.

fjorba commented 9 years ago

Replying to @tiborsimko's

In practice, are libraries switching often between these various MARC updates?

My understanding, from what I can see from my library colleagues, is that it is more a matter of conventions. Take an easy example from the page that @martinkoehler mentioned (http://www.loc.gov/marc/bibbas99.html): whether $u in 856 changes from repeatable to not. Probably it means that, after some practice, there is a consensus that it is better to repeat the whole 856 tag, so that the librarian can add a note for each URL, rather than having a list of URLs with a single note. Does it mean that all previous records (maybe a lot of them) are suddenly invalid, and thus moved somewhere that makes them invisible? Probably not, but rather that it is better to change them to the new practice.

That's why I think that this check should be treated as a warning, not an error.

tiborsimko commented 9 years ago

In fact I believe upgrading records to respect updates in schema is going to be a recurrent task.

(Yes, that's why we aim at having schema version numbers from the get-go. In theory, one could imagine a co-existence situation where some of the records of the Books collection would follow the cds-book-0.8.json schema and some the cds-book-0.9.json schema for a longer time, with the OAIS/DIP store taking care of display formats etc. -- but in practice it is much more practical to have collections and "doctypes" matching. An administrator would often issue a command like invenio records migrate cds-book-0.8 cds-book-0.9 so that all Books records would follow the same schema.)

kaplun commented 9 years ago

An administrator would often issue a command like invenio records migrate cds-book-0.8 cds-book-0.9 so that all Books records would follow the same schema.)

:+1: !

fjorba commented 9 years ago

An administrator would often issue a command like invenio records migrate cds-book-0.8 cds-book-0.9 so that all Books records would follow the same schema

In my longish experience, I have seen quite a few migrations, among several Marc formats (ISISmarc, ISDSmarc, Ukmarc, Catmarc, USMarc, Marc21) and library vendors' particularities. Maybe only half of them can be changed automatically. The others need manual library inspection and correction, because what changes is the criterion, not a mechanical search-and-replace.

Again, if they are treated as warnings, not errors, the problem automatically diminishes to an internal library issue rather than an end-user limitation in the sense that those records would no longer be available.

kaplun commented 9 years ago

Maybe only half of them can be changed automatically. The others need manual library inspection and correction, because what changes is the criterion, not a mechanical search-and-replace.

@fjorba You have to think that upgrades and changes in the data model are not going to be pushed by the Invenio team; they are going to be totally under your control (the librarians' control). It's up to you to decide how to evolve the data model in your instance and to provide suitable upgrade scripts. But sure, if an upgrade implies some further restriction in the data model, it will probably have to be handled manually. On the other hand, in the case of a relaxation of the schema (say a field is changed from NR to R, hence e.g. from a single string to a list of strings), the upgrade is very straightforward.
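For the relaxation case, the upgrade recipe is indeed trivial -- something along these lines (a sketch; the record layout and the function name are illustrative only):

    def relax_nr_to_r(record, field, subfield):
        """Upgrade a record in place: wrap a single NR value into a one-element list."""
        value = record.get(field, {}).get(subfield)
        if value is not None and not isinstance(value, list):
            record[field][subfield] = [value]
        return record

    old = {'main_entry_personal_name': {'personal_name': 'Donges, Jonathan F'}}
    relax_nr_to_r(old, 'main_entry_personal_name', 'personal_name')
    # -> {'main_entry_personal_name': {'personal_name': ['Donges, Jonathan F']}}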

tiborsimko commented 9 years ago

@fjorba Also, one could consider that the format conversion from ISISmarc to Marc21 is an important job following an important decision, while the typical local format migrations would presumably be less dramatic, a kind of organic evolution of the original schema. (Say a site decides to move from book-0.8 to book-0.9 because of a newly added field or controlled-vocabulary value.) As @kaplun mentioned, the migration will in all cases be decided by each individual site, not "imposed" by the Invenio software suite, as it were.

Anyway I have well registered your call that Invenio should offer another, very liberal MARC21 schema, in addition to the current strict MARC21 one, in an out-of-the-box manner.

fjorba commented 9 years ago

Anyway I have well registered your call that Invenio should offer another, very liberal MARC21 schema, in addition to the current strict MARC21 one, in an out-of-the-box manner.

Yes, thank you!

(I think that there are too many computer people and too few librarians deciding on what to expect for a digital library software like Invenio.)

tiborsimko commented 9 years ago

(I think that there are too many computer people and too few librarians deciding on what to expect for a digital library software like Invenio.)

Nah, the computer people wouldn't bother implementing the MARC21 LoC standard with its plethora of R and NR rules at all in the first place :smile: We listen to all kinds of usage scenarios, that's all. Just ask @Kennethhole about the level of MARC21 standard compliance that some installations need...

aw-bib commented 9 years ago

I think the main point here is that one can indeed have a strict schema along with a liberal one on the same system. I believe that this addresses and solves most points, together with @fjorba's point of having a liberal schema along with the strict standard.

Like @fjorba and @martinkoehler, I'm not so sure that the migration between schemata is as easy as @tiborsimko suggests. Mostly, as @fjorba points out, a change in the schema results from [this second part of the schema that gets neglected here all the time, called the rule book], e.g. RDA, RAK, AACR, what have you, i.e. cataloguing conventions. And those changes are usually not that easily done automagically.

Anyway, I also like the idea of upgrading schemas where a 1:1 mapping is possible, or where only trivial changes are required. That should address many of @martinkoehler's points about this migration; it might handle all minor changes easily. And I also see a chance for curation if one thinks about moving non-validating data to some holding-pen-ish area. As @fjorba points out, one has to consider keeping the stuff visible, as this manual curation may take quite some time.

The PDA records would live in a separate collection that follows a liberal schema, and when a cataloguer decides to move some PDA record to the regular Books collection that follows a stricter schema, he/she would have to resolve the conflicts before he/she is allowed to do so by the system.

I admit that I really like this idea. I wonder, however, about the logic of binding the schema to a collection. It's only a feeling, but most of our records live in more than one collection. I wonder what happens if a record lives in collections A and B while both might use (slightly) different schemata. (I'm not sure that there are many real-world use cases of widely different schemata on one instance.)

Or is "lives in a collection" more a metaphor for something like

"record_schema" : {
     "name":  "Marc21_strict",
     "version": "2015",
     "subversion": "RDA"
}

vs.

"record_schema" : {
     "name":  "Marc21_liberal",
     "version": "2010",
     "subversion": "RAK-WB"
}

I.e. I give them a sort of tag that specifies in which world the record lives, while they still show up in, say, a general collection like "publication database"?

tiborsimko commented 9 years ago

I admit that I really like this idea.

Me too :smile: and with everything being indexed in Elasticsearch, it opens up exciting new possibilities for easy discovery of information coming from even very heterogeneous sources. (The Invenio master branch already shows this in action.)

[this second part of the schema that gets neglected here all the time, called the rule book], e.g. RDA, RAK, AACR

We have indeed been looking at supporting MARC21 LoC only. Concerning RDA and other conventions on top of MARC21, one possibility is to have a simple, one-size-fits-all loose contrib.marc21 schema, leaving complex cataloguing-convention checks for the later checker module. Another possibility is to create separate contrib.marc21rda, contrib.marc21rak, etc. schemata, so that the checks would be inherently hard-wired in from the get-go. I'm advocating the latter option so that we can take full advantage of already-existing JSON Schema goodies such as the tv4 validator or the JSON editor.

I.e. I give them sort of a tag that specifies in which world the record lives, while they still show up in say a general collection like "publication database"?

Yes, that's very well possible. The collection concept is orthogonal to the schema concept. The same collection can be composed of records complying with different schemas.