inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

BibField: clean up `doi` field and other FIXME in `atlantis.cfg` #2075

Closed tiborsimko closed 8 years ago

tiborsimko commented 10 years ago

Similarly to #1557, there are more clean-ups to be done in atlantis.cfg. For example, in invenio.conf we have:

CFG_OAI_ID_FIELD = 909COo
CFG_OAI_SET_FIELD = 909COp
CFG_OAI_PREVIOUS_SET_FIELD = 909COq

while in atlantis.cfg we have:

@persistent_identifier(4)
oai:
    creator:
        @legacy((("024", "0248_", "0248_%"), ""),
                ("0248_a", "oai"),
                ("0248_p", "indicator"))
        marc, "0248_", {'value': value['a'], 'indicator': value['p']}
    producer: 
        json_for_marc(), {"0248_a": "oai", "0248_p": "indicator"}

and:

FIXME_OAI:
    creator:
        @legacy((('909', '909CO', '909CO%'), ''),
                ('909COi', 'group'),
                ('909COs', 'number'),
                ('909COo', 'id'),
                ('909COp', 'set'),
                ('909COq', 'previous_set'))
        marc, "909CO", {'group':value['i'], 'number':value['s'], 'id':value['o'], 'set':value['p'], 'previous_set':value['q']}

Notice the naming differences: e.g. "oai.indicator" should rather be called "oai.set"; "oai.value" should rather be called "oai.id".
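
The renaming suggested here can be sketched in plain Python, outside BibField. The helper name `rename_oai_keys` and the sample record are hypothetical; only the two key pairs come from the paragraph above.

```python
# Hypothetical helper: rename the JSON keys of the "oai" field as proposed.
# Only the two key pairs (value -> id, indicator -> set) come from the issue.
OAI_KEY_RENAMES = {"value": "id", "indicator": "set"}


def rename_oai_keys(oai_field):
    """Return a copy of the oai field using the proposed key names."""
    return {OAI_KEY_RENAMES.get(key, key): val for key, val in oai_field.items()}


# Made-up sample data, for illustration only.
old = {"value": "oai:example.org:123", "indicator": "articles"}
new = rename_oai_keys(old)
```

Keys that are not listed in `OAI_KEY_RENAMES` pass through unchanged, so the helper is safe to run on fields that already use the new names.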

Historically, the 909CO MARC tag was used in the Invenio Atlantis defaults to store OAI information. Later, 0248 was chosen on CDS and elsewhere as a better MARC tag location. The Invenio Atlantis defaults were not changed though, which made it easy, e.g., to test how configurable Invenio is for various site conditions.

However, as it is agreed that 0248 is the better OAI location, we should probably move the defaults there, so that new Invenio installations use "good" MARC tags by default.

(This is similar to 773 vs 909C4, etc.; see #1557.)

P.S. See also other FIXMEs in atlantis.cfg. It would be useful to go through the file, fix all FIXMEs, and fix demobibdata.xml and democfgdata.sql and invenio.conf accordingly.

tiborsimko commented 10 years ago

CC-ing @aw-bib who may perhaps want to give the logical field and physical MARC tag configuration a look?

jalavik commented 10 years ago

In relation to this discussion, Annette Holtkamp had a look a few weeks ago at drafting a possible default configuration. I have the initial draft here: https://github.com/jalavik/invenio/tree/bibfield-default-config-draft

Although it may change/remove too much, it can perhaps be part of a closer analysis/discussion offline. Perhaps a nicer merge (instead of a replacement) of the current config and her draft should be made.

tiborsimko commented 10 years ago

Thanks, I had a quick glance, here are random comments:

jalavik commented 10 years ago

Unfortunately, Annette is harder to reach now, but I can try to merge the changes she meant to make more cleanly with what has been added/changed since then. I will also try to address your comments (with which I fully agree) and confirm with Annette. Thanks!

egabancho commented 10 years ago

What do you think about what bibjson proposed some time ago?

"identifier": [
      {"type":"doi",
      "id":"10.1000/182",
      "url":"http://dx.doi.org/10.1000/182"}
    ]

Maybe it will be a bit messy afterwards if one wants to process the identifiers.
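
To gauge how messy the processing would be, here is a minimal Python sketch that indexes such a list by type. The function name `identifiers_by_type` is made up for illustration; the sample entry is the one quoted above.

```python
from collections import defaultdict


def identifiers_by_type(record):
    """Group the "identifier" entries of a record by their "type" key."""
    grouped = defaultdict(list)
    for entry in record.get("identifier", []):
        grouped[entry["type"]].append(entry["id"])
    return dict(grouped)


record = {
    "identifier": [
        {"type": "doi",
         "id": "10.1000/182",
         "url": "http://dx.doi.org/10.1000/182"},
    ]
}
dois = identifiers_by_type(record).get("doi", [])
```

Every consumer that wants "the DOI" needs a lookup of this kind, which is the extra processing cost of the generic list layout.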

egabancho commented 10 years ago

I'm trying to clean up all the field definitions, set proper names, docs, etc., but I found several inconsistencies between MARC21 and Invenio names and, honestly, I don't know which direction to take.

There are a few fields with naming issues, like 037 which for MARC means Source of Acquisition and for Invenio is primary_report_number.

In this specific case, if we follow Invenio names the field would be something like:

primary_report_number:
    creator:
        marc, "037__", value['a']
    producer:
        json_for_marc(), {"037__a": ""}

On the other hand if we follow MARC21 definition:

source_of_acquisition:
    creator:
        marc, "037__", {'stock_number': value['a'], 'source_of_acquisition': value['b']}
    producer:
        json_for_marc(), {'037__a': 'stock_number', '037__b': 'source_of_acquisition' }

What do you think? Should we stick to MARC or should we "invent" our own names? @aw-bib, @fjorba do you mind taking a look at this?
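
For comparison, the two options can be mimicked in plain Python on a dictionary of subfields. This only illustrates the resulting JSON shapes, not how BibField evaluates the rules; the function names and the sample values are invented.

```python
def as_primary_report_number(subfields):
    """Invenio-style naming: keep only subfield a, as a plain value."""
    return {"primary_report_number": subfields.get("a")}


def as_source_of_acquisition(subfields):
    """MARC21-style naming: one key per subfield, named after the standard."""
    return {
        "source_of_acquisition": {
            "stock_number": subfields.get("a"),
            "source_of_acquisition": subfields.get("b"),
        }
    }


# Made-up content of a 037 field.
field_037 = {"a": "TEST-2014-001", "b": "Example Source"}
invenio_style = as_primary_report_number(field_037)
marc_style = as_source_of_acquisition(field_037)
```

The Invenio-style variant silently drops subfield b, while the MARC-style variant keeps it at the price of a less intuitive name.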

tiborsimko commented 10 years ago

> What do you think? Should we stick to MARC or should we "invent" our own names?

From the point of view of Invenio software, the best is probably to stay neutral in any naming games, i.e. to use preferably an already established schema. This would call for using MARC names, even if sometimes they may not be intuitive...

kaplun commented 10 years ago

@tiborsimko, @egabancho what is the real implementation of a record today? What is stored as the master format? Is MARC indeed represented in the configuration in a way that makes it possible to map to and from JSON?

What about those fields that were previously stored in 9xx tags and are therefore specific to Invenio (980 for collection, 999 for reference)? Is it at all required to still specify a mapping towards MARC in the default configuration, or can we happily just put an entry reference and collection? In this case, how is this created/produced? How do you specify the sub-parts? It looks to me that one always has to specify a mapping towards at least MARC for a field to define its structure... is this correct?

tiborsimko commented 10 years ago

  1. The recommended master format is MARC. Only in this way will all the goodies and modules work out of the box.
  2. If an installation would like to use some "less rich" master format, like DC, the recommended way is to convert it to MARC. In this way we are reducing the situation to the former one.
  3. If an installation would like to use some other "more rich" master format, say, EAD, without possible lossless conversion to MARC as the underlying format, then this is largely possible already; the basics of ingestion, indexing, and searching are working. However, some parts are not working (e.g. browse), some parts need to be amended (e.g. formats), and some parts will not work at all (e.g. cataloguing tools that are all MARC-oriented).
  4. If an installation would like to use a custom JSON as their master format, then this situation is very similar to the second situation. (I.e. getting more and more possible, but we are not there yet.)

tiborsimko commented 9 years ago

@Kennethhole and @egabancho have been working on this in recent months, so I wonder whether we can co-sprint on this issue to fix it before the forthcoming v2.0.0 is out. The motivation is to start with a relatively "clean" JSON model, so that we don't force sites to change their data model once this issue is fully completed.

One of the biggest inconveniences in my eyes is the behaviour of JSON field return types when no schema is declared. For example, a field can return either a dictionary or a list of dictionaries, depending on whether there is only one field instance or several. It seems safer to always declare a schema everywhere in order to always return the same type.
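
The ambiguity can be illustrated with a small Python helper that always returns a list, which is the effect a declared list schema is meant to guarantee. `force_list` is a hypothetical name used for this sketch, not a function taken from JSONAlchemy.

```python
def force_list(value):
    """Normalise a field value so that callers always get a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Without a schema, one instance may come back as a dict, several as a list.
single = {"title": "Chapter 1"}
several = [{"title": "Chapter 1"}, {"title": "Chapter 2"}]

titles_single = [item["title"] for item in force_list(single)]
titles_several = [item["title"] for item in force_list(several)]
```

With the normalisation in place, the same loop works for both cases; without it, iterating over `single` would walk over the dictionary keys instead of field instances.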

Proposal: before v2.0.0 is out, go through the JSON definition file, check the field names used and change them to match the MARC standard, check each field's repeatability as declared in the MARC standard (R, NR), and enforce a schema for every field accordingly. Example: for MARC tag 505, the CERN Open Data uses:

formatted_contents_note:
    schema:
        {'formatted_contents_note': {'type': 'list', 'force': True}}
    creator:
        marc, "505__", {'miscellaneous': value['g'], 'title': value['t']}
    producer:
        json_for_marc(), {"505__g": "miscellaneous", "505__t": "title"}

See also https://github.com/cernopendata/opendata.cern.ch/issues/450 and others.
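
When going through the definition file by hand, it is easy to overlook a producer line that does not mirror its creator line. A rough consistency check could look as follows; it works on plain Python dictionaries rather than on the configuration syntax itself, and the function name and data layout are assumptions made for this sketch.

```python
def producer_mismatches(tag, creator, producer):
    """List producer entries that do not mirror the creator mapping.

    tag:      5-character tag with indicators, e.g. "505__"
    creator:  {json_name: subfield_code}, e.g. {"title": "t"}
    producer: {tag_plus_subfield: json_name}, e.g. {"505__t": "title"}
    """
    problems = []
    for marc_key, json_name in producer.items():
        if not marc_key.startswith(tag):
            problems.append((marc_key, "different tag than creator"))
        elif creator.get(json_name) != marc_key[len(tag):]:
            problems.append((marc_key, "subfield not produced by creator"))
    return problems


creator = {"miscellaneous": "g", "title": "t"}
good = {"505__g": "miscellaneous", "505__t": "title"}
bad = {"505__g": "miscellaneous", "500__t": "title"}  # invented mistake

good_report = producer_mismatches("505__", creator, good)
bad_report = producer_mismatches("505__", creator, bad)
```

An empty report means every producer entry points back to a subfield that the creator reads from the same tag.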

kaplun commented 9 years ago

@MSusik (@glouppe) since you already started working on this for representing author records, I think it makes sense for you to also take part in such a sprint.

fjorba commented 9 years ago

How do you express "accept any indicator as valid" (#1722)?

For example, how do you express main author, 100 (http://www.loc.gov/marc/bibliographic/bd100.html) or even more, main title, 245 (http://www.loc.gov/marc/bibliographic/bd245.html)?

MSusik commented 9 years ago

> since you already started working on this for representing author records, I think it makes sense for you to also take part in such a sprint.

Fortunately, our record definitions have schema definitions. Moreover, the author record fields are even more strict as I define schema for every subfield. Check https://github.com/inspirehep/inspire-next/pull/170/files for details.

fjorba commented 9 years ago

@MSusik, @kaplun, @egabancho, sorry, but I fail to see how you express that a 100 field with any first and second indicator is a valid author. Could you help me find it, please?

kaplun commented 9 years ago

@egabancho could it be that @fjorba's question is answered by simply putting:

creator:
      marc, "100%%", {'last': util_split(value['a'], ',', 0),
                      'first': util_split(value['a'], ',', 1),
                      'numeration': value['b'],
                      'title': value['c'],
                      'birth_year': int_util_split(value['d'], '-', 0),
                      'death_year': int_util_split(value['d'], '-', 1),
                      'previous_name': value['i'],
                      'status': value['g'], 'preferred_name': value['q']}

?

fjorba commented 9 years ago

@egabancho, @kaplun, if that is the case, please, please, please, fill all indicators for all fields this way. And only, only, only change the default strictness for those cases where it is absolutely needed.

Please. Or check with your friendly librarian.

And then you could close also ticket #1722, and be happy.

egabancho commented 9 years ago

@kaplun @fjorba the issue with the indicators in JSONAlchemy was solved on pu some time ago ;-)

You could take a look at 7ce4c93e20009031962161dcea302895efaf1170 and there is an example of it in the tests

kaplun commented 9 years ago

@fjorba, as @egabancho mentions, the functionality is there, i.e. one just needs to put "." as a wildcard instead of "_" for the indicators (see the example provided by @egabancho). Can you, as one of our "friendly librarians" (since we at CDS, INSPIRE and Zenodo don't have these use cases), send in a patch for the default config where you need to accept "all" indicators? In this way we can integrate once and for all the support for MARC as expected :smile:
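
How such a wildcard could behave can be shown in a few lines of Python. This is only a sketch of the idea as described in this comment ("." accepts any indicator, "_" stands for a blank one); it is not the JSONAlchemy implementation, and `tag_matches` is a hypothetical name.

```python
def tag_matches(pattern, tag):
    """Compare a 5-character tag (3 digits + 2 indicators) with a pattern.

    In the pattern, "." accepts any character at that position; every other
    character, including "_" for a blank indicator, must match literally.
    """
    if len(pattern) != len(tag):
        return False
    return all(p == "." or p == t for p, t in zip(pattern, tag))


strict = "100__"    # only blank indicators are accepted
relaxed = "100.."   # any first and second indicator is accepted

accepted_by_strict = tag_matches(strict, "1001_")
accepted_by_relaxed = tag_matches(relaxed, "1001_")
```

With the strict pattern a field such as `1001_` is rejected, while the relaxed pattern accepts it without opening the rule to other tags.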

fjorba commented 9 years ago

@egabancho, @kaplun is right. An absolute Invenio newbie should be able to take standard Marc21 records, like those found at https://archive.org/details/ol_data, upload them to her newly installed Invenio and expect them to be indexed and displayed correctly. If not 100% (Marc21 is very large), at least most of it. Yes, without fiddling with esoteric, home-grown, Invenio-specific syntax. Learning those details should only be needed for special, non-standard use cases (maybe like CERN, but unlike the rest of the world).

fjorba commented 9 years ago

Sorry, @kaplun, I misunderstood your request. But I'm afraid I don't fully understand the syntax, as it seems to me that there are some stanzas for input and others for output, and moreover I cannot test it. I don't have a dev version for Invenio.

But again, if Invenio states that it complies with Marc21, and you developers know and master this syntax, why don't you fill all indicators for all fields with a wildcard character (whatever it is) so all of them are recognized as valid values?

After doing that, test it with a few thousand real-world Marc21 records, like those found at the above archive.org address, to check that it works. This way, prospective new Invenio installations won't be frustrated by unmatched claims.
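
Such a test can be as simple as counting how many fields of a sample are recognised by the configured tag patterns. The sketch below assumes that fields are given as 5-character strings (tag plus two indicators) and that "." is the wildcard; the function names are hypothetical, and the sample data is made up and far too small to say anything about real records.

```python
def matches(pattern, field):
    """True if field fits pattern; "." in the pattern accepts any character."""
    return len(pattern) == len(field) and all(
        p == "." or p == f for p, f in zip(pattern, field)
    )


def coverage(patterns, fields):
    """Fraction of fields recognised by at least one pattern."""
    if not fields:
        return 0.0
    hits = sum(any(matches(p, f) for p in patterns) for f in fields)
    return hits / len(fields)


# Invented sample: tags together with the indicators they carry.
sample = ["1001_", "24510", "24514", "260__"]
strict = coverage(["100__", "245__", "260__"], sample)
relaxed = coverage(["100..", "245..", "260.."], sample)
```

Running the same function over a large dump of real records, before and after relaxing the indicators, would give the kind of evidence asked for here.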

kaplun commented 9 years ago

> Sorry, @kaplun, I misunderstood your request. But I'm afraid I don't fully understand the syntax, as it seems to me that there are some stanzas for input and others for output, and moreover I cannot test it. I don't have a dev version for Invenio.

Indeed extensive documentation is still in the pipeline...

The issue, as you know, is that in our use case we have always considered MARC as a simple way to encode bibliographic metadata, without trying to respect the whole MARC standard. Therefore, without the internal expertise to really polish everything, if we simply add wildcards everywhere, don't we risk overdoing it? Maybe indeed not, and adding wildcards everywhere is a quick win...

fjorba commented 9 years ago

@kaplun, it is much, much, much better to add wildcards for all indicators than to keep the current situation, where for example no single title is displayed or indexed as a title. Right, no single real-world Marc21 title (see http://www.loc.gov/marc/bibliographic/bd245.html). It means maybe going from 10% compliance to 90%.

Do you lose or gain with that change?

Kennethhole commented 9 years ago

I agree with @fjorba in this discussion. By following the marc21 standard, indicators are used for most of the fields and should therefore work out of the box.

We are currently setting up Invenio master for a client and migrating their bibliographic records from Voyager, so we have plenty of records following the marc21 standards. I did not quite understand your request @kaplun, but we are happy to contribute back what we are doing to allow indicators.

kaplun commented 9 years ago

I am proposing a PR (#2651) to try to address the specific issue of #1722, but in a naked Invenio installation of pu there is basically no MARC defined (besides in the test suite). I will also send a PR for invenio-demosite with similar proposed changes for Atlantis. Maybe this solves @fjorba's issue?

fjorba commented 9 years ago

Thanks, @Kennethhole. @kaplun, I'm a little bit lost with those PR and pu acronyms (I suspect that pu means «proposed updates», but nothing about PR), and a little scared with what I understand, where you propose «no MARC defined». Does it mean that an absolute Invenio newbie will be able to load her standard Marc21 records and a large percentage of their fields will be displayed and indexed out of the box?

For me, as a veteran Invenio user, having learned how to make indicators work the hard way, a better solution would be: please use the fields that you are using in your sub-standard demo records, but accept any indicator. With this out of the box, we, the community, will happily send you patches to slowly complete as much Marc21 as possible. But we need something that works out of the box, please.

A second step would be: use real Marc21 records. There are millions out there, public domain or CC0, reusable and free to use.

kaplun commented 9 years ago

@fjorba PR stands for "pull request" :smiley: it's GitHub slang. From what I understand, Invenio is nowadays being re-packaged into layers (and sometimes separate modules), with the demosite (Atlantis) available separately. If one does not install the demosite package, the base invenio package will contain basically no support for MARC (in terms of pre-existing configuration). So MARC users starting with Invenio nowadays will also have to take the invenio-demosite (i.e. Atlantis) and adapt it to their needs.

@egabancho, @tiborsimko, is this correct?

jirikuncar commented 9 years ago

> ... no support for MARC ...

The plan is to create a standard configuration for MARC and friends that will be shipped with the invenio package as a base for site customizations. (cc @egabancho)

egabancho commented 9 years ago

What @jirikuncar said is correct: we plan to create a contrib folder inside the record module where we should place any master format that we get from the community, MARC being the first of them. This way, when you install Invenio you can decide which formats you want to enable, and therefore have supported by your installation; plus, you will be able to override them and add new ones.

But, as @kaplun pointed out before, for this kind of standard configuration files the help of our "friendly librarian" would be very much appreciated ;-)
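
The enable-and-override idea could be pictured as layering dictionaries: shipped defaults per master format first, site overrides on top. Everything in this sketch (the names `SHIPPED` and `build_config`, and the field definitions, which are reduced to plain tag patterns) is hypothetical and does not describe the actual layout of the record module.

```python
# Hypothetical shipped defaults, keyed by master format.
SHIPPED = {
    "marc": {"title": "245..", "main_author": "100.."},
    "dc": {"title": "dc:title"},
}


def build_config(enabled, overrides=None):
    """Copy the defaults of each enabled format and apply site overrides."""
    overrides = overrides or {}
    config = {}
    for fmt in enabled:
        fields = dict(SHIPPED[fmt])            # copy, defaults stay untouched
        fields.update(overrides.get(fmt, {}))  # site-specific changes win
        config[fmt] = fields
    return config


# A site enabling only MARC and adding one field of its own.
site = build_config(["marc"], overrides={"marc": {"collection": "980__"}})
```

Formats that are not enabled simply do not appear in the resulting configuration, and the shipped defaults remain unchanged for other sites.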

fjorba commented 9 years ago

@egabancho, I'll be glad to help, as long as I know what I'm doing ;-)

tiborsimko commented 9 years ago

> The plan is to create a standard configuration for MARC and friends that will be shipped with the invenio package as a base for site customizations.

... which is exactly the purpose of this very issue! We are cleaning up the default record JSON configuration file so that it would be MARC standard compliant (and MARC parlance compliant). See the clean-up plan outlined in https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-68843843.

Furthermore, once this plan is implemented, Invenio users will be able to do exactly what @fjorba proposes above: take any MARC record, upload it "as is", and expect things to work out of the box, without having to tweak supplied MARC configuration. @kaplun's comment about "no support for MARC" should not be taken literally on the user level; quite the opposite. MARC is the recommended master format to use, see https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-58026572.

kaplun commented 9 years ago

> @kaplun's comment about "no support for MARC" should not be taken literally on the user level; quite the opposite.

Yes, sorry if I created confusion :innocent: And that's also why I called in you and @egabancho to correct me. My assumption was based on the current pu source code only.

kaplun commented 9 years ago

> ... which is exactly the purpose of this very issue!

Well, not really, as this issue originally talks about atlantis.cfg, and that's where my very personal confusion stems from.

fjorba commented 9 years ago

Great @tiborsimko and @kaplun, thanks a lot!

Of course, I don't know anything about the Invenio internal uses of those configuration files, so may I ask: are they also used for exporting records in those different syntaxes? And for OAI harvesting and publishing?

tiborsimko commented 9 years ago

> this issue originally talks about atlantis.cfg

Yeah, because if at all possible, we'd like to clean the record JSON configuration both in the master and next/pu branches, so that the stored JSON would be "almost" the same, if not fully the same, in Invenio 1 and Invenio 2 sites. In this way it will be easier for sites to upgrade, or to try to run a LABS site out of the same database. (As not many modules use record JSON in Invenio 1, this is not that critical for master-based sites; but it would be important for future 2.0 to 2.1 updates as well.)

tiborsimko commented 9 years ago

> I don't know anything about the Invenio internal uses of those configuration files, so may I ask: are they also used for exporting records in those different syntaxes? And for OAI harvesting and publishing?

In principle, one can produce various needed output formats out of JSON. Also, one can do it from the master format for each concrete record. The two approaches may be illustrated as follows:

   record 123 (=master format MARC)
                                   \
                                    \
                                     ------>  record JSON  --->  DC
                                    /
   record 124 (=master format EAD) /

versus:

   record 123 (=master format MARC) -------------------------> DC
                                   \
                                    \
                                     ------>  record JSON
                                    /
   record 124 (=master format EAD) / ------------------------> DC

In the first case, one standardises every possible master format on the input side to the same JSON, which all the various output channels (such as the production of the Dublin Core output format) use directly.

In the second case, even though one also creates record JSON and uses it for various core indexing and HTML displaying needs, the production of the Dublin Core output format relies on converting master formats, e.g. because one has a MARC2DC style sheet ready, or because it is otherwise more practical to work with the record master format rather than the standardised JSON.

The first option is still a more "remote" possibility, because many Invenio modules still assume that their master format is MARC.
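
The difference between the two approaches can be made concrete with toy converters. The functions below are placeholders written for this illustration (they handle only a title and use an invented record layout); they are not Invenio APIs.

```python
def marc_to_json(marc):
    """Toy master-to-JSON step: take the title from 245 subfield a."""
    return {"title": marc["245"]["a"]}


def json_to_dc(record_json):
    """First approach: Dublin Core is produced from the common record JSON."""
    return {"dc:title": record_json["title"]}


def marc_to_dc(marc):
    """Second approach: Dublin Core is produced directly from the master format."""
    return {"dc:title": marc["245"]["a"]}


marc_record = {"245": {"a": "An example title"}}

via_json = json_to_dc(marc_to_json(marc_record))   # MARC -> record JSON -> DC
direct = marc_to_dc(marc_record)                   # MARC -> DC
```

In the first approach a single `json_to_dc` serves every master format; in the second, one direct converter is needed for each combination of master format and output format.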

BTW you can read some documentation about record JSON configuration format and possibilities at http://invenio.readthedocs.org/en/latest/modules/jsonalchemy.html

fjorba commented 9 years ago

Thank you, @tiborsimko, it looks promising!

jirikuncar commented 9 years ago

Should the correct milestone be 1.2.x or 1.x? In Invenio 2 we should wait for the new pythonic JSONAlchemy coming in ~2.2.

tiborsimko commented 8 years ago

Superseded by Invenio 3. General MARC clean-up happened in https://github.com/inveniosoftware/dojson and https://github.com/inveniosoftware/invenio-marc21