Closed tiborsimko closed 8 years ago
CC-ing @aw-bib who may perhaps want to give the logical field and physical MARC tag configuration a look?
In relation to this discussion, Annette Holtkamp had a look a few weeks ago on drafting up a possible default configuration. I have the initial draft here: https://github.com/jalavik/invenio/tree/bibfield-default-config-draft
Although it may change/remove too much, it can perhaps be part of a closer analysis/discussion offline. Perhaps a nicer merge (instead of replace) of the current config and her draft should be made.
Thanks, I had a quick glance, here are random comments:
111 was killed, but Invenio demo site still uses this field. Perhaps the file is too INSPIRE specific? Ideally we'll have to come up with something describing MARC more generally, so any site could use it.
number_of_citations to cited_by_count? Firstly, it breaks harmony with other number_of_... virtual fields such as number_of_comments or number_of_copies; secondly "cited by count" is incomplete English and may read like "cited by Count Dooku" :). Also, the string number_of_citations occurs elsewhere in the code base, so this term would have to be updated elsewhere too as part of the same commit.
author field separation seems nice!
additional_corporate_names or authors, but contributor. The second should be plural as well.
Unfortunately, Annette is harder to reach now, but I can try to merge more nicely the changes she meant to do with what has been added/changed since then. I will also try to address your comments (with which I fully agree) and confirm with Annette. Thanks!
What do you think about what bibjson proposed some time ago?
"identifier": [
{"type":"doi",
"id":"10.1000/182",
"url":"http://dx.doi.org/10.1000/182"}
]
Maybe it will be a bit messy afterwards if one wants to process the identifiers.
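One way to keep such processing manageable might be to group the list by type once and then look identifiers up by name. A minimal sketch, assuming the bibjson-style structure quoted above; the helper name identifiers_by_type is hypothetical and is not part of Invenio or bibjson:

```python
from collections import defaultdict


def identifiers_by_type(record):
    """Group a bibjson-style "identifier" list by its "type" key.

    Hypothetical helper for illustration; not an Invenio or bibjson API.
    """
    grouped = defaultdict(list)
    for identifier in record.get("identifier", []):
        grouped[identifier["type"]].append(identifier["id"])
    return dict(grouped)


record = {
    "identifier": [
        {"type": "doi",
         "id": "10.1000/182",
         "url": "http://dx.doi.org/10.1000/182"},
    ]
}

# All DOIs of the record, or an empty list when there are none.
dois = identifiers_by_type(record).get("doi", [])
```

The grouping costs one pass over the list, after which each identifier type is a plain dictionary lookup.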
I'm trying to clean up all the field definitions, set proper names, docs, etc. but I found several inconsistencies between MARC21 and Invenio names and, honestly, I don't know what direction to take.
There are a few fields with naming issues, like 037 which for MARC means Source of Acquisition and for Invenio is primary_report_number.
In this specific case, if we follow Invenio names the field would be something like:
primary_report_number:
    creator:
        marc, "037__", value['a']
    producer:
        json_for_marc(), {"037__a": ""}
On the other hand if we follow MARC21 definition:
source_of_acquisition:
    creator:
        marc, "037__", {'stock_number': value['a'], 'source_of_acquisition': value['b']}
    producer:
        json_for_marc(), {'037__a': 'stock_number', '037__b': 'source_of_acquisition'}
What do you think? Should we stick to MARC or should we "invent" our own names? @aw-bib, @fjorba do you mind taking a look at this?
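To make the difference concrete, here is a rough sketch of the record JSON each option would yield for the same 037 field. It is plain Python, not JSONAlchemy; the function names and the sample subfield values are made up for illustration only:

```python
def to_invenio_style(subfields):
    """Option 1: keep the Invenio name; only subfield $a is kept."""
    return {"primary_report_number": subfields.get("a")}


def to_marc21_style(subfields):
    """Option 2: follow MARC21 naming; $a and $b are both kept."""
    return {
        "source_of_acquisition": {
            "stock_number": subfields.get("a"),
            "source_of_acquisition": subfields.get("b"),
        }
    }


# Made-up 037 field content, for illustration only.
field_037 = {"a": "EXAMPLE-REPORT-001", "b": "Example Source"}

invenio_json = to_invenio_style(field_037)
marc21_json = to_marc21_style(field_037)
```

With the Invenio name, the value is a plain string and $b is dropped; with the MARC21 name, the value becomes a dictionary, so anything that reads the field has to change along with the name.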
What do you think? Should we stick to MARC or should we "invent" our own names?
From the point of view of Invenio software, the best is probably to stay neutral in any naming games, i.e. to use preferably an already established schema. This would call for using MARC names, even if sometimes they may not be intuitive...
@tiborsimko, @egabancho what is today the real implementation of a record? What is stored as master format? Is indeed MARC represented into the configuration in a way that is possible to map to and from JSON?
What about those fields that yesterday were stored in 9xx tags, and are therefore specific to Invenio (980 for collection, 999 for reference)? Is it at all required to still specify a mapping towards MARC in the default configuration or can we happily just put an entry reference and collection? In this case how is this created/produced? How do you specify the sub-parts? It looks to me that one always has to specify a mapping towards at least MARC, for a field to define its structure... is this correct?
@Kennethhole and @egabancho have been working on this in recent months, so I wonder whether we can co-sprint on this issue to fix it before forthcoming v2.0.0 is out. The motivation is to start with relatively "clean" JSON model, so that we don't force sites to change their data model once this issue will be fully completed.
One of the biggest inconveniences in my eyes is the behaviour of JSON field return types when no schema is declared. For example, a field can return either a dictionary or a list of dictionaries, depending on whether there is only one field instance, or more of them. It seems safer to always declare schema everywhere in order to always return the same type.
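Until every field has a schema, calling code has to defend itself against both shapes. A minimal sketch of such a guard, in plain Python; as_list is a hypothetical helper and the sample values are made up:

```python
def as_list(value):
    """Return value as a list, whatever shape the field came back in."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# The same field as it may come back without a declared schema.
single = {"title": "Chapter 1"}
several = [{"title": "Chapter 1"}, {"title": "Chapter 2"}]

titles_single = [item["title"] for item in as_list(single)]
titles_several = [item["title"] for item in as_list(several)]
```

Declaring the schema once in the configuration removes the need to repeat this guard at every call site.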
Proposal: before v2.0.0 is out, go through the JSON definition file, check used field names and change them to match MARC standard, check declared field repetitiveness in the MARC standard (R, NR) and enforce schema for every field accordingly. Example: for MARC tag 505, the CERN Open Data uses:
formatted_contents_note:
    schema:
        {'formatted_contents_note': {'type': 'list', 'force': True}}
    creator:
        marc, "505__", {'miscellaneous': value['g'], 'title': value['t']}
    producer:
        json_for_marc(), {"505__g": "miscellaneous", "505__t": "title"}
See also https://github.com/cernopendata/opendata.cern.ch/issues/450 and others.
@MSusik (@glouppe) since you already started working on this for representing author records, I think it makes sense for you to also take part to such sprint.
How do you express "accept any indicator as valid" (#1722)?
For example, how do you express main author, 100 (http://www.loc.gov/marc/bibliographic/bd100.html) or even more, main title, 245 (http://www.loc.gov/marc/bibliographic/bd245.html)?
since you already started working on this for representing author records, I think it makes sense for you to also take part to such sprint.
Fortunately, our record definitions have schema definitions. Moreover, the author record fields are even more strict as I define schema for every subfield. Check https://github.com/inspirehep/inspire-next/pull/170/files for details.
@MSusik, @kaplun, @egabancho, sorry, but I fail to see how you express that a 100 field with any first and second indicator is a valid author. Could you help me find it, please?
@egabancho could it be that @fjorba's question is answered by simply putting:
creator:
    marc, "100%%", {'last': util_split(value['a'], ',', 0),
                    'first': util_split(value['a'], ',', 1),
                    'numeration': value['b'],
                    'title': value['c'],
                    'birth_year': int_util_split(value['d'], '-', 0),
                    'death_year': int_util_split(value['d'], '-', 1),
                    'previous_name': value['i'],
                    'status': value['g'], 'preferred_name': value['q']}
?
@egabancho, @kaplun, if it is the case, please, please, please, fill all indicators for all fields this way. And only, only, only change the default strictness for those cases where it is absolutely needed.
Please. Or check with your friendly librarian.
And then you could close also ticket #1722, and be happy.
@kaplun @fjorba the issue with the indicators in JSONAlchemy has been solved on pu some time ago ;-)
You could take a look at 7ce4c93e20009031962161dcea302895efaf1170 and there is an example of it in the tests
@fjorba, as @egabancho mentions, the functionality is there: one just needs to put "." as a wildcard instead of "_" for the indicators (see the example provided by @egabancho). Since CDS, INSPIRE and Zenodo don't have these use cases, can you, as one of our "friendly librarians", send in a patch for the default config where you need to accept "all" indicators? In this way we can integrate once and for all the support for MARC as expected :smile:
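For readers unfamiliar with the convention, the effect of "." versus "_" can be sketched as follows. This is an illustration in plain Python, not JSONAlchemy's actual matching code; tag_matches is a hypothetical helper that treats "_" as a blank indicator and "." as any indicator:

```python
import re


def tag_matches(pattern, tag, ind1=" ", ind2=" "):
    """Check a 5-character pattern such as "245.." or "037__" against a field.

    "_" in the pattern stands for a blank indicator and "." for any
    indicator. Illustrative only: this is not JSONAlchemy's matching code.
    """
    # Blank indicators are written as "_" so they can be compared as text.
    actual = tag + (ind1.strip() or "_") + (ind2.strip() or "_")
    # Escape everything except ".", which stays a single-character wildcard.
    regex = "".join("." if char == "." else re.escape(char) for char in pattern)
    return re.fullmatch(regex, actual) is not None


# A title field with indicators "1" and "0", as is common in real MARC21 data.
strict = tag_matches("245__", "245", "1", "0")
relaxed = tag_matches("245..", "245", "1", "0")
```

Under these assumptions the strict pattern rejects the title field while the wildcard pattern accepts it, which is the behaviour requested in #1722.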
@egabancho, @kaplun is right. An absolute Invenio newbie should be able to take standard Marc21 records, like those found at https://archive.org/details/ol_data, upload to her newly installed Invenio and expect them to be indexed and displayed correctly. If not 100% (Marc21 is very large), at least most of it. Yes, without fiddling with esoteric, home grown Invenio specific syntax. Learning those details should be only needed for special, non-standard use cases (maybe like CERN, but unlike the rest of the world).
Sorry, @kaplun, I misunderstood your request. But I'm afraid I don't fully understand the syntax, as it seems to me that there are some stanzas for input and others for output, and moreover I cannot test it. I don't have a dev version for Invenio.
But again, if Invenio states that it complies with Marc21, and you developers know and master this syntax, why don't you fill all indicators for all fields with a wildcard character (whatever it is) so all of them are recognized as valid values?
After doing that, test it with a few thousand real world Marc21 records, like those found at the above archive.org address, to check that it works. This way, prospective new Invenio installations won't be frustrated with unmatched claims.
Sorry, @kaplun, I misunderstood your request. But I'm afraid I don't fully understand the syntax, as it seems to me that there are some stanzas for input and others for output, and moreover I cannot test it. I don't have a dev version for Invenio.
Indeed extensive documentation is still in the pipeline...
The issue, as you know, is that in our use case we have always considered MARC as a simple way to encode bibliographic metadata, without trying to respect the whole MARC standard. Therefore, without the internal expertise to really polish everything, if we simply add wildcards everywhere, don't we risk overdoing it? Maybe indeed not, and adding wildcards everywhere is a quick win...
@kaplun, it is much, much, much better to add wildcards for all indicators, than the current situation, where for example no single title is displayed or indexed as title. Right, no single real world Marc21 title (see http://www.loc.gov/marc/bibliographic/bd245.html). It means maybe like going from 10% compliance to 90%.
Do you lose or gain, with that change?
I agree with @fjorba in this discussion. By following the marc21 standard, indicators are used for most of the fields and should therefore work out of the box.
We are currently setting up Invenio master for a client and migrating their bibliographic records from Voyager, so we have plenty of records following the marc21 standards. I did not quite understand your request @kaplun, but we are happy to contribute back what we are doing to allow indicators.
I am proposing a PR (#2651) to try to address the specific issue of #1722 but in a naked invenio installation of pu there is basically no MARC defined (besides in the testsuite). I will also send a PR for invenio-demosite with similar proposed changes for Atlantis. Maybe this solves @fjorba's issue?
Thanks, @Kennethhole. @kaplun, I'm a little bit lost with those PR and pu acronyms (I suspect that pu means «proposed updates», but nothing about PR), and a little scared with what I understand, where you propose «no MARC defined». Does it mean that an absolute Invenio newbie will be able to load her standard Marc21 records and a large percentage of their fields will be displayed and indexed out of the box?
For me, as a veteran Invenio user, having learned how to make indicators work the hard way, a better solution would be: please use the fields that you are using in your sub-standard demo records. But accepting any indicator. With this out of the box, we, the community, will happily send you patches to slowly complete as much Marc21 as possible. But we need something that works out of the box, please.
A second step would be: use real Marc21 records. There are millions out there, public domain or CC0, reusable and free to use.
@fjorba PR stands for "pull request" :smiley: it's GitHub slang. From what I understand, Invenio nowadays is being re-packaged into layers (and sometimes separate modules), with the demosite (Atlantis) available separately. If one does not install the demosite package, the base invenio package will contain basically no support for MARC (in terms of pre-existing configuration). So MARC users starting nowadays with Invenio will have to also take the invenio-demosite (i.e. Atlantis) and transform it to their needs.
@egabancho, @tiborsimko, is this correct?
... no support for MARC ...
The plan is to create standard configuration for MARC and friends that will be shipped with invenio package as base for site customizations. (cc @egabancho)
What @jirikuncar said is correct, we plan to create a contrib folder inside the record module where we should place any master format that we get from the community, MARC being the first of them.
This way when you install Invenio you can decide which formats you want to enable, and therefore be supported by your installation, plus you will be able to override them and add new ones.
But, as @kaplun pointed out before, for this kind of standard configuration files the help of our "friendly librarian" would be very much appreciated ;-)
@egabancho, I'll be glad to help, as long as I know what I'm doing ;-)
The plan is to create standard configuration for MARC and friends that will be shipped with invenio package as base for site customizations.
... which is exactly the purpose of this very issue! We are cleaning up the default record JSON configuration file so that it would be MARC standard compliant (and MARC parlance compliant). See the clean-up plan outlined in https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-68843843.
Furthermore, once this plan is implemented, Invenio users will be able to do exactly what @fjorba proposes above: take any MARC record, upload it "as is", and expect things to work out of the box, without having to tweak supplied MARC configuration. @kaplun's comment about "no support for MARC" should not be taken literally on the user level; quite the opposite. MARC is the recommended master format to use, see https://github.com/inveniosoftware/invenio/issues/2075#issuecomment-58026572.
@kaplun's comment about "no support for MARC" should not be taken literally on the user level; quite the opposite.
Yes, sorry if I created confusion :innocent: And that's also why I called in you and @egabancho to correct me. My assumption was based on the current pu source code only.
... which is exactly the purpose of this very issue!
Well not really, as this issue originally talks about atlantis.cfg and that's where my very personal confusion stems from.
Great @tiborsimko and @kaplun, thanks a lot!
Of course, I don't know anything about the Invenio internal uses of those configuration files, so may I ask: are they used also for those different record syntax exporting? And for OAI harvesting and publishing?
this issue originally talks about atlantis.cfg
Yeah, because if at all possible, we'd like to clean record JSON configuration both in master and next/pu branches, so that the stored JSON would be "almost" the same, if not fully the same, in Invenio 1 and Invenio 2 sites. In this way it will be easier for sites to upgrade, or try to run LABS site out of the same database. (Although not many modules use record JSON in Invenio 1, so this is not that critical for master based sites; but it would be important for future 2.0 to 2.1 updates as well.)
I don't know anything about the Invenio internal uses of those configuration files, so may I ask: are they used also for those different record syntax exporting? And for OAI harvesting and publishing?
In principle, one can produce various needed output formats out of JSON. Also, one can do it from the master format for each concrete record. The two approaches may be illustrated as follows:
record 123 (=master format MARC) \
                                  \
                                   ------> record JSON ---> DC
                                  /
record 124 (=master format EAD)  /
versus:
record 123 (=master format MARC) -------------------------> DC
                                 \
                                  \
                                   ------> record JSON
                                  /
record 124 (=master format EAD)  / ------------------------> DC
In the first case, one standardises every possible master format on the input side to the same JSON, that all the various output channels (such as production of Dublin Core output format) use directly.
In the second case, even though one also creates record JSON and uses it for various core indexing and HTML displaying needs, the production of the Dublin Core output format relies on converting master formats, e.g. because one has a MARC2DC style sheet ready, or because it is otherwise more practical to work with the record master format rather than standardised JSON.
The first option is still kind of "remoter" possibility, because many Invenio modules still assume that their master format is MARC.
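The two approaches can be sketched in a few lines of plain Python. Everything here is a toy under stated assumptions: the converter functions, the flat dictionaries standing in for MARC and EAD records, and the field names are all placeholders, not Invenio code:

```python
def marc_to_json(marc):
    """Toy MARC -> record JSON step (field names are placeholders)."""
    return {"title": marc["245__a"], "creator": marc["100__a"]}


def ead_to_json(ead):
    """Toy EAD -> record JSON step (field names are placeholders)."""
    return {"title": ead["unittitle"], "creator": ead["origination"]}


def json_to_dc(record_json):
    """First approach: one DC writer shared by every master format."""
    return {"dc:title": record_json["title"],
            "dc:creator": record_json["creator"]}


def marc_to_dc(marc):
    """Second approach: a dedicated converter per master format."""
    return {"dc:title": marc["245__a"], "dc:creator": marc["100__a"]}


record_123 = {"245__a": "Example title", "100__a": "Doe, J."}
record_124 = {"unittitle": "Example papers", "origination": "Doe, J."}

dc_via_json = json_to_dc(marc_to_json(record_123))  # first approach
dc_direct = marc_to_dc(record_123)                  # second approach
dc_from_ead = json_to_dc(ead_to_json(record_124))   # first approach, EAD input
```

In the first approach, adding an output format means writing one function over record JSON; in the second, it means one converter per master format, in exchange for being able to reuse existing format-specific style sheets.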
BTW you can read some documentation about record JSON configuration format and possibilities at http://invenio.readthedocs.org/en/latest/modules/jsonalchemy.html
Thank you, @tiborsimko, it looks promising!
Should the correct milestone be 1.2.x or 1.x? In Invenio 2 we should wait for the new pythonic JSONAlchemy coming in ~2.2.
Superseded by Invenio 3. General MARC clean-up happened in https://github.com/inveniosoftware/dojson and https://github.com/inveniosoftware/invenio-marc21
Similarly to #1557, there are more clean-ups to be done in atlantis.cfg. For example, in invenio.conf we have:
while in atlantis.cfg we have:
and:
Notice the naming differences: e.g. "oai.indicator" should rather be called "oai.set"; "oai.value" should rather be called "oai.id".
Historically, the 909CO MARC tag in Invenio Atlantis defaults was used to store OAI information. Later, 0248 was chosen on CDS and elsewhere as a better MARC tag location. Invenio Atlantis defaults were not changed though, which permitted e.g. to test easily how Invenio is configurable for various site conditions.
However, as it is agreed that 0248 is a better OAI location, we should probably move defaults there, so that new Invenio installations are using by default "good" MARC tags already.
(This is similar to 773 vs 909C4, etc, see #1557.)
P.S. See also other FIXMEs in atlantis.cfg. It would be useful to go through the file, fix all FIXMEs, and fix demobibdata.xml and democfgdata.sql and invenio.conf accordingly.