Open jmartinm opened 7 years ago
@inveniosoftware-contrib/inspire-content we are ready to implement this. Can we complete a list of errors and warnings that should be triggered in Literature records?
Warning:
Error:
Many fields in the schema are conditional on the document_type, so an error should be raised if the document_type is not present but the other fields are. Maybe it makes more sense to have the document_type as the first entry of the record editor, and have default fields that depend on this choice, or group fields by document_type and have a way to hide them (like on the submission form).
book_series requires document_type in {book, proceedings, thesis}
isbns requires document_type in {book, proceedings, thesis}
cnum requires document_type in {proceedings, conference paper}
thesis_info requires document_type = thesis
collaborations requires accelerator_experiments
document_type: proceedings requires cnum
document_type: conference paper requires cnum
document_type: thesis requires thesis_info
Note that there are a bunch of formally invalid ISBNs around. Most likely if it's not 10 or 13 digits and invalid it should be a warning only. (BTW: do you set ISBNs for chapters in books, if any?)
One might (should?) be tempted to hook up ISBN entry with a call against a catalogue and try to import data if not there already. (@jmartinm should have a piece of code for GVK import.)
ISSN would be a valuable field that may be present for articles and book series.
> Note that there are a bunch of formally invalid ISBNs around. Most likely if it's not 10 or 13 digits and invalid it should be a warning only.
What are invalid ISBNs useful for? If they are invalid, we know that no actual book can possibly correspond to them.
> (BTW: do you set ISBNs for chapters in books, if any?)
For chapters in books, we have parent_isbn, which is a different field.
> What are invalid ISBNs useful for?
I'd like to refer this question to the publishing industry...
> if they are invalid, we know that no actual book can possibly correspond to it.
I fear this assumption does not hold. There exist formally wrong ISBNs (e.g. wrong checksum) that publishers generate and use just like valid ISBNs. Sometimes they are corrected, sometimes not. Nevertheless, they are usually printed in the book and thus exist in hardware. They are common enough that they made it into the standard: MARC 020 subfield $z.
Side note: quite a few multi-volume books share the same ISBN, i.e. you cannot assume that the ISBN is unique either. Even if you merge subsequent editions of the same book into one record (effectively getting rid of a number of these issues), this may be relevant if you have individual entries for the volumes. Additionally, if you strip the - you may have two "identical" numbers that are not the same. (Chances are small for INSPIRE, however. Usually those dupes result from very small publishers. It's quite a common problem for poetry...)
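The checksum point above can be made concrete with a minimal sketch. The function name `check_isbn` and its three return labels are hypothetical, not part of any INSPIRE code; whether a given label ends up as a warning or an error is left to the policy still being discussed in this thread.

```python
import re


def check_isbn(isbn: str) -> str:
    """Classify an ISBN string as 'valid', 'bad_checksum' or 'malformed'.

    Hypothetical helper: the labels are illustrative only.
    """
    # Drop hyphens and whitespace before looking at the digits.
    digits = re.sub(r"[\s-]", "", isbn).upper()

    if re.fullmatch(r"\d{9}[\dX]", digits):
        # ISBN-10: weights 10..1, 'X' counts as 10, sum must be divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(digits)
        )
        return "valid" if total % 11 == 0 else "bad_checksum"

    if re.fullmatch(r"\d{13}", digits):
        # ISBN-13: weights alternate 1, 3, sum must be divisible by 10.
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits))
        return "valid" if total % 10 == 0 else "bad_checksum"

    # Neither 10 nor 13 characters of the expected shape.
    return "malformed"
```

Keeping "bad_checksum" separate from "malformed" lets a caller treat an ISBN that is printed in a book with a wrong check digit differently from a string that cannot be an ISBN at all.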
Yeah, INSPIRE has very few books, and we can run a check to ensure that the current 2K ISBNs we know are valid. Then the chance of having an important HEP book with a broken ISBN becomes really small.
Turns out we have 36 records with invalid ISBNs on INSPIRE (when stripping some extra crap in the $$a field that does not belong there). I created an Asana task for this.
Actually I think, going field by field we can come up with tons of warnings. E.g.:
@kaplun if you have a working or even good checker for "valid latex" could you drop me a note by pm? TIA.
List updated
book_series title matches title from journal collection (in that case, publication_info should be used)
value should be unique in array, so same value but different sources should be an error
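The uniqueness rule could be checked along these lines. This is only a sketch: the function name `duplicate_values` is hypothetical, and it assumes each array entry is a dict with a `value` key and an optional `source` key.

```python
from collections import defaultdict


def duplicate_values(entries):
    """Return {value: [sources]} for every value that occurs more than once.

    Assumes entries look like {"value": ..., "source": ...} (an assumption
    about the record shape, not taken from the schema).
    """
    sources_by_value = defaultdict(list)
    for entry in entries:
        sources_by_value[entry["value"]].append(entry.get("source"))
    # A value listed under two or more entries violates the uniqueness rule,
    # even when the sources differ.
    return {
        value: sources
        for value, sources in sources_by_value.items()
        if len(sources) > 1
    }
```

An empty result means the array passes; a non-empty result names the offending values and the sources that contributed them.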
Here we can keep track of all the complex validators (dependent on INSPIRE data model) that can be implemented and that should issue either warnings or errors.
The validation might be implemented on the Python side and the editor will just call the URL to get the validation warnings.
Errors
book_series requires document_type in {book, proceedings, thesis}
isbns requires document_type in {book, proceedings, thesis}
cnum requires document_type in {proceedings, conference paper}
thesis_info requires document_type in {thesis}
Warnings
publication_information field, 2 cnums are possible in rare situations. Having 2 cnums should trigger a warning.
collaborations requires accelerator_experiments
document_type: proceedings requires cnum
document_type: conference paper requires cnum
cnum present and document_type not in {proceedings, conference_paper}
document_type: thesis requires thesis_info
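A Python-side validator for the rules listed above might be sketched as follows. It makes several simplifying assumptions that are not in the real schema: the record is a flat dict, `document_type` is a single string, and `cnum` is a list of strings. The names `ERROR_RULES`, `WARNING_REQUIRES` and `validate` are hypothetical.

```python
# Field -> document types it is allowed with (violations are errors).
ERROR_RULES = {
    "book_series": {"book", "proceedings", "thesis"},
    "isbns": {"book", "proceedings", "thesis"},
    "cnum": {"proceedings", "conference paper"},
    "thesis_info": {"thesis"},
}

# Document type -> field it should come with (violations are warnings).
WARNING_REQUIRES = {
    "proceedings": "cnum",
    "conference paper": "cnum",
    "thesis": "thesis_info",
}


def validate(record):
    """Return {"errors": [...], "warnings": [...]} for a flat record dict."""
    doc_type = record.get("document_type")
    errors, warnings = [], []

    for field, allowed in ERROR_RULES.items():
        # A missing document_type also fails this test, as the thread asks.
        if record.get(field) and doc_type not in allowed:
            errors.append(f"{field} requires document_type in {sorted(allowed)}")

    if record.get("collaborations") and not record.get("accelerator_experiments"):
        warnings.append("collaborations requires accelerator_experiments")

    required = WARNING_REQUIRES.get(doc_type)
    if required and not record.get(required):
        warnings.append(f"document_type: {doc_type} requires {required}")

    # Two cnums are possible in rare situations, so only warn.
    if len(record.get("cnum", [])) > 1:
        warnings.append("more than one cnum present")

    return {"errors": errors, "warnings": warnings}
```

The returned dict is the kind of payload the editor could fetch from a validation URL; keeping the rules in two tables means new entries can be added without touching the function body.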