roadmap: preparing data model

kaplun commented 9 years ago

Dear @fschwenn, @aw-bib, @ksachs, @tsgit (and whoever wants to take part), this ticket wants to aggregate in one place all the steps needed to complete the new INSPIRE data model.

[x] Preparing HEP schema: https://github.com/inspirehep/inspire-next/blob/master-elasticsearch/inspire/dojson/hep/schemas/hep-0.0.1.json (see against https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkup)
[x] Preparing HepNames schema: https://github.com/inspirehep/inspire-next/blob/master-elasticsearch/inspire/dojson/hepnames/schemas/hepnames-0.0.1.json (see against https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupHepnames)
[x] Preparing missing schemas:
- [x] Institutions (see https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupInstitutions)
- [x] Conferences (see https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupConferences)
- [x] Jobs (see https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupJobs)
- [x] Experiments (see https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupExperiments)
- [x] Journals (see https://twiki.cern.ch/twiki/bin/view/Inspire/DevelopmentRecordMarkupJournals)
[ ] Addressing all FIXME in hep-0.0.1.json (@fschwenn, @aw-bib, @ksachs, @tsgit we'll need your help here):
- [ ] Deciding what should be an enum, and what values it should have
- [ ] Deciding on abandoning outdated fields
- [ ] Finalizing way of relating to other record types
[ ] Cleaning up outlying values in production. Again, this might require a distributed effort across curators. I have prepared a set of statistical data (see: https://github.com/inspirehep/inspire-next/tree/master-elasticsearch/inspire/dojson/current_marcxml_usage)

cleggm1 commented 9 years ago

Experiments use case - displaying current and previous spokespeople:

A user supplies information for us to update the current spokesperson for an experiment. The current display shows all spokespeople with no differentiation. The user comments that one of the spokespeople is no longer a spokesperson. While we store the information on who is the current spokesperson and the dates of this status, the display is ambiguous.

Current display: Spokesperson: Shutt, Thomas Alan; Nelson, Harry N.

Current MARC:

001262631 702 $$aShutt, Thomas Alan$$e2014$$iINSPIRE-00261399

001262631 702 $$aNelson, Harry N.$$d2014$$iINSPIRE-00110832$$zCurrent

kaplun commented 9 years ago

Hi Melissa, this issue is specifically about modelling what we store (and hence curate), not how we display it. Could you add it as a new issue?

kaplun commented 9 years ago

For those wishing to help me with the data model: to see statistical information about current MARC usage for each record type, have a look at the corresponding files in https://github.com/inspirehep/inspire-next/tree/master-elasticsearch/inspire/dojson/current_marcxml_usage

In order to contribute, you can create a dedicated GitHub issue and refer to this issue, #265.

fschwenn commented 9 years ago

I have a few (first) comments on the HEP model:

Should the enums really be inside the schema? For ISBNs it might be ok for the next years but for publishers it's rather some dynamic knowledgebase. Many of the enums might look as if they are sufficient but panta rhei.

It's a bit dangerous to start from the existing data model. We might end up with the same problems we have now. We definitely should have a look at all the cases which drove us to despair within the existing data model ;-)

"funding_info" FIXME: Do we care about this? So far only 349 records were tagged and all for a single EU project.

This is a political decision for DIR. From the curation point of view, funds are a nightmare because in most cases they are just somewhere in the full text but not in the "usual metadata".

"isbn" FIXME: this really needs to be an enum and cleaned up. What is Print?!

Print is the generic term for hardcover and paperback.

"abstract" FIXME: is there an enumerable list of sources?

No. New sources can pop up at any time.

"abstract"

Do we want to have all the abstracts of different arXiv-versions? If yes, we have to know which is the most recent.

"imprint" FIXME: an enum?

No. See above.

"titles"

Maybe we should also have "language" there.

"thesis_supervisor",

I think it should use the same object as "authors".

"thesis" FIXME: shall we match these with the institution database? I guess so.

Yes! "university" should be the same object as "affiliation". There will also be special cases where degrees are not bestowed by a university.

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

"publication_info", "conference_paper_info" FIXME: This is currently the CNUM

I guess we should keep two: the CNUM and a free-text field.

"publication_info", "page_range" FIXME: for ejournals this could be the page index, but there is no reliable way to know whether something is a page index or a first page, is there?

We need two different entries: "page_range" AND "article ID". There are cases where both exist at the same time! Of course it is not possible to make this distinction retroactively for all records. But with little effort one could do it for a very large fraction.

"publication_info" FIXME: Shall we split conference information away?

No. I would very much like to have all publication infos together.

"publication_info" FIXME: shall we move the DOI and ISBN next to where it belongs? So that we can also align erratum and friends?

+1. I would very much like to have all publication infos together.

"publication_info", "year"

Can in fact be more than 1 year: http://cis01.central.ucv.ro/pauc/vol/1994_1995_4_5/1994-1995_92-99.pdf

"reference"

I would like to keep several lists of references, typically arXiv and publisher, as for the abstract.

"reference", "report_number"

Should we make an extra entry for arXiv?

"copyright" FIXME: should we restrict this to an enum, or not?

Again, an enum is good in principle for enforcing a unique way of writing a publisher, but the list should be easily extensible.

"thesaurus_terms" FIXME: What... is... that!?

"energy_range": {
    "maximum": 8,
    "description": "It encodes the energy of the experiment or reaction; the energy is below '10**(energy_range / 2) GeV' for energy_range < 7, below 10 TeV for energy_range 7, or above 10 TeV for 8"
}

"thesaurus_terms"

It might be useful to distinguish INSPIRE keywords from INSPIRE reactions. Should author's or publisher's keywords also be stored here? I could not find a replacement for 6531.
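The energy_range encoding quoted above can be made concrete with a small decoder. This is only a sketch of the rule as the description states it; the function name and the accepted lower bound of 0 are assumptions, since the schema excerpt only gives the maximum of 8.

```python
def energy_range_label(code: int) -> str:
    """Translate an energy_range code into the bound it stands for.

    Assumption: codes run from 0 to 8. Only the maximum (8) comes from
    the schema description; the minimum is a guess for this sketch.
    """
    if not 0 <= code <= 8:
        raise ValueError(f"energy_range out of bounds: {code}")
    if code < 7:
        # Rule from the description: below 10**(energy_range / 2) GeV
        bound_gev = 10 ** (code / 2)
        return f"below {bound_gev:g} GeV"
    if code == 7:
        return "below 10 TeV"
    return "above 10 TeV"


for code in (2, 6, 7, 8):
    print(code, energy_range_label(code))
```

Note that code 7 is a special case: the formula would give about 3.2 TeV, while the description says 10 TeV, so the formula cannot simply be applied to every code.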

"experiment" Was the experiment actually proofchecked by a cataloguer?

Yes.

"arxiv_eprints"

"pattern": "\d{4}.\d{4}{5}|\w+-+/\d+" could be even "pattern": "\d{4}.\d{4}{5}|\w+-+/\d{7}", right?

"authors", "name"

"format": ".+, .+" is too restrictive: http://www.ihep.ac.cn/english/conference/icrc2011/paper/proc/v9/v9_1348.pdf

For "email" the format could be a bit more restrictive, something like '.@..[a-z]+'
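To make the arxiv_eprints pattern question above testable, here is a minimal sketch in Python. The identifier shapes are assumptions of this example, not taken from the schema: new-style identifiers as four digits, a literal dot, and four or five digits; old-style identifiers as an archive name, an optional two-letter subject class, a slash, and exactly seven digits.

```python
import re

# Assumed identifier shapes (not copied from hep-0.0.1.json):
#   new style: YYMM.NNNN or YYMM.NNNNN, e.g. 1501.00001
#   old style: archive[.SC]/YYMMNNN,    e.g. hep-th/9901001
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}|[a-z-]+(\.[A-Z]{2})?/\d{7}")


def is_arxiv_id(value: str) -> bool:
    """Return True when the whole string looks like an arXiv identifier."""
    return ARXIV_ID.fullmatch(value) is not None


for candidate in ("1501.00001", "hep-th/9901001", "hep-th/99010", "1501x00001"):
    print(candidate, is_arxiv_id(candidate))
```

Escaping the dot matters here: an unescaped "." matches any character, so a string such as 1501x00001 would otherwise pass. Requiring exactly seven digits for the old style is the tightening suggested in the comment.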

"citeable" FIXME: can this be derived from other properties?

+1

"url"

"size" in which units?

aw-bib commented 9 years ago

Hi!

Should the enums really be inside the schema?

I came to this question as well. I usually have the gut feeling that an authority link scales better and is easier to maintain. I also understood, from the discussion in https://github.com/inveniosoftware/dojson/issues/23, that changes in the schema are technically a database conversion. However, I understood that dojson works that way. I'm not sure that I like this part yet.

For ISBNs it might be ok for the next years but for publishers it's rather some dynamic knowledgebase. Many of the enums might look as if they are sufficient but panta rhei.

Publishers will be difficult, indeed, I think.

It's a bit dangerous to start from the existing data model. We might end up with the same problems we have now. We definitely should have a look at all the cases which drove us to despair within the existing data model ;-)

+1

"funding_info" FIXME: Do we care about this? So far only 349 records were tagged and all for a single EU project.

This is a political decision for DIR. From the curation point of view, funds are a nightmare because in most cases they are just somewhere in the full text but not in the "usual metadata".

+1 for the curators' point of view. However, as the EU is mentioned, given the OpenAIRE context etc...

"isbn" FIXME: this really needs to be an enum and cleaned up. What is Print?!

Print is the generic term for hardcover and paperback.

For ISBN there should probably be something like "formally known to be wrong". There are quite a few ISBNs with broken checksums out there. OK, this defeats the purpose of the checksum, but it might be a good idea to check ISBNs against the checksum unless one explicitly knows that the ISBN is wrong. This could also help cataloguers if it is checked upon input. For the checksum, see https://en.wikipedia.org/wiki/International_Standard_Book_Number#ISBN-10_check_digits
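A minimal sketch of that idea in Python, covering the ISBN-10 check digit only. The function names and the known_wrong flag are hypothetical and stand in for whatever "formally known to be wrong" marker the schema might get.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 check digit; hyphens and spaces are ignored."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char in "Xx" and position == 9:
            digit = 10  # "X" stands for 10 and is only allowed last
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        # Weights run from 10 down to 1; the sum must be divisible by 11
        total += (10 - position) * digit
    return total % 11 == 0


def accept_isbn(isbn: str, known_wrong: bool = False) -> bool:
    """Accept an ISBN if its checksum holds, or if a cataloguer has
    explicitly flagged it as printed wrongly (hypothetical flag)."""
    return known_wrong or isbn10_is_valid(isbn)


print(isbn10_is_valid("0-306-40615-2"))
print(accept_isbn("0-306-40615-3"))
print(accept_isbn("0-306-40615-3", known_wrong=True))
```

Keeping the flag separate from the value means the checksum still catches typing errors on input, while misprinted ISBNs that really appear on a book can be stored deliberately.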

"abstract"

Do we want to have all the abstracts of different arXiv-versions? If yes, we have to know which is the most recent.

Sounds like a version field.

"imprint" FIXME: an enum?

No. See above.

+1. Enum would not work IRL

"thesis_supervisor",

I think it should use the same object as "authors".

+1

Probably it is "personal name" + a role field (1001_ $a + $e). Probably $e as enum. (Though we also tend to move from enum to authorities @join2.)

"thesis" FIXME: shall we match these with the institution database? I guess so.

Yes! "university" should be the same object as "affiliation". There will also be special cases where degrees are not bestowed by a university.

In authors the description for affiliation reads "as it appears on the paper". This would be a non-normalized string. Probably something to rethink.

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

Depends on whether you treat this as part of the title. I.e., is "Physical Review / D" the title, or is it "Physical Review" with series "D"? It's a decision.

"publication_info", "page_range" FIXME: for ejournals this could be the page index, but there is no reliable way to know whether something is a page index or a first page, is there?

We need two different entries: "page_range" AND "article ID". There are cases where both exist at the same time! Of course it is not possible to make this distinction retroactively for all records. But with little effort one could do it for a very large fraction.

Is it worthwhile to consider "start page" / "end page"? I admit that I tend to treat article numbers as "start page" for practical purposes.

"publication_info" FIXME: shall we move the DOI and ISBN next to where it belongs? So that we can also align erratum and friends?

+1. I would very much like to have all publication infos together.

This could get complex. Book series, journals, conferences, multivolumes, publishers and places... Multivolume books in a book series being the special issue of a journal. My gut feeling is to split it into logical chunks.

"publication_info", "year"

Can in fact be more than 1 year: http://cis01.central.ucv.ro/pauc/vol/1994_1995_4_5/1994-1995_92-99.pdf

Also quite common for theses published as books later on (if those records are merged on INSPIRE, not sure).

"reference"

I would like to keep several lists of references, typically arXiv and publisher, as for the abstract.

Sounds like a source subfield.

For licence one could consider having some common ones as suggestions. (CC licences come to mind.) Is something like an enum with a free-form value possible?

Kind regards,

Alexander Wagner

Deutsches Elektronen-Synchrotron DESY Library and Documentation

fschwenn commented 9 years ago

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

Depends on whether you treat this as part of the title. I.e., is "Physical Review / D" the title, or is it "Physical Review" with series "D"? It's a decision.

For me, Phys.Rev.D is the journal, but for Nuovo Cimento A you have something like "Series 10" and "Series 11" with the same volume numbers within the series.

kaplun commented 8 years ago

Removing the milestone since this is no longer a blocker for Enabling search. It just needs to be polished little by little.

jalavik commented 8 years ago

@annetteholtkamp mentioned to me that it could be a good idea to have a "raw" affiliations field in the data model and use value as the transformed value. We seem to have both raw and treated affiliations in the same field now.

I cannot see any "raw" field in the author either, but there is raw_reference in references. Shall we decide on a general direction for this? E.g., shall we add raw fields like this?

"affiliations": {
    "uniqueItems": true,
    "items": {
        "type": "object",
        "properties": {
            "curated_relation": {
                "type": "boolean",
                "description": "Did a cataloguer proof-check the recid?",
                "title": "The affiliation is curated?"
            },
            "recid": {
                "type": "integer",
                "description": "Record ID in the Institution collection",
                "title": "Record ID of institution"
            },
            "value": {
                "type": "string",
                "description": "The transformed affiliation",
                "title": "Name of institution"
            },
            "raw": {
                "type": "string",
                "description": "The affiliation as it appears on the paper or original import",
                "title": "Name of institution"
            }
        },
        "title": "Affiliation"
    },
    "type": "array",
    "title": "Affiliations"
}

aw-bib commented 8 years ago

@jalavik there was some discussion under Sam's preliminary name of "gigantic workflow". I think the decision on @annetteholtkamp's point depends on the decision for this workflow.

bing13 commented 8 years ago

Retaining the original strings in an easily accessible form is a good safeguard against unforeseen future needs. Storage is cheap, labor is scarce.

kaplun commented 8 years ago

:+1: (of course case by case). I like the idea of standardizing on a value which is supposedly normalized against an external reference (e.g. affiliation against institution DB, conference ID against conf DB), vs. raw. In this way the model can be predictable:

raw: raw original string
value: normalized value against reference DB
recid: recid of the corresponding DB (1 to 1 with value)
record: corresponding linked record.

"record": {
    "$ref": "http://inspirehep.net/foo/123"
}
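The raw / value / recid / record convention above could look like this in code. This is a sketch only: the class name, the sample strings and the recid 123 are made up for illustration, and the base URL is the placeholder from the example above, not a confirmed resolver path.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder base URL taken from the example above; the real
# resolver path is not specified in this thread.
BASE_URL = "http://inspirehep.net/foo"


@dataclass
class NormalizedValue:
    raw: str                      # string as imported, never overwritten
    value: Optional[str] = None   # normalized against the reference DB
    recid: Optional[int] = None   # 1 to 1 with value

    def to_dict(self) -> dict:
        """Serialize, adding the "record" link only once a recid is known."""
        data = {"raw": self.raw}
        if self.value is not None:
            data["value"] = self.value
        if self.recid is not None:
            data["recid"] = self.recid
            data["record"] = {"$ref": f"{BASE_URL}/{self.recid}"}
        return data


matched = NormalizedValue(raw="CERN, Geneva, Switzerland", value="CERN", recid=123)
unmatched = NormalizedValue(raw="Some Unknown Lab")
print(matched.to_dict())
print(unmatched.to_dict())
```

An unmatched entry keeps only its raw string, so it can be picked up again by a later normalization pass.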

salmele commented 8 years ago

An additional way to look at this would be to allow normalization against more than one source, retaining a pointer to each. An example would be normalizing an institution against THREE sources: ISNI, recording that external ID; GRID.ac, recording that external ID; and whatever we have at that moment in time as the INSPIRE institution DB, retaining the recid.

In addition, for this particular example, we'd keep the raw string so that we can try at a later stage, programmatically, to normalize those which we failed to get right at ingestion, against those external sources as they increasingly add more institutes.

annetteholtkamp commented 8 years ago

Is it worthwhile to multiply the IDs in the HEP records? I'd think one would be sufficient; the others you can get via lookup in the inst collection. Or are you thinking of those cases where mapping to different standards may return different results?

Annette

salmele commented 8 years ago

The latter.

It is a bit like how today, for an author, we store INSPIRE ID, BAI, ORCID, GoogleScholar and whatnot. It might be that for an affiliation in a paper we'd have a hit which resolves in one service (e.g. ISNI) but not in another (e.g. GRID.ac), and we ourselves would have yet another way to say things in our own DB.

Mind that I'm not advocating we'd do it this way, but I'm advocating that it might be appropriate to have this at the individual record level, as look-ups might fail.

aw-bib commented 8 years ago

Mind that I'm not advocating we'd do it this way, but I'm advocating that it might be appropriate to have this at the individual record level, as look-ups might fail.

Wouldn't some kind of authority record that lives locally and serves these lookups be better than storing n+1 IDs at the bibliographic level? For search, one should be able to expand IDs from this authority record. Especially considering that a new ID might come along as time passes, one would just need to update one record and not all the bibliographic ones.
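As a rough sketch of that design: bibliographic records carry only the local recid, and the authority record holds every external ID. All record contents, identifier strings and the scheme name "NEWSCHEME" below are invented placeholders, not real ISNI or GRID values.

```python
# Authority records: one per institution, holding all external IDs.
authority_records = {
    1: {
        "name": "Example Institute",
        "external_ids": {"ISNI": "isni-0001", "GRID": "grid-0001"},
    },
}

# Bibliographic records point at the authority record by recid only.
bibliographic_records = [
    {"title": "Paper A", "affiliations": [{"recid": 1}]},
    {"title": "Paper B", "affiliations": [{"recid": 1}]},
]


def expand_ids(recid: int) -> dict:
    """Return every external ID known for an institution."""
    return dict(authority_records[recid]["external_ids"])


def search_by_external_id(scheme: str, identifier: str) -> list:
    """Find papers via an external ID, resolved through the authority file."""
    recids = {
        recid
        for recid, record in authority_records.items()
        if record["external_ids"].get(scheme) == identifier
    }
    return [
        bib["title"]
        for bib in bibliographic_records
        if any(aff["recid"] in recids for aff in bib["affiliations"])
    ]


print(search_by_external_id("GRID", "grid-0001"))

# A new identifier scheme appears: one authority record is updated,
# and both papers become findable by it without being touched.
authority_records[1]["external_ids"]["NEWSCHEME"] = "new-0001"
print(search_by_external_id("NEWSCHEME", "new-0001"))
```

The trade-off raised earlier in the thread still applies: this only works when the affiliation could be matched to a local authority record in the first place.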

kaplun commented 8 years ago

:+1:

If lookups are done using the raw string, then we should simply work on improving our tools that perform the automatic matching against the authority records (e.g. for journal records we store all the name variations, so if a HEP record is published in a journal that we can't match, this should raise a flag for a cataloguer to inspect the issue).

If we perform lookup via IDs (because the publisher has provided them), then this should work because we should maintain our authority records aligned with external DBs such as GRID.ac etc.
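The raw-string case could be sketched as follows. The journal table, the recid 101, the normalization rule and the needs_curation field name are all assumptions made for this example; the actual matching tools are not described in this thread.

```python
def normalize(text: str) -> str:
    """Crude normalization: lowercase, dots to spaces, collapse whitespace."""
    return " ".join(text.lower().replace(".", " ").split())


# Hypothetical journal authority records with their stored name variations.
JOURNALS = {
    101: ["Phys.Rev.D", "Physical Review D"],
}


def match_journal(raw: str) -> dict:
    """Match a raw journal string against stored name variations.

    Unmatched strings are kept and flagged so a cataloguer can inspect them.
    """
    key = normalize(raw)
    for recid, variants in JOURNALS.items():
        if key in {normalize(variant) for variant in variants}:
            return {"raw": raw, "recid": recid, "needs_curation": False}
    return {"raw": raw, "recid": None, "needs_curation": True}


print(match_journal("Phys. Rev. D"))
print(match_journal("Unknown Journal of Physics"))
```

Once a cataloguer resolves a flagged string, adding it as a new variation to the authority record lets later records with the same spelling match automatically.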

kaplun commented 8 years ago

Closing this as it is nowadays superseded by several dedicated issues.
