relaton / relaton-doi

Relaton-DOI: retrieve bibliographic items using DOI
MIT License

Find example of each bibtype on Crossref, for spec testing of gem #10

Closed opoudjis closed 1 year ago

opoudjis commented 1 year ago

Also confirm whether Crossref disambiguates journals with identical titles in entries for articles published in them.

Need maximum number of fields to test. In particular:

opoudjis commented 1 year ago

I will need to keep this issue assigned to myself, to review the generated XML.

For every Crossref type I'm giving an exemplar and (where available) how that citation is rendered by the Mouton stylesheet I use, following the Unified style sheet for linguistics (USAL). Absent such a citation, I give Crossref's own BIBTEX citation. (Their human-readable citations are all missing places of publication and publisher, so they are unusable.)

I have also found that Crossref systematically fails to apply the fine differences between types that it claims to make: edited books end up as monographs, peer reviews as journal articles, journal issues as journal articles.

I am looking up all the more obscure book types via https://api.crossref.org/types/{identifier}/works

Arguably a lot of these are edge cases, but having seen how uneven the quality of data in Crossref is, I really don't want to take any chances.

Heller, Monica. 2001. Gender and public space in a bilingual school. In Aneta Pavlenko, Adrian Blackledge, Ingrid Piller & Marya Teutsch-Dwyer (eds.), Multilingualism, second language learning, and gender (Language, Power and Social Process 6), 257–282. Berlin & New York: Mouton de Gruyter.

@incollection{2020,
    doi = {10.1215/9781478007609-047},
    url = {https://doi.org/10.1215%2F9781478007609-047},
    year = 2020,
    publisher = {Duke University Press},
    pages = {177--179},
    title = {Occupied Haiti (1915{\textendash}1934)},
    booktitle = {The Haiti Reader}
}
@incollection{Hubbard,
    doi = {10.14509/23007},
    url = {https://doi.org/10.14509%2F23007},
    publisher = {Alaska Division of Geological {\&} Geophysical Surveys},
    author = {T. D. Hubbard and M. L. Braun and R. E. Westbrook and P. E. Gallagher},
    title = {High-resolution lidar data for infrastructure corridors, Wiseman Quadrangle, Alaska},
    booktitle = {High-resolution lidar data for Alaska infrastructure corridors}
}
@misc{1,
    doi = {10.1787/20743300},
    url = {https://doi.org/10.1787%2F20743300},
    publisher = {{OECD}},
    title = {Chemical Thermodynamics}
}
@misc{1,
    doi = {10.7139/2017.978-1-56900-592-7},
    url = {https://doi.org/10.7139%2F2017.978-1-56900-592-7},
    publisher = {{AOTA} Press},
    editor = {Karen Jacobs and Judith Parker Kent and Albert Copolillo and Roger Ideishi and Shawn Phipps and Sarah McKinnon and Donna Costa and Nathan Herz and Guy McCormack and Lee Brandt and Karen Duddy},
    title = {Occupational Therapy Manager, 6th Ed}
}
@incollection{1,
    doi = {10.1017/isbn-9780511132971.eh132-135},
    url = {https://doi.org/10.1017%2Fisbn-9780511132971.eh132-135},
    publisher = {Cambridge University Press},
    pages = {5--795--5--795},
    editor = {Roger L. Ransom},
    title = {Wholesale commodity price indexes in Richmond, the eastern Confederacy, New York, and San Francisco: 1861-1865},
    booktitle = {Historical Statistics of the United States}
}
@book{Bouchard_2013,
    doi = {10.1093/acprof:oso/9780199681624.001.0001},
    url = {https://doi.org/10.1093%2Facprof%3Aoso%2F9780199681624.001.0001},
    year = 2013,
    month = {sep},
    publisher = {Oxford University Press},
    author = {Denis Bouchard},
    title = {The Nature and Origin of Language}
}
@misc{1,
    doi = {10.1371/journal.pone.0020476.s005},
    url = {https://doi.org/10.1371%2Fjournal.pone.0020476.s005},
    publisher = {Public Library of Science ({PLoS})}
}

(This suggests bad encoding, about which I don't think we can do much: this is a dataset, but with next to no information recoverable, including the researchers, the date, or the associated journal article.)

BIBTEX has frozen: RIS is:

TY  - GENERIC
DO  - 10.6019/pxd038478
UR  - http://dx.doi.org/10.6019/pxd038478
TI  - ProteomeXchange dataset
PB  - EMBL-EBI
ER  -
@misc{1,
    doi = {10.1163/2214-871x_ei1_sim_5628},
    url = {https://doi.org/10.1163%2F2214-871x_ei1_sim_5628},
    publisher = {Brill},
    title = {{TAʿIZZ}}
}
@phdthesis{Martins,
    doi = {10.11606/t.8.2017.tde-08052017-100442},
    url = {https://doi.org/10.11606%2Ft.8.2017.tde-08052017-100442},
    publisher = {Universidade de Sao Paulo, Agencia {USP} de Gestao da Informacao Academica ({AGUIA})},
    author = {Homero Moro Martins},
    title = {N{\'{o}}s temos nosso direito que {\'{e}} o certo: significados das lutas por reconhecimento entre comunidades do Vale do Ribeira, S{\~{a}}o Paulo}
}
BIBTEX rendering freezes

We have not modelled grants nor patents to date in relaton. We may need to yet; for now, just treat it as misc.

Neuman, Yair, Yotam Lurie & Michele Rosenthal. 2001. A watermelon without seeds: A case study in rhetorical rationality. Text 21(4). 543–565.

Majid, Asifa & Melissa Bowerman (eds.). 2007. Cutting and breaking events: A crosslinguistic perspective. [Special issue]. Cognitive Linguistics 18(2).

@misc{1991,
    doi = {10.1111/read.1991.25.issue-1},
    url = {https://doi.org/10.1111%2Fread.1991.25.issue-1},
    year = 1991,
    month = {apr},
    publisher = {Wiley},
    volume = {25},
    number = {1}
}

Change the Relaton mapping of journal-issue and journal-volume from journal to article

It turns out that https://api.crossref.org/v1/works/10.1515/cog.2007.005 is journal-issue, but is marked up as journal-article

@misc{1,
    doi = {10.46409/001.rlpt5688},
    url = {https://doi.org/10.46409%2F001.rlpt5688},
    publisher = {University of St. Augustine for Health Sciences Library},
    volume = {1}
}
@misc{1,
    doi = {10.46528/jk},
    url = {https://doi.org/10.46528%2Fjk},
    publisher = {{JUNI} {KHYAT}},
    title = {Juni Khyat Journal}
}

The first example is incorrectly marked up; it is in fact an edited book:

Aneta Pavlenko, Adrian Blackledge, Ingrid Piller & Marya Teutsch-Dwyer (eds.) 2001. Multilingualism, second language learning, and gender (Language, Power and Social Process 6). Berlin & New York: Mouton de Gruyter.

@book{Kuster_1852,
    doi = {10.5962/bhl.title.124254},
    url = {https://doi.org/10.5962%2Fbhl.title.124254},
    year = 1852,
    publisher = {Verlag von Bauer und Raspe (Julius Merz),},
    author = {H. C. Kuster and Johann Hieronymus Chemnitz and Friedrich Heinrich Wilhelm Martini},
    title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen /}
}
@misc{2022,
    doi = {10.1108/oxan-es268033},
    url = {https://doi.org/10.1108%2Foxan-es268033},
    year = 2022,
    month = {mar},
    publisher = {Emerald},
    title = {Extradition would boost image of Honduras's Castro}
}

It's "misc", so not much we can do about it

@misc{2021,
    doi = {10.1111/jan.15115/v3/decision1},
    url = {https://doi.org/10.1111%2Fjan.15115%2Fv3%2Fdecision1},
    year = 2021,
    month = {oct},
    publisher = {Wiley},
    editor = {Debra Jackson},
    title = {Decision letter for "Self-care of patients with multiple chronic conditions and their caregivers during the {COVID}-19 pandemic: A qualitative descriptive study"}
}
@article{Wiechert_2019,
    doi = {10.1101/751156},
    url = {https://doi.org/10.1101%2F751156},
    year = 2019,
    month = {aug},
    publisher = {Cold Spring Harbor Laboratory},
    author = {Johanna Wiechert and Andrei Filipchyk and Max Hünnefeld and Cornelia Gätgens and Ralf Heermann and Julia Frunzke},
    title = {Deciphering the rules underlying xenogeneic silencing and counter-silencing of Lsr2-like proteins}
}

Please change from erroneous mapping of "posted-content" to "social_media": that is not how this is being used by Crossref.

@inproceedings{Hikita,
    doi = {10.1109/icpadm.1994.414074},
    url = {https://doi.org/10.1109%2Ficpadm.1994.414074},
    publisher = {{IEEE}},
    author = {M. Hikita and H. Yamashita and T. Kato and N. Hayakawa and T. Ueda and H. Okubo},
    title = {Electromagnetic spectrum caused by partial discharge in air under {AC} and {DC} voltage application},
    booktitle = {Proceedings of 1994 4th International Conference on Properties and Applications of Dielectric Materials ({ICPADM})}
}
@misc{1,
    doi = {10.15405/epsbs(2357-1330).2021.6.1},
    url = {https://doi.org/10.15405%2Fepsbs%282357-1330%29.2021.6.1},
    publisher = {European Publisher},
    title = {European Proceedings of Social and Behavioural Sciences}
}
CITE NOT WORKING
@book{2004,
    doi = {10.1201/9781439864852},
    url = {https://doi.org/10.1201%2F9781439864852},
    year = 2004,
    month = {oct},
    publisher = {A K Peters/{CRC} Press},
    editor = {Ken Greenebaum and Ronen Barzel},
    title = {Audio Anecdotes {II}}
}
@misc{2007,
    doi = {10.1093/ww/9780199540884.013.u52741},
    url = {https://doi.org/10.1093%2Fww%2F9780199540884.013.u52741},
    year = 2007,
    month = {dec},
    publisher = {Oxford University Press},
    title = {Boxer, Rear-Adm. Henry Percy, (14 Oct. 1885{\textendash}30 June 1961)}
}

Note that this is to be treated just like book-chapter, and you will need to look up the details of the containing title item.

TY  - GENERIC
DO  - 10.3133/ofr72419
UR  - http://dx.doi.org/10.3133/ofr72419
TI  - Seismicity map of greater San Francisco Bay area, California (1969-1971)
T2  - Open-File Report
AU  - 
PY  - 1972
PB  - US Geological Survey
SN  - 2331-1258
ER  - 

Please change mapping report-component to dataset instead of techreport

@misc{2014,
    doi = {10.1787/5jxvk6shpvs4-en},
    url = {https://doi.org/10.1787%2F5jxvk6shpvs4-en},
    year = 2014,
    month = {oct},
    publisher = {Organisation for Economic Co-Operation and Development  ({OECD})},
    title = {Investment Treaties and Shareholder Claims: Analysis of Treaty Practice}
}
@techreport{1973,
    doi = {10.3133/i747},
    url = {https://doi.org/10.3133%2Fi747},
    year = 1973,
    publisher = {{US} Geological Survey},
    title = {Map showing areas of estimated relative amounts of landslides in California}
}
@misc{1,
    doi = {10.31030/2640440},
    url = {https://doi.org/10.31030%2F2640440},
    publisher = {Beuth Verlag {GmbH}},
    title = {{DIN} {ETS} 300218:1994-01, {ISDN}$\mathsemicolon$ Protokolle der unteren Schichten für Videotex auf Syntaxbasis für {ISDN}-Paketmodus ({CCITT}-Empfehlung{\_}X.31 Fall{\_}A und Fall{\_}B)$\mathsemicolon$ Englische Fassung {ETS}{\_}300218:1993}
}

DIN is the German national standards body. Note that ISO standards in Crossref are treated as books.

opoudjis commented 1 year ago

Note:

Crossref does not include monograph series; cf.

https://api.crossref.org/v1/works/10.1515/9783110889406

{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,10,29]],"date-time":"2022-10-29T02:25:37Z","timestamp":1667010337277},"reference-count":0,"publisher":"DE GRUYTER MOUTON","isbn-type":[{"value":"9783110170269","type":"print"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2001,12,31]]},"DOI":"10.1515\/9783110889406","type":"book","created":{"date-parts":[[2011,3,18]],"date-time":"2011-03-18T18:18:11Z","timestamp":1300472291000},"source":"Crossref","is-referenced-by-count":75,"title":["Multilingualism, Second Language Learning, and Gender"],"prefix":"10.1515","member":"374","published-online":{"date-parts":[[2001,12,31]]},"container-title":[],"original-title":[],"link":[{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9783110889406\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,4,21]],"date-time":"2021-04-21T06:55:06Z","timestamp":1618988106000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9783110889406\/html"}},"subtitle":[""],"editor":[{"given":"Aneta","family":"Pavlenko","sequence":"first","affiliation":[]},{"given":"Adrian","family":"Blackledge","sequence":"additional","affiliation":[]},{"given":"Ingrid","family":"Piller","sequence":"additional","affiliation":[]},{"given":"Marya","family":"Teutsch-Dwyer","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2001,12,31]]},"ISBN":["9783110170269"],"references-count":0,"alternative-id":["10.1515\/9783110889406"],"URL":"http:\/\/dx.doi.org\/10.1515\/9783110889406","relation":{},"published":{"date-parts":[[2001,12,31]]}}}

with

Heller, Monica. 2001. Gender and public space in a bilingual school. In Aneta Pavlenko, Adrian Blackledge, Ingrid Piller & Marya Teutsch-Dwyer (eds.), Multilingualism, second language learning, and gender (Language, Power and Social Process 6), 257–282. Berlin & New York: Mouton de Gruyter.

The monograph series and number, "Language, Power and Social Process 6", is not given in the source metadata.

opoudjis commented 1 year ago

We have a problem with the example book-part https://api.crossref.org/v1/works/10.1215/9781478007609-047

We need to look up the author of the containing work. If the contained work has no author or editor contributors, that information MUST be copied from the containing work.
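
The copying rule above can be sketched as follows. This is a minimal illustration, assuming both records have already been fetched and parsed into Ruby hashes shaped like Crossref's `message` object; `inherit_contributors` and `CONTRIBUTOR_KEYS` are hypothetical names, not part of relaton-doi, and the editor entries carry only the family names mentioned in this thread.

```ruby
# Hypothetical helper: give a contained item the authors/editors of its
# containing work when it has none of its own.
CONTRIBUTOR_KEYS = %w[author editor].freeze

def inherit_contributors(item, container)
  # Leave the item untouched if it already names any author or editor.
  return item if CONTRIBUTOR_KEYS.any? { |key| !Array(item[key]).empty? }

  inherited = container.slice(*CONTRIBUTOR_KEYS)
                       .reject { |_, people| Array(people).empty? }
  # merge returns a new hash, so the container keeps its own lists.
  item.merge(inherited)
end

chapter = { "type" => "book-part", "DOI" => "10.1215/9781478007609-047" }
book    = { "type" => "book", "title" => ["The Haiti Reader"],
            "editor" => [{ "family" => "Dubois" }, { "family" => "Glover" }] }

enriched = inherit_contributors(chapter, book)
```

The helper is deliberately non-destructive: the contributors end up on the contained item while remaining on the containing work.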

The search for the parent book in this instance has been really difficult, as https://api.crossref.org/works?query=The+Haiti+Reader+2020+Duke+University+Pres shows: the parent book is not in the top 20 results!!! For that reason, we may have to ignore their advice: if we don't get a "book" type result in the first few hits, we should do filtering by type:

https://api.crossref.org/works?query=The+Haiti+Reader+2020&filter=type:book

That does return the sought after item immediately.

Don't use just filter=type:book, because of the multiple variants they have encoded; use:

filter=type:book&filter=type:book-set&filter=type:edited-book&filter=type:monograph&filter=type:reference-book

The data is messy in Crossref, and we have to cope with it.

opoudjis commented 1 year ago

Note that if ANY bibliographic title (e.g. https://api.crossref.org/v1/works/10.1163/2214-871x_ei1_sim_5628 , a dataset) has a container_title, you have to look up the editors or authors of that item; that doesn't only apply to the types I've mapped to inbook, incollection etc, but also to dataset, which (as in this instance) can be published within a book, and to report-component, which is part of a report (so &filter=type:report).

opoudjis commented 1 year ago

The Unicode in DOI JSON is encoded with \u strings, e.g. in https://api.crossref.org/v1/works/10.11606/t.8.2017.tde-08052017-100442 , São Paulo is encoded as "S\u00e3o Paulo". Please convert all such Unicode to UTF-8.

opoudjis commented 1 year ago

We have a problem in that the place of publication is given in the chapters, e.g. https://api.crossref.org/v1/works/10.1515/9783110889406.257, but not in the edited volume! The same occurs with https://api.crossref.org/v1/works/10.1515/9780691229409 .

I'm reluctant to fix this right away, but this is going to result in us incorrectly not knowing where such books were published.

To resolve this, we just search on the title, but with filter=type:book-chapter.

(Sadly again, because of the multiple types, it is in fact

filter=type:book-chapter&filter=type:book-part&filter=type:book-section&filter=type:book-track

and if this is a reference-book, we search also on reference-entry.)

So:

https://api.crossref.org/works?query=The+New+Natural+History+of+Madagascar+2020+Princeton+University+Press&filter=type:book-chapter

That gives us a reference with publication-location, a field missing from the book item.

opoudjis commented 1 year ago

The journal volume https://api.crossref.org/v1/works/10.46409/001.rlpt5688 is marked up strangely: no title, but a container-title, which also includes item information

"container-title":["Fall 2020, Innaugural Issue","Student Journal of Occupational Therapy"]

I'm coming up with the following business rule for journal-issue and journal-volume:

The journal is "Student Journal of Occupational Therapy"; "Fall 2020, Innaugural Issue" is the title of the volume. (In our implementation, we treat it as the title of an article, that spans the entire volume.)
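
As a rough sketch of that business rule, assuming a record already parsed into a Ruby hash: the last `container-title` entry is taken as the journal and anything before it as the volume or issue title. That ordering is generalised from the single record quoted above, so it is an assumption rather than documented Crossref behaviour; `split_container_title` is a hypothetical name.

```ruby
# Hypothetical helper for journal-issue / journal-volume records.
def split_container_title(message)
  containers = Array(message["container-title"])
  own_title  = Array(message["title"]).first

  # Ordinary case: at most one container title, so nothing to split.
  return { journal: containers.first, title: own_title } if containers.length < 2

  # Assumption from one example: last entry is the journal,
  # the preceding entries describe the volume or issue.
  { journal: containers.last, title: own_title || containers[0..-2].join(", ") }
end

volume = { "type" => "journal-volume",
           "container-title" => ["Fall 2020, Innaugural Issue",
                                 "Student Journal of Occupational Therapy"] }
parts = split_container_title(volume)
```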

opoudjis commented 1 year ago

New Series disambiguation:

I think this is a lost cause based on the quality of data in Crossref.

https://www.cambridge.org/core/journals/journal-of-the-royal-asiatic-society

This journal has changed numbering four times:

1827: Transactions of the Royal Asiatic Society
1834: Journal of the Royal Asiatic Society of Great Britain and Ireland
1863: Journal of the Royal Asiatic Society of Great Britain and Ireland, New Series (i.e. start numbering again from vol 1)
1990: Journal of the Royal Asiatic Society

We could introduce disambiguation by series. We can't:

1907: Journal of the Royal Asiatic Society of Great Britain & Ireland
1888: Journal of the Royal Asiatic Society, vol 20 (counting from 1863)
1843: Journal of the Royal Asiatic Society, vol 7 (counting from 1834)

So in fact, we can't disambiguate the two runs of "Journal of the Royal Asiatic Society of Great Britain & Ireland", when most of the journals have the post-1990 journal title instead! In fact, I can't see that Crossref even has any entries at all for Journal of the Royal Asiatic Society as a journal, as opposed to the individual articles.

We'll just have to alert users: if they're aware of a New Series needing to be indicated for a journal, they should not trust DOI referencing. It's not the end of the world if they do (people will still work out what is going on from the dates), but it is not welcome.

opoudjis commented 1 year ago

On occasion, relations to other items are asserted in the record, though not nearly often enough. When they are, they involve other DOIs. This occurs in https://api.crossref.org/v1/works/10.1111/jan.15115/v3/decision1

We should action these. The allowed relations, and their relaton equivalents, are:

https://www.crossref.org/documentation/schema-library/markup-guide-metadata-segments/relationships/

(Note that in the JSON, the types are hyphenated instead of camel cased: is-cited-by instead of isCitedBy.)

Any relations not modelled in Relaton ("---") shall be rendered as "related", with the Crossref text in relation/description.

hasComplement replaces the erroneous complements in Relaton.
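
A small sketch of the two mechanical parts of this: turning the camel-cased names from the Crossref documentation into the hyphenated JSON keys, and falling back to "related" for anything unmodelled. The mapping table itself is not reproduced here; `modelled` is a caller-supplied hash, and both method names are hypothetical.

```ruby
# "isCitedBy" (documentation spelling) -> "is-cited-by" (JSON spelling).
def json_relation_key(doc_name)
  doc_name.gsub(/([a-z])([A-Z])/) { "#{$1}-#{$2.downcase}" }
end

# modelled: Hash of Crossref JSON key => Relaton relation type.
# Its contents are an assumption left to the caller.
def relaton_relation(crossref_key, modelled = {})
  if modelled.key?(crossref_key)
    { type: modelled[crossref_key] }
  else
    # Unmodelled ("---") relations keep the Crossref text as a description.
    { type: "related", description: crossref_key }
  end
end

key      = json_relation_key("isIdenticalTo")
relation = relaton_relation(key)
```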

opoudjis commented 1 year ago

For conferences (proceedings-article and proceedings), event.name, event.location, event.acronym may need to be modelled in Relaton, they currently are not; e.g. for https://api.crossref.org/v1/works/10.1109/icpadm.1994.414074

"event":{
         "name":"1994 4th International Conference on Properties and Applications of Dielectric Materials (ICPADM)",
         "location":"Brisbane, Qld., Australia",
         "acronym":"ICPADM-94"
      },
andrew2net commented 1 year ago
BIBTEX rendering freezes

@opoudjis the document has no title or container title. It has "project": ["project-title": ["title": "Single .... Can we use the project-title as a title?

andrew2net commented 1 year ago

Don't use just filter=type:book, because of the multiple variants they have encoded; use:

filter=type:book&filter=type:book-set&filter=type:edited-book&filter=type:monograph&filter=type:reference-book

@opoudjis when I use all the types in the filter the crossref uses only the last one, i.e. "reference-book".

UPD multiple filters usage is:

https://api.crossref.org/works?query=The+Haiti+Reader+2020&filter=type:book,type:book-set,type:edited-book,type:monograph,type:reference-book
opoudjis commented 1 year ago
BIBTEX rendering freezes

@opoudjis the document has no title or container title. It has "project": ["project-title": ["title": "Single .... Can we use the project-title as a title?

Yes. Grants are to receive funding for projects, so the project title is indeed the title. I haven't modelled grants at all, so for now we'll leave it at that, but if I come up with any good ideas I'll let you know when I review the results.

opoudjis commented 1 year ago

Don't use just filter=type:book, because of the multiple variants they have encoded; use: filter=type:book&filter=type:book-set&filter=type:edited-book&filter=type:monograph&filter=type:reference-book

@opoudjis when I use all the types in the filter the crossref uses only the last one, i.e. "reference-book".

UPD multiple filters usage is:

https://api.crossref.org/works?query=The+Haiti+Reader+2020&filter=type:book,type:book-set,type:edited-book,type:monograph,type:reference-book

Oh, I misread the documentation, I thought it said that multiple filter arguments with the same type meant multiple query arguments. Thank you for working that out.
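
For reference, a minimal sketch of building such a query with a single comma-separated `filter` parameter, using only Ruby's standard `uri` library. The constant and method names are hypothetical; the type lists are the ones given in this thread.

```ruby
require "uri"

# Type variants named in this thread for books and for book chapters.
BOOK_TYPES    = %w[book book-set edited-book monograph reference-book].freeze
CHAPTER_TYPES = %w[book-chapter book-part book-section book-track].freeze

# Build a /works search URL with all types in ONE filter parameter.
def works_query_url(query, types)
  filter = types.map { |type| "type:#{type}" }.join(",")
  "https://api.crossref.org/works?" +
    URI.encode_www_form(query: query, filter: filter)
end

url = works_query_url("The Haiti Reader 2020", BOOK_TYPES)
```

`URI.encode_www_form` percent-encodes the colons and commas (`%3A`, `%2C`); that is the same query once decoded by the server.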

andrew2net commented 1 year ago

@opoudjis is "journal-issue" => "article" correct? In my mapping, it's "journal-issue" => "journal"

opoudjis commented 1 year ago

@opoudjis is "journal-issue" => "article" correct? In my mapping, it's "journal-issue" => "journal"

Yes, I changed my mind:

Change the Relaton mapping of journal-issue and journal-volume from journal to article

andrew2net commented 1 year ago

The Unicode in DOI JSON is encoded with \u strings, e.g. in https://api.crossref.org/v1/works/10.11606/t.8.2017.tde-08052017-100442 , São Paulo is encoded as "S\u00e3o Paulo". Please convert all such Unicode to UTF-8.

As per https://gist.github.com/andrewblim/3d00314e45cbcb064623#unicode-and-utf-8, UTF-8 implements the Unicode standard, so there is no way to convert Unicode to UTF-8: they are the same. As described here https://www.honeybadger.io/blog/ruby-unicode-normalization/, there are two ways to write characters: as a single codepoint or as a composition. In this example "ã" is presented as a composition of "a" and "\u0303". We can convert it to a single codepoint, i.e. "\u00E3". Or we can remove the "\u0303" so that "ã" becomes just "a".

opoudjis commented 1 year ago

No, all I was asking was to ensure that the Unicode I get is UTF-8, with the \u strings resolved: I don't want to see "S\u00e3o", a string with 8 ASCII characters, in the payload, but "São", a string with 3 UTF-8 characters. If you're already doing that, good.
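
To illustrate the distinction being drawn here, a short sketch using Ruby's standard `json` library and `String#unicode_normalize`: parsing the JSON resolves the `\u` escapes, while normalization is the separate question of composed versus decomposed characters. The sample payload is a cut-down fragment, not the full Crossref record.

```ruby
require "json"

# Single quotes keep the backslash-u sequence literal, as it arrives on the wire.
raw     = '{"place":["S\u00e3o Paulo"]}'
escaped = 'S\u00e3o'                  # 8 ASCII characters
place   = JSON.parse(raw)["place"].first

# Separate issue: "a" + combining tilde composed into one codepoint (NFC).
composed = "a\u0303".unicode_normalize(:nfc)
```

After parsing, `place` is a UTF-8 string with the escape resolved to a single character, so no extra conversion step is needed once the payload has gone through a JSON parser.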

andrew2net commented 1 year ago

We'll just have to alert users: if they're aware of a New Series needing to be indicated for a journal, they should not trust DOI referencing. It's not the end of the world if they do (people will still work out what is going on from the dates), but it is not welcome.

@opoudjis should we warn users in a log when a title contains "of the Royal Asiatic Society " text?

andrew2net commented 1 year ago
  • isIdenticalTo: equivalent

@opoudjis we don't have an "equivalent" in the grammar. How should we handle the relation type?

opoudjis commented 1 year ago

We'll just have to alert users: if they're aware of a New Series needing to be indicated for a journal, they should not trust DOI referencing. It's not the end of the world if they do (people will still work out what is going on from the dates), but it is not welcome.

@opoudjis should we warn users in a log when a title contains "of the Royal Asiatic Society " text?

No. The point of this is, that Crossref does not give you enough information to know that the numbering of a journal has restarted from 1 (because it gets this instance so wrong), so you simply cannot implement the /series/run feature.

opoudjis commented 1 year ago
  • isIdenticalTo: equivalent

@opoudjis we don't have an "equivalent" in the grammar. How should we handle the relation type?

Treat it like the other "---":

Any relations not modelled in Relaton ("---") shall be rendered as "related", with the Crossref text in relation/description.

opoudjis commented 1 year ago

Hi Nick, please try the relaton-doi GitHub version. I've implemented all functionality and tests

opoudjis commented 1 year ago

book-chapter.xml 1:: /series is wrong to map from the host item title in book chapters. I have already pointed it out: the series of a book, and thus of a book chapter, is something distinct from the book title, and although Crossref does not include this information, other sources will. Remove it from book chapters: the book title must only be included in the host item. This is different behaviour from articles.

book-chapter.xml 2:: It is not true that city = Berlin and region = New York. That is a possible interpretation of "A, B" in places, but in this instance, the comma means "and" instead. Because of the bad quality of data here, do not parse the string, just put it in "city" unparsed.

book-chapter.xml 3:: I wanted to suggest changing the publisher name (DE GRUYTER MOUTON) from all caps; but we just cannot trust that there won't be acronyms in there. So leave it alone.

book-chapter.xml 4:: The source has "link":[{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9783110889406.257\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}]. However, I see from https://www.crossref.org/documentation/schema-library/markup-guide-metadata-segments/full-text-urls/ that the similarity-checking URIs are used for deduplication by Crossref, they are not actually intended for public use (and presumably won't resolve.) Accordingly, if you see any "intended-application":"similarity-checking" under "link", continue to IGNORE THAT URI.

book-part.xml 1:: /series is wrong to include in book parts, and to all items that map to inbook or inproceedings.

book-part.xml 2:: If an included item has no authors or editors, and the host item does, then you should copy the host item authors to the included item, as you have done here; but you must also keep the host item authors where they are. So Dubois, Glover, etc must appear as editors both of the chapter, and of the book "The Haiti Reader".

book-section.xml 1:: /series is wrong to include in book parts, and to all items that map to inbook or inproceedings.

book-section.xml 2:: "resource":{"primary":{"URL":"http:\/\/www.dggs.alaska.gov\/pubs\/id\/23007"}} contains a URL as well.

book-series.xml 1:: Again, URL in "resource":{"primary":{"URL":"https:\/\/www.oecd-ilibrary.org\/nuclear-energy\/chemical-thermodynamics_20743300"}}

book-series.xml 2:: I would include ISSN as a docidentifier: "ISSN":["2074-3300"]. You are already including ISBN. (I know book series should not have ISSNs... except, they are series. Of Books.)

book-set.xml 1:: URL in "resource":{"primary":{"URL":"https:\/\/library.aota.org\/Occupational-Therapy-Manager-6"}}

book-track.xml 1:: We can no longer trust that Place = City, Region, as we saw above with "Berlin, New York", so do not parse the city unless the region is completely clear as being a region (i.e., unless you are treating all caps as region abbreviations, or else have a checklist of countries like USA).

book-track.xml 2:: Do not repeat the book title as the series in any inbook item.

book-track.xml 3:: URL in "resource":{"primary":{"URL":"http:\/\/hsus.cambridge.org\/HSUSWeb\/toc\/tableToc.do?id=Eh132-135"}}

book-track.xml 4:: Editors are missing for the parent item: https://hsus.cambridge.org/HSUSWeb/toc/tableToc.do?id=Eh132-135 shows that the full citation is:

Ransom, Roger L., “Wholesale commodity price indexes in Richmond, the eastern Confederacy, New York, and San Francisco: 1861–1865.” Table Eh132-135 in Historical Statistics of the United States, Earliest Times to the Present: Millennial Edition, edited by Susan B. Carter, Scott Sigmund Gartner, Michael R. Haines, Alan L. Olmstead, Richard Sutch, and Gavin Wright. New York: Cambridge University Press, 2006. http://dx.doi.org/10.1017/ISBN-9780511132971.Eh111-193

... Unfortunately, I don't think that we can recover that information from Crossref. Can you confirm? If we can't, we can't...

book-track.xml 5:: We also don't seem to have the actual extent identifier, Table Eh132-135, in the Crossref record.

book.xml 1:: URL in "resource":{"primary":{"URL":"https:\/\/academic.oup.com\/book\/8112"}}

book.xml 2:: IGNORE the created date. Reading between the lines, it is clearly the date that the Crossref metadata record was created, not the date that the resource was created. (Confirmed in "Note on dates": "from-created-date:2016-02-29,until-created-date:2016-02-29 filters works first deposited on 29th February 2016".) That is why in this file, the created date is later than the published date! If it is present, use the "content-created" date for "created" instead. (I have not removed the created date consistently from all fixtures, but please do so.)

book_chapter_editiors.xml 1:: You have recorded the PDF URI. There is an additional URI: "resource":{"primary":{"URL":"http:\/\/content.apa.org\/books\/16096-016"}}
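
The date rule from book.xml 2 above can be sketched like this, assuming a parsed `message` hash. The list of date keys is illustrative rather than exhaustive, `resource_dates` is a hypothetical name, and the last-resort fallback to `created` is included as a precaution for records that carry no other date.

```ruby
# Date fields describing the resource itself (illustrative selection).
RESOURCE_DATE_KEYS = %w[issued published published-print
                        published-online content-created].freeze

def resource_dates(message)
  dates = {}
  RESOURCE_DATE_KEYS.each do |key|
    parts = message.dig(key, "date-parts", 0)
    dates[key] = parts unless parts.nil? || parts.compact.empty?
  end

  if dates.empty?
    # "created" is when the Crossref record was deposited: last resort only.
    parts = message.dig("created", "date-parts", 0)
    dates["created"] = parts if parts
  elsif dates.key?("content-created")
    # "content-created" is the real creation date of the resource.
    dates["created"] = dates.delete("content-created")
  end
  dates
end

book = { "published" => { "date-parts" => [[2001, 12, 31]] },
         "created"   => { "date-parts" => [[2011, 3, 18]] } }
dates = resource_dates(book)
```

With the sample above, the 2011 deposit date is dropped and only the 2001 publication date survives.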

opoudjis commented 1 year ago

In order to prevent confusion, I'm PR'ing the corrections to the fixtures.

opoudjis commented 1 year ago

component.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pone.0020476.s005"}}

crossref_bipm.xml 1:: Conversely, in articles, we do not want "includedIn", although it is not as harmful here as series is in inbook, inpresentation.

crossref_ieee.xml 1:: As an exception to what I said above, we can process place = City, Region, if the Region follows certain patterns: initials of US states are one of them, and in general two or three capital letters can be treated as a region abbreviation. Stripping USA, as you have been doing, is a nice touch!

crossref_ieee.xml 2:: We actually now have a counterpart to "standards-body": /contributor[role/@type = 'authorizer']

crossref_ieee.xml 3:: IEEE is a single string under publisher, and an acronym field under standards-body. That's just poor modelling and/or poor data entry, and don't bother trying to make it consistent, it will just be a known issue.

crossref_ieee.xml 4:: There is a URI in "resource":{"primary":{"URL":"https:\/\/ieeexplore.ieee.org\/document\/6835311"}}. I don't think the fact that it is of type unspecified should be preventing us from including it. As I noted already, ignore "link":[{"URL":"http:\/\/xplorestaging.ieee.org\/ielx7\/6835309\/6835310\/06835311.pdf?arnumber=6835311","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}]: we ignore similarity-checking links in general.

crossref_nist.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/nvlpubs.nist.gov\/nistpubs\/ir\/2019\/NIST.IR.8245.pdf"}}

crossref_nist.xml 2:: We have representation of the department of NIST responsible for the publication: "message":{"institution":[{"name":"National Institute of Standards and Technology","acronym":["NIST"],"place":["Gaithersburg, MD"],"department":["Information Technology Laboratory","Material Measurement Laboratory"]}]. We can put that inside the organisation representation in relaton, as /organization/subdivision. The problem is, I don't know whether we can copy that information across to the publisher: "message.institution" does not say anything clear to me, and the API (as opposed to the XML for submissions) is not particularly documented.

crossref_nist.xml 3:: We have a contributor type for funders now, "funder":[{"DOI":"10.13039\/100007764r","name":"Information Technology Laboratory","doi-asserted-by":"publisher"}]: contributor[role/@type = 'enabler']. This is poor data, given that the "Information Technology Laboratory" is in reality a department of NIST, but I think you should insert it anyway.

crossref_rfc.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/www.rfc-editor.org\/info\/rfc0001"}}

database.xml 1:: I said to ignore the created date, but don't ignore it if no other date is provided.

database.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/central.proteomexchange.org\/PXD038478"}}

dataset.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/referenceworks.brillonline.com\/entries\/encyclopaedia-of-islam-1\/*-SIM_5628"}}

dissertation.xml 1:: The authorizer is there in "message":{"institution":[{"name":"Universidade de S\u00e3o Paulo","place":["S\u00e3o Paulo"],"department":["Faculdade de Filosofia, Letras e Ci\u00eancias Humanas"]}] (it's the specific department of the university that the thesis was done in). But because message.institution is so vaguely defined, I don't think we can use it.

dissertation.xml 2:: The kind of dissertation is included in "degree":["Doutorado em Antropologia Social"]. Let's include that under /medium/genre.

dissertation.xml 3:: We have "resource":{"primary":{"URL":"http:\/\/www.teses.usp.br\/teses\/disponiveis\/8\/8134\/tde-08052017-100442\/"}} and "link":[{"URL":"http:\/\/www.teses.usp.br\/teses\/disponiveis\/8\/8134\/tde-08052017-100442\/publico\/2017_HomeroMoroMartins_VCorr.pdf","content-type":"application\/octet-stream","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/www.teses.usp.br\/teses\/disponiveis\/8\/8134\/tde-08052017-100442\/publico\/2017_HomeroMoroMartins_VCorr.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}]. As it turns out, we should disprefer links used for similarity checking or text mining, in favour of resource URIs: the similarity checking or text mining URIs (a) may not be public, (b) may not be stable, and (c) are likely not how publishers prefer to expose their works.
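
As a rough sketch of that preference (hypothetical helper, working on the parsed JSON "message" hash): take resource.primary.URL when it is there, and never pick a link whose intended-application is similarity-checking or text-mining. Falling back to some other kind of link when there is no resource URI is my assumption, not something decided above.

```ruby
# Links with these intended-application values are never used as the URI.
IGNORED_APPLICATIONS = %w[similarity-checking text-mining].freeze

# Hypothetical helper: choose one URI from a parsed Crossref message hash.
def preferred_uri(message)
  primary = message.dig("resource", "primary", "URL")
  return primary if primary && !primary.empty?

  # Assumption: any remaining link type is acceptable as a fallback.
  link = Array(message["link"]).find do |l|
    !IGNORED_APPLICATIONS.include?(l["intended-application"])
  end
  link && link["URL"]
end
```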

opoudjis commented 1 year ago

edited-book.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9780691229409\/html"}}

edited-book.xml 2:: It's great that you've retrieved the related-to item, but making the DOI be the formattedref is misleading; please look up the title of the relation DOI instead. (Yes, it is the same, but I'm not requiring you to provide any further information.)

grant.xml 1:: The investigator and lead investigator are the grant equivalent of authors: "investigator":[{"given":"Chengyuan","family":"Wang","affiliation":[]},{"given":"Harry","family":"Scott","affiliation":[]},{"given":"Irina","family":"Novikova","affiliation":[]},{"given":"Vadim","family":"Molodtsov","affiliation":[]},{"given":"Zhou","family":"Yin","affiliation":[]}],"lead-investigator":[{"given":"Richard","family":"Ebright","affiliation":[]}]. Put the investigator and lead-investigator as the role/description, and put the lead investigator first.
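
A sketch of the ordering, assuming the two arrays have already been located in the record and are passed in as a hash with the keys shown above. The output is plain hashes rather than relaton-bib objects, and using "author" as the role type is my reading of "the grant equivalent of authors"; `grant_contributors` is a hypothetical name.

```ruby
# Hypothetical helper: lead investigators first, then the other investigators.
# The Crossref key is kept as the role description.
def grant_contributors(record)
  lead = Array(record["lead-investigator"]).map { |p| [p, "lead-investigator"] }
  rest = Array(record["investigator"]).map { |p| [p, "investigator"] }
  (lead + rest).map do |person, description|
    { forename: person["given"], surname: person["family"],
      role: { type: "author", description: description } }
  end
end
```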

grant.xml 2:: The funding.funder "funding":[{"type":"award","funder":{"name":"US Department of Energy","id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}}] is the contributor of type "enabler".

grant.xml 3:: URI in "resource":{"primary":{"URL":"https:\/\/www.osti.gov\/award-doi-service\/biblio\/10.46936\/cpcy.proj.2019.50733\/60006578"}}

journal-article.xml 1:: You do not need to include relation[@type = 'includedIn'] for journal articles, as already noted.

journal-article.xml 2:: We know from the article PDF that it is pp 543–565: https://www.degruyter.com/document/doi/10.1515/text.2001.011/html . That information is nowhere to be found in the Crossref entry, so this is poor data quality, and we can't do anything about it.

journal-article.xml 3:: We have all-caps author names. In the case of authors as opposed to organisations, I think we can attempt to fix this. If a forename or surname is all-capitals and more than 2 characters long, make it titlecase. If there are multiple space-delimited strings in a name field, and any of them is more than 2 characters long, and not terminated by a period, do that to each. So "YU. JOSEPH" => "YU. Joseph". Yes, we will get incorrect cases like "MCDONALD" => "Mcdonald", but it's better than seeing a bunch of allcaps.
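
The rule could be sketched like this (hypothetical helper, for person names only). It applies the same test to every space-delimited part, so a single-word name is just the one-part case; treating the period exclusion as applying there too is my assumption.

```ruby
# Hypothetical helper: titlecase the all-caps parts of a forename or surname.
# A part is changed only if it is longer than 2 characters, does not end in
# a period, and contains no lowercase letters.
def fix_allcaps(name)
  name.split(" ").map do |part|
    allcaps = part == part.upcase && part.match?(/\p{Lu}/)
    if allcaps && part.length > 2 && !part.end_with?(".")
      part.capitalize
    else
      part
    end
  end.join(" ")
end
```

As warned, this gives "Mcdonald" for "MCDONALD"; it makes no attempt to recover internal capitals.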

journal-article.xml 4:: URI in "resource":{"primary":{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/text.2001.011\/html"}}

journal-issue-1.xml 1:: Again, from https://www.degruyter.com/document/doi/10.1515/COG.2007.005/html, the page numbers are 133-152, and this does not show up in the reference. (It only shows up in an article this article cites, and the citation is completely mangled.) This is again poor quality data.

journal-issue-2.xml 1:: URI is in "resource":{"primary":{"URL":"http:\/\/doi.wiley.com\/10.1111\/read.1991.25.issue-1"}}

journal-issue-2.xml 2:: In issues, we do not expect page numbers. Misleadingly, the URI landing page does refer to page numbers, but we will ignore that.

journal-issue-2.xml 3:: We have two ISSNs, one for print and one for electronic: "ISSN":["1741-4350","1741-4369"],"issn-type":[{"value":"1741-4350","type":"print"},{"value":"1741-4369","type":"electronic"}] Include both. We should be typing them as "ISSN.print" and "ISSN.electronic", but I'm reluctant to do that.

journal-volume.xml 1:: Oh dear. We have redundant includedIn and series as Student Journal of Occupational Therapy. We also have the journal volume title and the includedIn as Fall 2020, Innaugural Issue. Both includedIns are redundant, and need to go.

journal-volume.xml 2:: "short-container-title":["SJOT"], for journal articles (and volumes and issues and journals) maps to /series/abbreviation

journal-volume.xml 3:: URI in "resource":{"primary":{"URL":"https:\/\/soar.usa.edu\/sjot\/vol1\/"}}

journal.xml 1:: include ISSN: "ISSN":["2278-4632"],"issn-type":[{"value":"2278-4632","type":"print"}]

journal.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/www.junikhyat.com\/"}}

journal.xml 3:: "short-title":["Juni Khyat"], we might as well include as /title[@type = 'short']

opoudjis commented 1 year ago

monograph-1.xml 1:: As above, do not break place "Berlin, New York" down into city Berlin and region New York.

monograph-1.xml 2:: URI in "resource":{"primary":{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9783110889406\/html"}}

monograph-2.xml 1:: "publisher-location":"Nu\u0308rnberg :", and "publisher":"Verlag von Bauer und Raspe (Julius Merz),": Poor data, but could you clean it up? Detect trailing punctuation and space, and strip them?

monograph-2.xml 2:: "title":["Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen \/"]. I'm reluctant to say this, but.... yeah, we need to do that clean-up in titles as well. Trailing space and punctuation should be stripped from the end of titles too, and punctuation here includes the forward slash. (In the source data, the forward slash would have been a delimiter from other information.)
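
A sketch of the clean-up for place, publisher and title strings. Which characters count as strippable is an assumption: I have limited it to whitespace, comma, colon, semicolon and forward slash, so that a closing parenthesis as in "(Julius Merz)" and the final period of an abbreviation such as "e. V." survive. `strip_trailing_junk` is a hypothetical name.

```ruby
# Assumption: only these trailing characters are treated as cataloguing debris.
TRAILING_JUNK = %r{[\s,:;/]+\z}.freeze

# Hypothetical helper: remove trailing spaces and delimiter punctuation.
def strip_trailing_junk(str)
  str.sub(TRAILING_JUNK, "")
end
```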

monograph-2.xml 3:: URI in "resource":{"primary":{"URL":"http:\/\/www.biodiversitylibrary.org\/bibliography\/124254"}}

other.xml 1:: This is the same file as monograph-2, replace it with DOI: 10.1108/oxan-es268033, as specified above.

peer-review.xml 1:: We don't have a model for now for the information in "review":{"type":"editor-report","running-number":"E1V3","revision-round":"3","stage":"pre-publication"}, but we may come back to this.

peer-review.xml 2:: URI in "resource":{"primary":{"URL":"https:\/\/publons.com\/publon\/50403050"}}

peer-review.xml 3:: "editor":[{"family":"Debra Jackson","sequence":"first","affiliation":[]}] is poor quality data. We're going to be tempted to fix this, to forename Debra surname Jackson. I don't trust the data enough to do that.

peer-review.xml 4:: Again, I would rather put the title of the related item, than put the DOI as a formattedref.

peer-review.xml 5:: Meta-comment. This turns out to be just a letter from a reviewer to a journal about whether the article should be published. Where I come from, those are confidential, but I guess the world has changed. It's kind of silly to put a DOI on these, but at least that explains why there is no extent on the entry: it's published at the URI.

posted-content.xml 1:: The abstract has JATS tags, and while I don't blame you for ignoring them, the result is that they are being concatenated into the text. <jats:title>ABSTRACT<\/jats:title><jats:p>Lsr2-like => "ABSTRACTLsr2-like". We also have in<jats:italic>Mycobacterium tuberculosis<\/jats:italic>, so the data is clearly badly mangled, and is skipping spaces it should have. I don't want to make up rules for every random schema embedded in Crossref, and there are cases where we won't want to space-delimit them (H<sup>2</sup>O). But... ... no. The lack of space around italicised words is poor data. We will have to treat this as a known issue.

posted-content.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/biorxiv.org\/lookup\/doi\/10.1101\/751156"}}

posted-content.xml 3:: again, the relation DOI should not be used as a formattedref, you should supply the title instead

posted-content.xml 4:: As already requested, change the bibitem type from social-media to dataset

posted-content.xml 5:: I don't know what to do with "message":{"institution":[{"name":"bioRxiv"}] or "group-title":"Microbiology" for now. In addition, posted-content in this instance is a preprint, which I would regard as the same type as the published item, i.e. an article. Just putting here as a note.

opoudjis commented 1 year ago

proceedings-article.xml 1:: Remove series duplicating includedIn, as with inBook

proceedings-article.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/ieeexplore.ieee.org\/document\/414074\/"}}

proceedings-article.xml 3:: From the URI, the extent of this is vol. 2, pp. 570-573. Neither piece of information is included in the Crossref entry, so again, poor data in Crossref.

proceedings-article.xml 4:: There is data about the conference itself, which we are not (yet) using in relaton: "event":{"name":"1994 4th International Conference on Properties and Applications of Dielectric Materials (ICPADM)","location":"Brisbane, Qld., Australia","acronym":"ICPADM-94"}

proceedings-series.xml 1:: Again, there is data about the conference itself, which we are not (yet) using in relaton: "event":{"name":"Psychosocial Risks in Education and Quality Educational Processes","acronym":"CIPE 2020"}

proceedings-series.xml 2:: URI in "resource":{"primary":{"URL":"https:\/\/europeanproceedings.com\/book-series\/EpSBS\/books\/vol109-cipe-2020"}}

proceedings-series.xml 3:: ISSN in "ISSN":["2357-1330"],"issn-type":[{"value":"2357-1330","type":"print"}]

proceedings.xml 1:: Again, there is data about the conference itself, which we are not (yet) using in relaton: "event":{"name":"the 2011 International Conference","location":"Rourkela, Odisha, India","acronym":"ICCCS '11","number":"2011","start":{"date-parts":[[2011,2,12]]}

proceedings.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/portal.acm.org\/citation.cfm?doid=1947940"}}

reference-book.xml 1:: URI in "resource":{"primary":{"URL":"https:\/\/www.taylorfrancis.com\/books\/9781439864852"}}

reference-entry.xml 1:: Again, for an inbook reference, do not repeat the includedIn in the series.

reference-entry.xml 2:: URI in "resource":{"primary":{"URL":"http:\/\/www.ukwhoswho.com\/view\/10.1093\/ww\/9780199540891.001.0001\/ww-9780199540884-e-52741"}}

reference-entry.xml 3:: As far as I can tell, the API makes it completely impossible to retrieve the record for the host bibliographic item, Who Was Who, so we cannot establish who its editors are. Oxford University Press is not showing a lot of interest in making its online acquisition citable, either. Can't be fixed, poor data. (At least the National Library of Australia deals with this by considering it a serial, i.e. a journal: https://catalogue.nla.gov.au/Record/2020752)

report-component.xml 1:: URI in "resource":{"primary":{"URL":"http:\/\/pubs.er.usgs.gov\/publication\/ofr72419"}}

report-component.xml 2:: The Author is "author":[{"name":"U.S. Geological Survey","sequence":"first","affiliation":[]}]. This is currently translated into a person name, but if we don't have a surname, I think we should assume the author is an organisation.
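
In sketch form (plain hashes standing in for the relaton-bib person and organisation objects, and a hypothetical helper name):

```ruby
# Hypothetical helper: decide whether a Crossref author entry is a person
# or an organisation. An entry with "name" but no "family" is treated as
# an organisation.
def contributor_entity(author)
  if author["family"].to_s.empty? && author["name"]
    { type: :organization, name: author["name"] }
  else
    { type: :person, forename: author["given"], surname: author["family"] }
  end
end
```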

report-component.xml 3:: This is a component of a report, so we should treat the container-title as relation[role/@type = 'includedIn'] and not series. IN FACT, the provided field is series/name, and the alternative-id is the series/number, and this is a report and not a report component at all. But we don't know that in advance, and we have to take the provided type at face value; the includedIn assumption is safer for this bibliographic type. In general, Crossref has not properly modelled series at all, and we cannot trust container-title + alternative-id to give us series numbers.

report-component.xml 4:: ISSN in "ISSN":["2331-1258"],"issn-type":[{"value":"2331-1258","type":"print"}]

report-series.xml 1:: This... is infuriating. This item is a report, and the series is the container-title. But again, we have to take the bibliographic type at face value. For report-series, includedIn makes no sense, so make the container-title the series.

report-series.xml 2:: URI in "resource":{"primary":{"URL":"https:\/\/www.oecd-ilibrary.org\/finance-and-investment\/investment-treaties-and-shareholder-claims_5jxvk6shpvs4-en"}}

report-series.xml 3:: ISSN in "ISSN":["1815-1957"],"issn-type":[{"value":"1815-1957","type":"electronic"}]

report.xml 1:: URI in "resource":{"primary":{"URL":"http:\/\/pubs.er.usgs.gov\/publication\/i747"}}

standard.xml 1:: For standards, the standards-body is the authorizer contributor: "standards-body":{"name":"DIN Deutsches Institut f\u00fcr Normung e. V.","acronym":"DIN"}

standard.xml 2:: URI in "resource":{"primary":{"URL":"https:\/\/www.beuth.de\/de\/-\/-\/2204273"}}

andrew2net commented 1 year ago

crossref_ieee.xml 2:: We actually now have a counterpart to "standards-body": /contributor[role/@type = 'authorizer']

@opoudjis but in the fixture we have type='publisher'. Which one is correct?

opoudjis commented 1 year ago

Ah. Sorry. OK, anything in standards-body should be mapped to authorizer. If no publisher is present in Crossref, make the same organisation also be the publisher.
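
Roughly like this, as a sketch only: the output is plain hashes and `standard_contributors` is a hypothetical name, not relaton-bib API.

```ruby
# Hypothetical helper: map Crossref "standards-body" to an authorizer, and
# give the same organisation the publisher role when the record has no
# "publisher" of its own.
def standard_contributors(message)
  body = message["standards-body"]
  return [] unless body

  org = { name: body["name"], abbreviation: body["acronym"] }
  roles = ["authorizer"]
  roles << "publisher" if message["publisher"].to_s.empty?
  [{ organization: org, roles: roles }]
end
```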

andrew2net commented 1 year ago

book-track.xml 5:: We also don't seem to have the actual extent identifier, Table Eh132-135, in the Crossref record.

@opoudjis it could be parsed from DOI 10.1017/isbn-9780511132971.eh132-135, or from resource: { primary: { URL: 'http://hsus.cambridge.org/HSUSWeb/toc/tableToc.do?id=Eh132-135' } }, or from URL http://dx.doi.org/10.1017/isbn-9780511132971.eh132-135

opoudjis commented 1 year ago

But that's a one off, it's not going to be reliably the case in other records. No, we'll just have to tell people that it's poor quality data. I've already created functionality for users to add fields to the records fetched from Relaton.

andrew2net commented 1 year ago

@opoudjis please check the last update. Note that you need to update relaton-bib to v1.14.4.

opoudjis commented 1 year ago

Outstanding issues:

book-chapter.xml 2:: It is not true that city = Berlin and region = New York. That is a possible interpretation of "A, B" in places, but in this instance, the comma means "and" instead. Because of the bad quality of data here, do not parse the string, just put it in "city" unparsed.

You have parsed this as two cities:

<place>
    <city>Berlin</city>
</place>
<place>
    <city>New York</city>
</place>

My preference is not to parse it,

<place>
    <city>Berlin, New York</city>
</place>

simply because of how bad the quality of data is. But I'm not insisting for now.

book-series.xml 2:: I would include ISSN as a docidentifier: "ISSN":["2074-3300"]. You are already including ISBN. (I know book series should not have ISSNs... except, they are series. Of Books.)

I see you are using type ISSN.electronic on docidentifier. For now, we will leave that alone, rather than introducing an extra attribute on docidentifier to break up ISSN and electronic. We might do that later. But I will update metanorma to ignore ISSN.xxx as well as ISSN in Metanorma processing.
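
For reference, a sketch of that typing as I understand it, not the gem's actual code: each issn-type entry becomes a docidentifier typed "ISSN.print" or "ISSN.electronic", and the untyped "ISSN" array is used only when issn-type is absent. That fallback, the plain-hash output and the helper name are assumptions.

```ruby
# Hypothetical helper: build docidentifier-like hashes from a Crossref message.
def issn_identifiers(message)
  typed = Array(message["issn-type"]).map do |i|
    { id: i["value"], type: "ISSN.#{i['type']}" }
  end
  return typed unless typed.empty?

  # Assumption: with no issn-type, fall back to plain, untyped ISSNs.
  Array(message["ISSN"]).map { |v| { id: v, type: "ISSN" } }
end
```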

ronaldtse commented 1 year ago

I also prefer not using any capitalization in keys. issn over ISSN.

opoudjis commented 1 year ago

I also prefer not using any capitalization in keys. issn over ISSN.

Separate ticket plox.

opoudjis commented 1 year ago

Outstanding issues:

crossref_nist.xml 2:: We have a contributor type for funders now, "funder":[{"DOI":"10.13039\/100007764r","name":"Information Technology Laboratory","doi-asserted-by":"publisher"}]: contributor[role/@type = 'enabler']. This is poor data, given that the "Information Technology Laboratory" is in reality a department of NIST, but I think you should insert it anyway.

Nice-to-have, but would like this.

Thank you, @andrew2net, very high quality work!

opoudjis commented 1 year ago

@andrew2net Thank you very much for your work on this very challenging gem. Please release when you feel ready, and I'm now happy for you to close this ticket.