tetherless-world / nanomine-graph

the visualization web app for nanomine project
MIT License

Retrieving author affiliations for a given DOI? #26

Open mdeagen opened 4 years ago

mdeagen commented 4 years ago

Author affiliations for a DOI appear to be connected to the publisher, rather than the DOI itself.

Example SPARQL query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT * WHERE {
  <http://dx.doi.org/10.1016/j.eurpolymj.2008.06.015> dct:isPartOf [ dct:publisher [ prov:atLocation ?place ]]
}

Desired query result for this DOI is:

Department of Chemistry, Center for Nanotechnology at CYCU and R&D Center for Membrane Technology, Chung-Yuan Christian University, Chung Li 32023, Taiwan, ROC

Actual query result is 54 distinct place URIs within the knowledge graph that are connected to the same publisher URI, which in this case is publisher:elsevier.

jpmccu commented 4 years ago

I guess the XML structure implied that the location was for the publisher (to me), not the author(s). Usually an affiliation is associated per-author, not per paper. We (at least in our work) often have multi-institution papers (Nanomine being a perfect example).

-- Jim McCusker

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute mccusj2@rpi.edu mccusj@cs.rpi.edu http://tw.rpi.edu

mdeagen commented 4 years ago

You are correct, the affiliations within a DOI should be linked to the respective authors. However, the affiliation(s) for an author should be resolvable to a specific DOI (since author affiliations can change over time).

Should we bypass the XML and use an intelligent agent on the KG in this case? The DOI alone should be sufficient to curate the authors+affiliation information using an external DB (like SemanticScholar), or alternatively scraped from the DOI's URL.

jpmccu commented 4 years ago

I think we should be grabbing the metadata directly from the DOI linked data instead of using the XML data. It's actually got real identifiers for most authors, including orcids when available.

mdeagen commented 4 years ago

If there is no freely available DOI metadata API that meets our purposes, we may be able to adapt the doi-crawler that Bingyin developed (https://github.com/bingyinh/doi-crawler), a web scraper with configurations for several journal web pages. Instead of XML output, it would be configured to generate RDF directly.

How best to model the DOI-->AuthorURI-->AffiliationURI relationship?

Here is a recommendation from Dublin Core's citation guidelines: [image: image] https://user-images.githubusercontent.com/43749866/87589407-304cb300-c6b3-11ea-97e8-4a68fb6cca1f.png

However, this approach would not resolve individual author affiliations for a multi-author, multi-institution work. What would be the preferred predicate for AuthorURI-->AffiliationURI triples (purple dashed arrows below)?

[image: image] https://user-images.githubusercontent.com/43749866/87591721-bb7b7800-c6b6-11ea-9802-d4d6491812cc.png

jpmccu commented 4 years ago

It's simpler than that. Content negotiate text/turtle against http://dx.doi.org/{{doi}} and you'll get all of that.
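
A minimal sketch of that content negotiation in Python, using only the standard library. The function names are illustrative and not part of nanomine-graph; the only things taken from the thread are the http://dx.doi.org/ resolver URL and the text/turtle media type.

```python
import urllib.request


def build_turtle_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a GET request asking the DOI resolver for Turtle."""
    url = f"http://dx.doi.org/{doi}"
    return urllib.request.Request(url, headers={"Accept": "text/turtle"})


def fetch_turtle(doi: str, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text.

    urllib follows the resolver's redirects by default.
    """
    request = build_turtle_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_turtle("10.1109/TDEI.2014.004415") needs network access; whether Turtle comes back depends on the registration agency behind the DOI.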

mdeagen commented 4 years ago

Thanks for the tip! I wonder if we could import citation information into the KG using this method rather than converting from XML? (We would still need to do some federation of author URIs, but matching on first/last name, ignoring middle initial, could work as a first approximation...)
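
One way the first/last-name matching could look, as a rough sketch in Python. The name_key helper is hypothetical; it takes the foaf:givenName and foaf:familyName strings and returns a key that drops middle names and initials.

```python
import unicodedata
from typing import Tuple


def name_key(given_name: str, family_name: str) -> Tuple[str, str]:
    """Return a (first, last) key that ignores middle names/initials and case."""

    def fold(text: str) -> str:
        # Strip accents and case so spelling variants of a name compare equal.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()

    # Treat periods as separators so "Linda S." splits into "linda" and "s".
    given_tokens = fold(given_name).replace(".", " ").split()
    first = given_tokens[0] if given_tokens else ""
    last = " ".join(fold(family_name).split())
    return (first, last)
```

With this key, "Linda S." / "Schadler" and "Linda" / "Schadler" collapse to the same author. It is only a first approximation: a name such as "J. Keith" / "Nelson" reduces to the initial "j", and distinct people who share a first and last name would be merged.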

Do you know of a service that provides author institution/affiliation with a similar request? It looks like institutions are not part of CrossRef.

For example, the following request:

curl -LH "Accept: text/turtle;q=1.0" http://dx.doi.org/10.1109/TDEI.2014.004415 -o output.txt

returns this output:

<http://id.crossref.org/contributor/linda-s-schadler-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Schadler" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Linda S." ;
      <http://xmlns.com/foaf/0.1/name>
              "Linda S. Schadler" .

<http://id.crossref.org/contributor/brian-benicewicz-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Benicewicz" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Brian" ;
      <http://xmlns.com/foaf/0.1/name>
              "Brian Benicewicz" .

<http://id.crossref.org/issn/1070-9878>
      a       <http://purl.org/ontology/bibo/Journal> ;
      <http://prismstandard.org/namespaces/basic/2.1/issn>
              "1070-9878" ;
      <http://purl.org/dc/terms/title>
              "IEEE Transactions on Dielectrics and Electrical Insulation" ;
      <http://purl.org/ontology/bibo/issn>
              "1070-9878" ;
      <http://www.w3.org/2002/07/owl#sameAs>
              "urn:issn:1070-9878" .

<http://id.crossref.org/contributor/henrik-hillborg-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Hillborg" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Henrik" ;
      <http://xmlns.com/foaf/0.1/name>
              "Henrik Hillborg" .

<http://id.crossref.org/contributor/suvi-virtanen-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Virtanen" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Suvi" ;
      <http://xmlns.com/foaf/0.1/name>
              "Suvi Virtanen" .

<http://id.crossref.org/contributor/su-zhao-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Su Zhao" ;
      <http://xmlns.com/foaf/0.1/name>
              " Su Zhao" .

<http://id.crossref.org/contributor/timothy-m-krentz-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Krentz" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Timothy M." ;
      <http://xmlns.com/foaf/0.1/name>
              "Timothy M. Krentz" .

<http://id.crossref.org/contributor/j-keith-nelson-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Nelson" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "J. Keith" ;
      <http://xmlns.com/foaf/0.1/name>
              "J. Keith Nelson" .

<http://dx.doi.org/10.1109/TDEI.2014.004415>
      <http://prismstandard.org/namespaces/basic/2.1/doi>
              "10.1109/tdei.2014.004415" ;
      <http://prismstandard.org/namespaces/basic/2.1/endingPage>
              "570" ;
      <http://prismstandard.org/namespaces/basic/2.1/startingPage>
              "563" ;
      <http://prismstandard.org/namespaces/basic/2.1/volume>
              "21" ;
      <http://purl.org/dc/terms/creator>
              <http://id.crossref.org/contributor/brian-benicewicz-3u43yacan7302> , <http://id.crossref.org/contributor/linda-s-schadler-3u43yacan7302> , <http://id.crossref.org/contributor/henrik-hillborg-3u43yacan7302> , <http://id.crossref.org/contributor/suvi-virtanen-3u43yacan7302> , <http://id.crossref.org/contributor/j-keith-nelson-3u43yacan7302> , <http://id.crossref.org/contributor/timothy-m-krentz-3u43yacan7302> , <http://id.crossref.org/contributor/su-zhao-3u43yacan7302> , <http://id.crossref.org/contributor/michael-bell-3u43yacan7302> ;
      <http://purl.org/dc/terms/date>
              "2014-04"^^<http://www.w3.org/2001/XMLSchema#gYearMonth> ;
      <http://purl.org/dc/terms/identifier>
              "10.1109/tdei.2014.004415" ;
      <http://purl.org/dc/terms/isPartOf>
              <http://id.crossref.org/issn/1070-9878> ;
      <http://purl.org/dc/terms/publisher>
              "Institute of Electrical and Electronics Engineers (IEEE)" ;
      <http://purl.org/dc/terms/title>
              "Dielectric breakdown strength of epoxy bimodal-polymer-brush-grafted core functionalized silica nanocomposites" ;
      <http://purl.org/ontology/bibo/doi>
              "10.1109/tdei.2014.004415" ;
      <http://purl.org/ontology/bibo/pageEnd>
              "570" ;
      <http://purl.org/ontology/bibo/pageStart>
              "563" ;
      <http://purl.org/ontology/bibo/volume>
              "21" ;
      <http://www.w3.org/2002/07/owl#sameAs>
              <doi:10.1109/tdei.2014.004415> , <info:doi/10.1109/tdei.2014.004415> , <http://dx.doi.org/10.1109/tdei.2014.004415> .

<http://id.crossref.org/contributor/michael-bell-3u43yacan7302>
      a       <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/familyName>
              "Bell" ;
      <http://xmlns.com/foaf/0.1/givenName>
              "Michael" ;
      <http://xmlns.com/foaf/0.1/name>
              "Michael Bell" .

mdeagen commented 4 years ago

Follow-up on the concept map above... would prov:actedOnBehalfOf suffice for linking author plus reported affiliation?

Keeping author URIs from CrossRef could be beneficial as they are unique to the person and time of publication. If we only had global IDs such as ORCID, we would not be able to resolve author affiliation for a given DOI if, for example, the author had later moved to another institution they had collaborated with in an earlier publication.

As an example, returning a list of Authors and Affiliations for a given DOI:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT * WHERE {
  <http://dx.doi.org/10.1001/12345> dct:creator ?crossrefAuthURI ;
                                    dct:contributor ?affiliation .
  ?crossrefAuthURI prov:actedOnBehalfOf ?affiliation .
}

Where possible, CrossRef author URIs could be linked to their ORCIDs (using dct:identifier?). If no ORCID exists, we would revert to the NanoMine author URI.

Another example, returning a list of DOIs and Affiliations for a given Author based on their ORCID:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?doi ?affiliation WHERE {
  ?crossrefAuthURI dct:identifier <https://orcid.org/0000-12345> .
  ?doi dct:creator ?crossrefAuthURI ;
       dct:contributor ?affiliation .
  ?crossrefAuthURI prov:actedOnBehalfOf ?affiliation .
}

mdeagen commented 3 years ago

Following up on this issue, here is an example SPARQL query that shows the problem.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT * WHERE {
  ?doi a dct:BibliographicResource ;
       dct:isPartOf [ dct:title ?Journal; 
                      dct:publisher [ rdfs:label ?Publisher;
                                      prov:atLocation [ rdfs:label ?Location ] ] ] .
} VALUES ?doi { <http://dx.doi.org/10.1016/j.jeurceramsoc.2007.02.082> }

Because prov:atLocation stems from the node of a publisher URI, we lose the link between a ?doi and its ?Location (since multiple DOIs and/or journals can have the same publishing house).
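
To make the loss of the link concrete, here is a toy illustration in Python. It does not use the real knowledge graph: every identifier below (doi:A, journal:1, place:lab-of-A, and so on) is made up, and the follow helper is a hypothetical stand-in for a SPARQL property path.

```python
# Triples are (subject, predicate, object); all identifiers are made up.
SHARED = {
    ("doi:A", "dct:isPartOf", "journal:1"),
    ("doi:B", "dct:isPartOf", "journal:2"),
    ("journal:1", "dct:publisher", "publisher:elsevier"),
    ("journal:2", "dct:publisher", "publisher:elsevier"),
}

# Current modeling: author addresses hang off the shared publisher node.
CURRENT = SHARED | {
    ("publisher:elsevier", "prov:atLocation", "place:lab-of-A"),
    ("publisher:elsevier", "prov:atLocation", "place:lab-of-B"),
}

# Alternative modeling: each address hangs off its own DOI.
PER_DOI = SHARED | {
    ("doi:A", "prov:atLocation", "place:lab-of-A"),
    ("doi:B", "prov:atLocation", "place:lab-of-B"),
}


def follow(triples, start, path):
    """Follow a chain of predicates from start and return the set of end nodes."""
    frontier = {start}
    for predicate in path:
        frontier = {o for (s, p, o) in triples if s in frontier and p == predicate}
    return frontier


via_publisher = follow(CURRENT, "doi:A", ["dct:isPartOf", "dct:publisher", "prov:atLocation"])
direct = follow(PER_DOI, "doi:A", ["prov:atLocation"])
```

Here via_publisher contains both places, because doi:A and doi:B reach the same publisher node, while direct contains only the place attached to doi:A.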

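PROPOSED SOLUTION: Move the "prov:atLocation" clause in xml_ingest.setl.ttl (https://github.com/tetherless-world/nanomine-graph/blob/master/setl/xml_ingest.setl.ttl) up two levels, such that prov:atLocation extends directly from the dct:BibliographicResource.

[image: image] https://user-images.githubusercontent.com/43749866/131689291-94bff541-dbdf-4646-8434-b689330c1abc.png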

VERIFICATION: Use the following SPARQL query to verify:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT * WHERE {
  ?doi a dct:BibliographicResource ;
       dct:isPartOf [ dct:title ?Journal; 
                      dct:publisher [ rdfs:label ?Publisher ] ] ;
       prov:atLocation [ rdfs:label ?Location ]  .
} VALUES ?doi { <http://dx.doi.org/10.1016/j.jeurceramsoc.2007.02.082> }

The binding to ?Location should be the string "Microelectronics and Materials Physics Laboratories, EMPART Research Group of Infotech Oulu, P.O. Box 4500, FIN-90014 University of Oulu, Finland" to match the corresponding XML file (https://materialsmine.org/nmr/xml/L102_S6_Hu_2007?format=xml).

jpmccu commented 3 years ago

If the location is the city of the publisher, wouldn't it be weird to say that a paper has a location though?

-- Jamie McCusker (she/they)

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute http://tw.rpi.edu

mdeagen commented 3 years ago

The location being stored in the XML is not the city of the publisher. The xpath //Citation/CommonFields/Location is the affiliated author address populated into the XML. Theoretically there should be more than one (if so, we would need a for loop), but the scraper that populates the XML appears to only grab one, so the proposed fix should suffice for the current state of the XML representations.
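
If the scraper ever does populate more than one Location, the loop could look roughly like this sketch in Python. Only the Citation/CommonFields/Location path comes from this thread; the Record root element, the sample addresses, and the author_locations helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the real NanoMine XML has many more fields and a
# different root element.
SAMPLE_XML = """
<Record>
  <Citation>
    <CommonFields>
      <Location>Department A, University X</Location>
      <Location>Institute B, University Y</Location>
    </CommonFields>
  </Citation>
</Record>
"""


def author_locations(xml_text):
    """Return every non-empty Location under Citation/CommonFields, in document order."""
    root = ET.fromstring(xml_text)
    locations = []
    # ".//Citation" searches descendants of the root, so Citation must not be the root itself.
    for element in root.findall(".//Citation/CommonFields/Location"):
        text = (element.text or "").strip()
        if text:
            locations.append(text)
    return locations
```

Each returned string would then become its own prov:atLocation value on the dct:BibliographicResource, instead of a single value.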

jpmccu commented 3 years ago

Ah, then yes, moving it up makes sense. Things were ambiguous there.
