ropensci / handlr

convert among citation formats
https://docs.ropensci.org/handlr

Data loss on citeproc import #20

Open bwiernik opened 4 years ago

bwiernik commented 4 years ago
library(handlr)
cp_txt <- '[  {"id":"JolySilencetablemanners2008","accessed":{"date-parts":[[2019,10,27]]},"author":[{"family":"Joly","given":"Janneke F."},{"family":"Stapel","given":"Diederik A."},{"family":"Lindenberg","given":"Siegwart M."}],"container-title":"Personality and Social Psychology Bulletin","container-title-short":"Pers. Soc. Psychol. Bull.","DOI":"10.1177/0146167208318401","ISSN":"0146-1672, 1552-7433","issue":"8","issued":{"date-parts":[[2008,8]]},"language":"en","page":"1047-1056","references":"Retraction published 2012, <i>Personality and Social Psychology Bulletin, 38</i>[10], 1378, https://doi.org/10.1177/0146167212462821","source":"Crossref","title":"Silence and table manners: when environments activate norms","title-short":"Silence and table manners","type":"article-journal","URL":"http://journals.sagepub.com/doi/10.1177/0146167208318401","volume":"34"}]'
cp_parsed <- citeproc_reader(cp_txt)
names(cp_parsed)

When reading the Citeproc/CSL JSON format, handlr currently discards any valid CSL variables that are not part of its internal Crosscite format. This seems quite suboptimal, because it means that handlr can only work properly with Citeproc data for a small number of item types (pretty much just article-journal and webpage). For example, the genre and medium variables, which indicate the category of a report or thesis, are discarded, as is the editor variable, which is used for books and book chapters. In the example data I provide above, the variable references is discarded.

If I were to generate a reference for this item using the American Psychological Association CSL style, it would be: Joly, J. F., Stapel, D. A., & Lindenberg, S. M. (2008). Silence and table manners: When environments activate norms. Personality and Social Psychology Bulletin, 34(8), 1047–1056. https://doi.org/10.1177/0146167208318401 (Retraction published 2012, Personality and Social Psychology Bulletin, 38[10], 1378, https://doi.org/10.1177/0146167212462821)

However, if I import the item to handlr, export to CSL JSON again, and render the citation, it's: Joly, J. F., Stapel, D. A., & Lindenberg, S. M. (2008). Silence and table manners: When environments activate norms. Personality and Social Psychology Bulletin, 34(8), 1047–1056. https://doi.org/10.1177/0146167208318401

The retraction information has been lost.
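
One way to spot this kind of loss is to compare the top-level variable names of the original CSL item with those of the round-tripped item. The following is a minimal sketch using plain named lists, so it runs without handlr. `lost_fields()` is a hypothetical helper, not a handlr function, and both lists are trimmed stand-ins for the item above, with the round-tripped one assumed to lack `references` as described.

```r
# Hypothetical helper (not part of handlr): report which top-level
# CSL variables are present in `original` but absent from `roundtrip`.
lost_fields <- function(original, roundtrip) {
  setdiff(names(original), names(roundtrip))
}

# Trimmed stand-in for the CSL item above, as a named list.
original <- list(
  id = "JolySilencetablemanners2008",
  type = "article-journal",
  title = "Silence and table manners: when environments activate norms",
  volume = "34",
  references = "Retraction published 2012"
)

# Assumed for illustration: the same item after import and re-export,
# with `references` dropped.
roundtrip <- original[setdiff(names(original), "references")]

lost_fields(original, roundtrip)
```

In practice the two lists would come from parsing the CSL JSON before import and after export.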

Other variables, such as annote, genre, note, medium, collection-title, number, and illustrator, are also all discarded on import. For item types and fields that don't have a Crosscite analogue, it seems like it would be wise to store these in the item data (e.g., as csl_note, csl_medium) and map them to other formats at translation time as needed.

sckott commented 4 years ago

thanks for this @bwiernik - I definitely want to improve the citeproc reader/writer.

it seems like it would be wise to store these in the item data (e.g., as csl_note, csl_medium) and map them to other formats at translation time

Can you explain what you mean here? I'm not sure I follow. What are csl_note and csl_medium?

bwiernik commented 4 years ago

By the way, I'm working on a package cslr that creates a class for citeproc-formatted data, similar to the BibEntry class in RefManageR, and provides import, management, sorting, and citation tools.

The list of CSL variables is given here: https://aurimasv.github.io/z2csl/typeMap.xml. My suggestion is that, if there are fields that don't fit into the Crosscite format, they should be stored rather than dropped. For example, handlr currently discards medium from a citeproc JSON object if it is provided. Instead, I would recommend that such fields be stored with the prefix csl_ to indicate that they come from citeproc. So, for example, if a citeproc file specifies something for medium, that could be stored in the field csl_medium in the handl object.

For example:

[
  {"id":"CuttsHappiness2017",
    "abstract":"[truncated]",
    "accessed":{
      "date-parts":[[2019,10,26]]
      },
    "dimensions":"PT00H04M16S",
    "director":[{"family":"Cutts","given":"Steve"}],
    "issued":{
      "date-parts":[[2017,11,24]]
    },
    "medium":"Video",
    "publisher":"Vimeo",
    "source":"Vimeo",
    "title":"Happiness",
    "type":"motion_picture",
    "URL":"https://vimeo.com/244405542"}
]

Here, accessed, dimensions, director, medium, source, and URL will get dropped. These should either be mapped to appropriate fields (e.g., URL to b_url, director to author with a field to indicate the creator type) or stored as CSL-specific fields (e.g., csl_dimensions, csl_medium, csl_source).
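
The "map or store with a csl_ prefix" idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `keep_csl_fields()` and `csl_map` are hypothetical names, not handlr functions, and of the mappings shown only URL to b_url comes from the comment above; the item is a trimmed version of the example.

```r
# Hypothetical mapping from CSL variable names to internal field names.
# Only `URL -> b_url` is taken from the discussion; the rest are assumptions.
csl_map <- c(title = "title", publisher = "publisher", URL = "b_url")

# Rename mapped variables and keep everything else under a `csl_` prefix
# instead of dropping it.
keep_csl_fields <- function(item, map = csl_map) {
  nms <- names(item)
  mapped <- nms %in% names(map)
  names(item)[mapped] <- unname(map[nms[mapped]])
  names(item)[!mapped] <- paste0("csl_", nms[!mapped])
  item
}

# Trimmed version of the motion_picture item above.
item <- list(
  title = "Happiness",
  publisher = "Vimeo",
  medium = "Video",
  dimensions = "PT00H04M16S",
  URL = "https://vimeo.com/244405542"
)

out <- keep_csl_fields(item)
names(out)
# "title" "publisher" "csl_medium" "csl_dimensions" "b_url"
```

Because nothing is dropped, a writer for another format can later decide what to do with the csl_-prefixed fields.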

A similar argument could be made for fields that are specific to biblatex, bibtex, or other formats and not represented in the Crosscite schema. In general, I think it would make sense to create a table that cross-references the fields for each data format (e.g., biblatex_urldate <-> csl_accessed). This table could then be used when converting fields from one data format to another, which could give greater conversion fidelity than relying on the limits of any one data format.
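
As a rough sketch of what such a cross-reference table might look like, the example below stores one row per concept and one column per format, then uses it to translate field names. The rows, the `crosswalk` name, and the `translate_fields()` helper are all assumptions made for illustration; they are not handlr's actual table or API.

```r
# Illustrative crosswalk: one row per concept, one column per format.
# The rows are examples chosen for this sketch, not handlr's actual table.
# NA marks a field assumed to have no counterpart in that format.
crosswalk <- data.frame(
  csl      = c("accessed", "container-title", "author"),
  biblatex = c("urldate",  "journaltitle",    "author"),
  bibtex   = c(NA,         "journal",         "author"),
  stringsAsFactors = FALSE
)

# Translate field names from one format to another. Fields with no
# counterpart keep their name with a source-format prefix (e.g. csl_medium).
translate_fields <- function(fields, from, to, table = crosswalk) {
  idx <- match(fields, table[[from]])
  out <- table[[to]][idx]
  missing <- is.na(out)
  out[missing] <- paste0(from, "_", fields[missing])
  out
}

translate_fields(c("accessed", "medium"), from = "csl", to = "biblatex")
# "urldate" "csl_medium"
translate_fields(c("urldate", "journaltitle"), from = "biblatex", to = "bibtex")
# "biblatex_urldate" "journal"
```

Keeping the table as data rather than code means new formats can be added as columns without touching the translation logic.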

I am happy to help create such a table for the formats handlr currently supports.

sckott commented 4 years ago

looks like you forgot to finish a thought:

So for example, if a citeproc

bwiernik commented 4 years ago

Sorry, fixed that.

sckott commented 4 years ago

thanks for the fix.

I think it would make sense to create a table that cross-references the fields for each data format

As you've probably seen, we do have some named lists, e.g., https://github.com/ropensci/handlr/blob/master/R/translations.R, as converters between formats. A table would be good though.

I agree about not dropping fields, and assigning them a csl_ prefix.

sckott commented 4 years ago

@bwiernik Are you still interested in making that table?

bwiernik commented 4 years ago

Yes, I'm hoping to get to it in the next week or two.

sckott commented 4 years ago

Okay, thanks

sckott commented 4 years ago

notes:

google spreadsheet started in https://docs.google.com/spreadsheets/d/1p1XaEtTBU_CmZba0P8nGpIlqAS2A8r4ZUs-WJarKUxo/edit#gid=0 - then move to the package when more stable

bwiernik commented 4 years ago
sckott commented 4 years ago

thanks! Do you know where to get a complete list of JATS types?

bwiernik commented 4 years ago

There is the full list in the JATS spec https://groups.niso.org/apps/group_public/download.php/21030/ANSI-NISO-Z39.96-2019.pdf

sckott commented 4 years ago

I don't see a full list in there for @publication-type. It only says on page 276:

Category of publication being cited (for example, “book”, “letter”, “review”, “journal”, “patent”, “report”, “standard”, “data”, “working-paper”).

bwiernik commented 4 years ago

Oh I see what you mean. Hmm. I’m not sure there is a formal list anywhere. Probably the best option would be to compile the converter programs, such as those listed in the Wiki article here https://en.wikipedia.org/wiki/Journal_Article_Tag_Suite, and see what conventions have emerged.

sckott commented 4 years ago

Okay, thanks - not sure we need to include JATS, but if it's easy enough to do, it seems worth it

sckott commented 3 years ago

There's better support for citeproc now. I'm sure it could be better, but I need to submit a new version for other reasons, so I'm moving this to the next milestone. We still need to finish the crosswalk spreadsheet linked above, covering all formats, and then implement the conversions here using it.