Exposing container-title metadata for Manubot HTML pages

dhimmel commented 5 years ago

Currently, citing a Manubot URL returns CSL JSON like

[
  {
    "type": "webpage",
    "title": "Manubot Rootstock: Manuscript Title",
    "URL": "https://greenelab.github.io/manubot-rootstock/v/658bcd763deae50732867a471d38760a90b13641/",
    "shortTitle": "Manubot Rootstock",
    "language": "en-US",
    "author": [
      {
        "family": "Doe",
        "given": "John"
      },
      {
        "family": "Roe",
        "given": "Jane"
      }
    ],
    "issued": {
      "date-parts": [
        [
          "2019",
          2,
          1
        ]
      ]
    },
    "accessed": {
      "date-parts": [
        [
          "2019",
          2,
          6
        ]
      ]
    },
    "id": "s7XRFgWm"
  }
]

Command to create:

manubot cite url:https://greenelab.github.io/manubot-rootstock/v/658bcd763deae50732867a471d38760a90b13641/

With the current manubot cite command, this metadata is retrieved from Zotero's translation-server. Pandoc encodes most of the information picked up by translation-server. One field that does not get set in the CSL JSON is container-title. For example, we could set container-title to equal "Manubot Preprint" by setting a <meta> field in the HTML head.

We'd probably want to use Pandoc's --include-in-header to insert these <meta> statements. Are there other fields besides CSL JSON's container-title we want to set?
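A rough sketch of how the contents of such a header include could be generated. The tag name `citation_journal_title` is an assumption borrowed from the Google Scholar convention (bioRxiv uses it, as shown later in this thread); whether translation-server maps it to container-title for a Manubot page is untested here. The helper name `build_meta_tags` is hypothetical.

```python
from html import escape


def build_meta_tags(fields):
    """Render (name, content) pairs as HTML <meta> lines for the document head."""
    lines = []
    for name, content in fields:
        # quote=True also escapes double quotes so each attribute stays well-formed.
        safe_name = escape(name, quote=True)
        safe_content = escape(content, quote=True)
        lines.append(f'<meta name="{safe_name}" content="{safe_content}" />')
    return "\n".join(lines)


# Hypothetical value: the container name is still under discussion in this issue.
header = build_meta_tags([
    ("citation_journal_title", "Manubot Preprint"),
])
print(header)
```

The printed snippet could be saved to a file and passed to Pandoc with `--include-in-header=FILE`.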

slochower commented 5 years ago

For example, we could set container-title to equal "Manubot Preprint" by setting a <meta> field in the HTML head.

What is the purpose of having "Manubot Preprint" as the container?

agitter commented 5 years ago

@slochower one purpose would be for Manubot manuscripts citing other Manubot manuscripts. For example, in this manuscript the reference looks like:

"Manubot Preprint" may signal to a reader this is an HTML manuscript as opposed to a different type of web citation.

@dhimmel I support adding the container-title. I can't think of any other appropriate fields. Volume, issue, page numbers, etc. don't apply.

The issue title implies this would only affect the HTML version of the manuscript, right?

slochower commented 5 years ago

What other things go into container title? For example, does “bioRxiv preprint” ever appear in container title?

dhimmel commented 5 years ago

Here's the output of manubot cite doi:10.1101/515643 (a bioRxiv preprint):

```json
[
  {
    "publisher": "Cold Spring Harbor Laboratory",
    "abstract": "Researchers in the life sciences are posting their work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the popularity and practical benefits of preprints are driving policy changes at journals and funding organizations, there is little bibliometric data available to measure trends in their usage. Here, we collected and analyzed data on all 37,648 preprints that were uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find that preprints on bioRxiv are being read more than ever before (1.1 million downloads in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of more than 2,100 per month. We also find that two-thirds of bioRxiv preprints posted in 2016 or earlier were later published in peer-reviewed journals, and that the majority of published preprints appeared in a journal less than six months after being posted. We evaluate which journals have published the most preprints, and find that preprints with more downloads are likely to be published in journals with a higher impact factor. Lastly, we developed Rxivist.org, a website for downloading and interacting programmatically with indexed metadata on bioRxiv preprints.",
    "DOI": "10.1101/515643",
    "type": "manuscript",
    "source": "Crossref",
    "title": "Tracking the popularity and outcomes of all bioRxiv preprints",
    "author": [
      {
        "given": "Richard J.",
        "family": "Abdill"
      },
      {
        "given": "Ran",
        "family": "Blekhman"
      }
    ],
    "issued": {
      "date-parts": [
        [
          2019,
          1,
          13
        ]
      ]
    },
    "URL": "https://doi.org/gftzwz",
    "id": "IYwQbTVz"
  }
]
```

Note that container-title is not set, although we think this is probably a bioRxiv bug (see https://github.com/manubot/manubot/issues/16).
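For illustration, a small check for fields that are absent from a CSL JSON record. The record is a trimmed copy of the output above; the helper name `missing_fields` and the list of expected fields are assumptions made for this sketch, not part of manubot.

```python
import json

# Trimmed copy of the record above, keeping only a few fields for illustration.
csl_json = """
[
  {
    "publisher": "Cold Spring Harbor Laboratory",
    "DOI": "10.1101/515643",
    "type": "manuscript",
    "title": "Tracking the popularity and outcomes of all bioRxiv preprints"
  }
]
"""


def missing_fields(item, expected=("container-title", "publisher")):
    """Return the expected CSL fields that are absent or empty in the item."""
    return [field for field in expected if not item.get(field)]


for item in json.loads(csl_json):
    print(item["DOI"], missing_fields(item))  # 10.1101/515643 ['container-title']
```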

The CSL docs define container-title as:

title of the container holding the item (e.g. the book title for a book chapter, the journal title for a journal article)

However, it's important to note that we're not directly setting container-title. Instead, we are setting metadata fields that will get picked up by Zotero and populate certain Zotero metadata fields that will then get exported as container-title in CSL.

Perhaps instead or in addition, we want the CSL publisher field to be set to "Manubot"?

agitter commented 5 years ago

Perhaps instead or in addition, we want the CSL publisher field to be set to "Manubot"?

I'm not sure that I think of Manubot as a publisher. The journal is closer to my interpretation of what the "Manubot Preprint" should be.

The comparable meta field in a bioRxiv preprint is: <meta name="citation_journal_title" content="bioRxiv" />

This conversation prompted an idea for a workaround for the bug in https://github.com/manubot/manubot/issues/16. I'll post it there to keep this discussion focused.

dhimmel commented 5 years ago

I'm not sure that I think of Manubot as a publisher. The journal is closer to my interpretation of what the "Manubot Preprint" should be.

My only worry is whether all Manubot documents are "preprints". The user can always change the value if not, perhaps to "Manubot Document" or just "Manubot".

I do think Manubot is sort of the publisher. Perhaps "GitHub Pages" is the publisher or the source manuscript's GitHub account. We don't necessarily have to set metadata.

Currently, here is how the Meta Review shows up on Google Scholar:

The bibtex from Google Scholar is:

@article{himmelsteinopen,
  title={Open collaborative writing with Manubot},
  author={Himmelstein, Daniel S and Slochower, David R and Malladi, Venkat S and Greene, Casey S and Gitter, Anthony}
}

We can also look into getting the publication date set.
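A minimal sketch of turning the CSL `issued` value into a date string for a meta tag. The slash-separated format follows what bioRxiv emits for `citation_publication_date` (shown later in this thread); the function name `csl_date_to_string` is hypothetical, and which tag to use is still an open question.

```python
def csl_date_to_string(issued, sep="/"):
    """Join the first CSL date-parts entry as zero-padded YYYY/MM/DD (or shorter)."""
    parts = issued["date-parts"][0]
    # Parts may be ints or strings (the CSL JSON above has the year as "2019").
    year, *rest = [int(part) for part in parts]
    return sep.join([f"{year:04d}"] + [f"{value:02d}" for value in rest])


issued = {"date-parts": [["2019", 2, 1]]}
print(csl_date_to_string(issued))  # 2019/02/01
```

Passing `sep="-"` would give the ISO-style form that Pandoc already writes to `dcterms.date`.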

agitter commented 5 years ago

You're right that not all Manubot documents are preprints. "Manubot", "Manubot Document", or "Manubot Manuscript" (though not everything is a manuscript either) would be better.

👍 on setting the publication date as well.

slochower commented 5 years ago

I guess I'm confused by the analogy of "Manubot" as a container. I don't really think "Manubot" or even "Manubot Document" performs the same role as a book or a journal. In those cases, the container acts like an index, where you can find similar or related things. I don't have strong feelings on this, though.

dhimmel commented 5 years ago

PeerJ Preprints

view-source:https://peerj.com/preprints/27506/

PeerJ Preprints uses the Google Scholar meta tags and does not set Dublin Core meta tags:

<meta name="citation_title" content="Ten simple rules for better running.">
<meta name="citation_date" content="2019-01-30">
<meta name="citation_doi" content="10.7287/peerj.preprints.27506v1">
<meta name="citation_language" content="en">
<meta name="citation_pdf_url" content="https://peerj.com/preprints/27506.pdf">
<meta name="citation_fulltext_html_url" content="https://peerj.com/preprints/27506">
<meta name="citation_technical_report_number" content="e27506v1">
<meta name="citation_keywords" content="systematic reviews; meta-analysis; synthesis; running research; sport sciences; training; randomized control trials; principles; health">
<meta name="citation_technical_report_institution" content="PeerJ Preprints">
<meta name="citation_publisher" content="PeerJ Inc.">
<meta name="citation_issn" content="2167-9843">
<meta name="citation_author" content="Christopher J Lortie">
<meta name="citation_author_institution" content="UCSB, The National Center for Ecological Analysis and Synthesis, Santa Barbara, CA, USA">
<meta name="citation_author_email" content="chris@christopherlortie.info">
<meta name="citation_author" content="Andy Walshe">
<meta name="citation_author_institution" content="Jaybird Running Labs, Park City, Utah, USA">
<meta name="citation_author" content="Hoby Darling">
<meta name="citation_author_institution" content="Jaybird Running Labs, Park City, Utah, USA">
<meta name="citation_author" content="Jamie Parker">
<meta name="citation_author_institution" content="Jaybird Running Labs, Park City, Utah, USA">
<meta name="description" content="Running is a popular and in many respects intuitive sport. Nonetheless, an extensive body of research literature supports and examines the science of running performance. Here, we used meta-analyses and systematic reviews directly associated with running performance to qualitatively describe ten simple rules for better running. Better running is defined as increases in speed, endurance, or reduced likelihood of injury. The general hypothesis topologically examined was that there is sufficient aggregated evidence to leverage effort and interventions for increased performance in running. This hypothesis was supported with significant big-picture evidence for several pillars of better running including training, recovery, and phenomenological levers specific to this sport. These trends are simplified into ten simple rules for runners and researchers alike.">

Interesting that they use the citation_technical_report_* fields rather than citation_journal_title. Google Scholar's indexing guidelines state:

For journal and conference papers, provide the remaining bibliographic citation data in the following tags: citation_journal_title or citation_conference_title, citation_issn, citation_isbn, citation_volume, citation_issue, citation_firstpage, and citation_lastpage. Dublin Core equivalents are DC.relation.ispartof for journal and conference titles and the non-standard tags DC.citation.volume, DC.citation.issue, DC.citation.spage (start page), and DC.citation.epage (end page) for the remaining fields.

For theses, dissertations, and technical reports, provide the remaining bibliographic citation data in the following tags: citation_dissertation_institution, citation_technical_report_institution or DC.publisher for the name of the institution and citation_technical_report_number for the number of the technical report.

So it seems like we may want to consider the technical report route.
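To make the choice concrete, here is a sketch that picks the venue or institution tag by document kind. The tag names come from the guidelines quoted above; the kind labels, the `VENUE_TAGS` mapping, and the `venue_tag` helper are assumptions for illustration only.

```python
# Tag names are taken from the Google Scholar guidelines quoted above.
# The dictionary keys are hypothetical labels chosen for this sketch.
VENUE_TAGS = {
    "journal": "citation_journal_title",
    "conference": "citation_conference_title",
    "dissertation": "citation_dissertation_institution",
    "technical_report": "citation_technical_report_institution",
}


def venue_tag(kind, venue_name):
    """Return the (name, content) pair naming the venue for a document kind."""
    try:
        return VENUE_TAGS[kind], venue_name
    except KeyError:
        raise ValueError(f"unsupported document kind: {kind!r}") from None


print(venue_tag("technical_report", "Manubot"))
```

Going the technical report route would also mean deciding what, if anything, to put in `citation_technical_report_number`.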

bioRxiv

view-source:https://www.biorxiv.org/content/10.1101/515643v1

<meta name="type" content="article" />
<meta name="HW.identifier" content="/biorxiv/early/2019/01/13/515643.atom" />
<meta name="HW.pisa" content="biorxiv;515643v1" />
<meta name="DC.Format" content="text/html" />
<meta name="DC.Language" content="en" />
<meta name="DC.Title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="DC.Identifier" content="10.1101/515643" />
<meta name="DC.Date" content="2019-01-13" />
<meta name="DC.Publisher" content="Cold Spring Harbor Laboratory" />
<meta name="DC.Rights" content="© 2019, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at http://creativecommons.org/licenses/by-nd/4.0/" />
<meta name="DC.AccessRights" content="restricted" />
<meta name="DC.Description" content="Researchers in the life sciences are posting their work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the popularity and practical benefits of preprints are driving policy changes at journals and funding organizations, there is little bibliometric data available to measure trends in their usage. Here, we collected and analyzed data on all 37,648 preprints that were uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find that preprints on bioRxiv are being read more than ever before (1.1 million downloads in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of more than 2,100 per month. We also find that two-thirds of bioRxiv preprints posted in 2016 or earlier were later published in peer-reviewed journals, and that the majority of published preprints appeared in a journal less than six months after being posted. We evaluate which journals have published the most preprints, and find that preprints with more downloads are likely to be published in journals with a higher impact factor. Lastly, we developed Rxivist.org, a website for downloading and interacting programmatically with indexed metadata on bioRxiv preprints." />
<meta name="DC.Contributor" content="Richard J. Abdill" />
<meta name="DC.Contributor" content="Ran Blekhman" />

<meta name="citation_title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="citation_abstract" lang="en" content="&lt;p&gt;Researchers in the life sciences are posting their work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the popularity and practical benefits of preprints are driving policy changes at journals and funding organizations, there is little bibliometric data available to measure trends in their usage. Here, we collected and analyzed data on all 37,648 preprints that were uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find that preprints on bioRxiv are being read more than ever before (1.1 million downloads in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of more than 2,100 per month. We also find that two-thirds of bioRxiv preprints posted in 2016 or earlier were later published in peer-reviewed journals, and that the majority of published preprints appeared in a journal less than six months after being posted. We evaluate which journals have published the most preprints, and find that preprints with more downloads are likely to be published in journals with a higher impact factor. Lastly, we developed Rxivist.org, a website for downloading and interacting programmatically with indexed metadata on bioRxiv preprints.&lt;/p&gt;" />
<meta name="citation_journal_title" content="bioRxiv" />
<meta name="citation_publisher" content="Cold Spring Harbor Laboratory" />
<meta name="citation_publication_date" content="2019/01/01" />
<meta name="citation_mjid" content="biorxiv;515643v1" />
<meta name="citation_id" content="515643v1" />
<meta name="citation_public_url" content="https://www.biorxiv.org/content/10.1101/515643v1" />
<meta name="citation_abstract_html_url" content="https://www.biorxiv.org/content/10.1101/515643v1.abstract" />
<meta name="citation_full_html_url" content="https://www.biorxiv.org/content/10.1101/515643v1.full" />
<meta name="citation_pdf_url" content="https://www.biorxiv.org/content/biorxiv/early/2019/01/13/515643.full.pdf" />
<meta name="citation_doi" content="10.1101/515643" />
<meta name="citation_section" content="New Results" />
<meta name="citation_firstpage" content="515643" />
<meta name="citation_author" content="Richard J. Abdill" />
<meta name="citation_author_institution" content="University of Minnesota" />
<meta name="citation_author_email" content="rabdill@umn.edu" />
<meta name="citation_author_orcid" content="http://orcid.org/0000-0001-9565-5832" />
<meta name="citation_author" content="Ran Blekhman" />
<meta name="citation_author_institution" content="University of Minnesota" />
<meta name="citation_author_email" content="blekhman@umn.edu" />
<meta name="citation_author_orcid" content="http://orcid.org/0000-0003-3218-613X" />
<meta name="citation_date" content="2019-01-13" />

So bioRxiv is setting both Dublin Core and Google Scholar meta tags.
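A quick way to make this comparison for any page is to group its meta tags by naming convention. This is only a sketch using the standard library's `html.parser`; the class name `MetaTagCollector` and the group labels are made up here, and the sample lines are copied from the heads quoted in this thread.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> tags, grouped by naming convention."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if not name:
            return
        lowered = name.lower()
        if lowered.startswith(("dc.", "dcterms.")):
            group = "dublin_core"
        elif lowered.startswith("citation_"):
            group = "google_scholar"
        else:
            group = "other"
        self.groups[group].append((name, attributes.get("content")))


sample = """
<meta name="DC.Title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="citation_journal_title" content="bioRxiv" />
<meta name="dcterms.date" content="2019-02-06" />
<meta name="generator" content="pandoc" />
"""

collector = MetaTagCollector()
collector.feed(sample)
for group, tags in sorted(collector.groups.items()):
    print(group, [name for name, _ in tags])
```

Matching on the lowercased name means `DC.Date` and `dcterms.date` both land in the Dublin Core group.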

Manubot

view-source:https://greenelab.github.io/manubot-rootstock/v/f559600ff1965899b20874e71874794c05787087/

  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="author" content="John Doe" />
  <meta name="author" content="Jane Roe" />
  <meta name="dcterms.date" content="2019-02-06" />
  <meta name="keywords" content="markdown, publishing, manubot" />
  <title>Manubot Rootstock: Manuscript Title</title>

So there is a lot more to set. However, we should push for as much of this as possible to be done by Pandoc. For example, should Pandoc set dc.date rather than dcterms.date? Actually, it looks like the DCQ docs allow both:

<meta name="DC.element" content="Value" />
<meta name="DCTERMS.element" content="Value" />

From http://dublincore.org/documents/dces/:

The fifteen element "Dublin Core" described in this standard is part of a larger set of metadata vocabularies and technical specifications maintained by the Dublin Core Metadata Initiative (DCMI). The full set of vocabularies, DCMI Metadata Terms [DCMI-TERMS]
