samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

How to cite specs #179

Open magicDGS opened 7 years ago

magicDGS commented 7 years ago

How can I cite the current specs?

For instance, SAM was defined in http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp352, but the specifications have changed a lot since then. In addition, the SAM tag definitions are also in active development.

Thanks in advance!

michaelmhoffman commented 2 years ago

A Digital Object Identifier (DOI) for GA4GH standards would be nice and would lend some extra stability to the process. I'm not sure what is involved in minting one's own or getting someone else to mint them for GA4GH. Another option would be to deposit them at Zenodo.

jkbonfield commented 2 years ago

I think this is something to push upstream to GA4GH coordinators as we should have the same mechanism available to all published documents.

I like the DOI approach (or DOI plus something else).

susanfairley commented 2 years ago

Agree with @jkbonfield and @michaelmhoffman. I can see why a DOI approach would make sense and also having a consistent approach to our documents.

jmarshall commented 2 years ago

To summarise and expand on the discussion on slack: I think when we have discussed this issue in the past, the recommendation has been to cite the appropriate URL[^1] and/or an appropriate paper such as one of these primordial papers.

[^1]: Yes, I'm glossing over some details here :smile:

To the extent that DOIs are useful identifiers for use in papers' lists of citations,[^2] and to the extent that it is useful for papers to directly reference specification documents,[^3] it would be useful to mint DOIs for format and protocol specifications. It would be best to have a pan-GA4GH approach to this, so :+1: to addressing this via e.g. @ga4gh/TASC. I believe LSG has not done anything in this space to date; I wonder whether any other work streams have.

[^2]: Fairly uncontroversially accepted, I think.

[^3]: More debatable IMHO. Using the 2009 paper as a proxy for “the SAM format” works pretty well in practice.

Options for GA4GH to investigate would appear to be:

  1. Registering GA4GH as a DOI Registration Agency, hence presumably getting our own 10.nnnn prefix and allocating DOIs ourselves. (Probably not a particularly attractive approach, but it would be worth investigating what other similar standards organisations do w.r.t. DOIs. In fact, it appears that that really is the complete list of RAs, so presumably this is off the table and similar standards organisations to us operate instead as members of one or other of the consortia RAs, probably Crossref, if at all.)

  2. Selecting an established RA to become a member of, and using their services to mint DOIs as appropriate. For example, as noted by @michaelmhoffman in the slack conversation, Crossref has been around for 20+ years, is a consortium of relevant looking academic societies and publishers, and has a line item for “Standards”.

    This does require a bit of thinking about. For example some of Crossref's membership obligations are somewhat onerous: e.g. in GA4GH's context, “use the DOI as the permanent link to the page” does not seem entirely appropriate.

    (Doing this involves spending money. At a glance: a little money for Crossref; rather more for DataCite.)

  3. Depositing specification documents at e.g. Zenodo. However note that this would surely be depositing a copy of (a particular version of) a specification document; it does not provide a stable DOI referring to the canonical “SAM format” or “BED specification” or etc. Moreover IMHO this level of informality and dependence on a third party would be unbecoming for GA4GH.

  4. Ensuring that there is always an up-to-date paper to be used as a proxy, instead of creating DOIs directly for formats and specifications. For example, GA4GH could write[^4] an annual survey article outlining its recent activities (similar perhaps to Ensembl 2021 et al) and/or work streams could aim to publish (brief) papers corresponding to new specifications or major additions to existing ones.

[^4]: And hopefully have the clout to get it published!

Another question to be investigated would be whether a direct DOI identifies e.g. “the SAM format” as a platonic ideal, or whether we would want to mint DOIs corresponding to particular versions of specifications. If the latter, note that “which editions of specifications are the important ones, and how do we refer to them (permanently)” is an ongoing discussion already and has different answers depending on how work streams have organised their work.

jmarshall commented 2 years ago

The following also came up in the slack conversation:

> Would you want to reference the doi in the bed file as a header?

IMHO DOIs would not add a lot of value in other contexts. In particular, there would not be any point in writing out a specification DOI in the headers of, e.g., a BED file. A file can be determined to be a BED file by other means (perhaps via a magic number in future; today, by recognising the tabular data; via filename extension), so adding an explicit DOI as well as other BED-specific magic number headers wouldn't really add anything. One of the few similar things in other file formats would be XHTML-style DTDs, which HTML5 has moved away from.
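To illustrate the "recognising the tabular data; via filename extension" point, here is a minimal sketch of such a sniffing heuristic. It is not taken from any specification or tool: the function name `looks_like_bed` is hypothetical, and the rules (tab-delimited fields, at least three columns, integer start and end with start not after end, and skipping `#`, `track` and `browser` lines) are assumptions chosen for illustration.

```python
import re

_UINT = re.compile(r"[0-9]+")

def looks_like_bed(lines, filename=None):
    """Heuristically guess whether text lines form a BED file (assumed rules)."""
    # Assumption: a ".bed" extension is taken at face value.
    if filename is not None and filename.lower().endswith(".bed"):
        return True
    saw_data = False
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # blank line
        first_word = line.split(None, 1)[0]
        # Assumption: comment lines and UCSC-style track/browser lines carry no data.
        if line.startswith("#") or first_word in ("track", "browser"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            return False
        start, end = fields[1], fields[2]
        if not (_UINT.fullmatch(start) and _UINT.fullmatch(end)):
            return False
        if int(start) > int(end):
            return False
        saw_data = True
    # Only claim "BED" if at least one data line passed the checks.
    return saw_data
```

A heuristic like this can misfire on other three-column tabular files, which is part of why a magic number might be attractive in future.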

jkbonfield commented 2 years ago

I think this is actually two-fold: author credit and specification document linkage.

If we need to cite a specific version of a standard it ought to already be in the file headers (@HD VN:1.6, ##fileformat=VCFv4.3). If it's text in a paper, then referring to the version number should be sufficient. Some journals can be right pains in the neck though and refuse to accept a citation unless it's in their "recognised club". Although I've argued and been successful before in getting RFCs accepted as citations.
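As a rough illustration of reading the version from those headers, the sketch below pulls the format and version out of a file's first line. The function name `spec_version` is hypothetical, only the two header forms quoted above are handled, and it assumes the SAM `@HD` fields are tab-separated as they are in real files (the example above shows a space for readability).

```python
import re

# Patterns for the two header forms quoted above; other formats would need their own.
_SAM_HD = re.compile(r"^@HD\t(?:.*\t)?VN:([0-9]+\.[0-9]+)(?:\t|$)")
_VCF_FF = re.compile(r"^##fileformat=VCFv([0-9]+\.[0-9]+)\s*$")

def spec_version(first_line):
    """Return (format, version) for a SAM or VCF first header line, else None."""
    line = first_line.rstrip("\r\n")
    m = _SAM_HD.match(line)
    if m:
        return ("SAM", m.group(1))
    m = _VCF_FF.match(line)
    if m:
        return ("VCF", m.group(1))
    return None
```

For example, `spec_version("##fileformat=VCFv4.3\n")` yields `("VCF", "4.3")`, which is the version a paper could then quote in its text.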

For standards with existing papers, the authors almost certainly want their work cited. So for SAM/BAM it's the original paper. For CRAM 3.0 and earlier it'd probably need to be the pre-CRAM EBI paper (Fritz et al) in lieu of anything better and for CRAM 3.1 it'd be my own paper. For BED I guess it's BEDtools. For VCF it'd be Petr's 2011 paper.

However those papers typically are for purposes of tracking author credit and aren't specifically citing the version of the standard being used by the file. Quite often they're way out of date too. Plus if we're looking at this from a credit perspective, often the people doing the legwork now aren't the same ones who originally published. Eg citations to the VCF spec don't give credit to any of the current specification maintainers.

At the very least each standard should probably come with a citation section outlining the preferred mechanism to cite it. This can include past papers, but also should recommend a modern citeable object where they substantially differ in authorship. We ought to lodge a new version every time there is a formal update to the specification version number.

One possibility is looking into protocols.io. I don't know if they specifically accept file formats or network protocols, but it does feel like a good fit to me and they have linkage within their protocols to other protocols, so this feels like a foundation level for them.

michaelmhoffman commented 2 years ago

There is a UCSC Genome Browser paper where BED is originally mentioned, which I treat as the canonical journal article reference for BED. We also describe the drafting of the GA4GH BED standard in the Acidbio paper so one could make a case for citing that too if you are using the GA4GH BED spec.

There are a lot of advantages to using DOIs for a centralized authoritative reference to GA4GH specifications or other documents, without regard to the cultural role of DOIs in academic credit. It’s actually academic credit where I thought of this, however: our faculty annual activity report asks for a list of documents (mostly journal articles) with a column for the document DOI. Having one would be very helpful here and in many other contexts. It would also be a lot easier to track citations to DOIs through existing mechanisms, whether they be scholarly citations or altmetrics.

michaelmhoffman commented 2 years ago

Structured CrossRef metadata may be useful to authoritatively record some of the things people are asking about in the GA4GH Connect session on Product Approval. cc @susanfairley

susanfairley commented 2 years ago

Raising this issue with TASC: https://github.com/ga4gh/TASC/issues/39

ianfore commented 2 years ago

The question "are DOIs the right solution for identifiers?" in the biomedical domain was addressed as part of the FORCE11 Data Citation Implementation Pilot. The group included many known to GA4GH, and involved publishers to ensure the approach was workable from their point of view.

The paper published by the identifiers group is at https://www.nature.com/articles/sdata201829 and an accompanying editorial at https://www.nature.com/articles/sdata201895 .

Another paper outlining the work with publishers is at https://pubmed.ncbi.nlm.nih.gov/30457573/

The use of compact identifiers as an alternate to DOIs might be compared with the DOI approach above.

The work of the group was driven by the Joint Declaration of Data Citation Principles (JDDCP).

jmarshall commented 2 years ago

> The paper published by the identifiers group is at https://www.nature.com/articles/sdata201829 and an accompanying editorial at https://www.nature.com/articles/sdata201895 .

These are both about sets of data as deposited in data repositories. They don't appear to discuss standards specification documents, which may have their own considerations that may be different from the considerations for data sets.

jkbonfield commented 2 years ago

Data citation is key to reproducible science, but that is a different nuance to assigning credit where it's due which is the traditional role of paper citations. Unless there is evidence that grant funding agencies are tracking data citations as well as document citations, then it seems to not be a good fit.

Specifically in the past we have had comments from people working on GA4GH specifications that the time they can contribute is limited because they're in an academic position where their "worth" is judged by funding agencies on paper citations and if work isn't towards a paper then it won't be valued or judged by the people that ultimately pay their wages. This is a rather stark and sad state of affairs which I wish wasn't so short-sighted, but it can be a real problem for some people.

So my ideal would be to ensure that everyone working on updating GA4GH specifications can do so in the knowledge that significant contributions will lead to citations that may be acceptable to grant funding agencies. There is potentially a conversation to be had with them, but the easiest path is a minimal journal submission. Something that's not a fully fledged peer review with months of round-trips faffary, and more along the lines of deposit and get a DOI. Perhaps the "publish immediately with subsequent open/on-going peer-review" model fits; more of a social network style of publishing, or something specific such as protocols.io (hard to see how that integrates with a file format though which could be used in any number of ways).

ianfore commented 1 year ago

Reposting some things I added in chat today in the TASC call.

One of the identifier types created by some of those involved with the FORCE11 effort is RRIDs (Research Resource IDs). They got some traction with publishers. The idea behind RRIDs is to make citable any resource used by a researcher. They cover cell lines, plasmids, antibodies and organisms, and "Tools and Resources", which is where standards might sit. This is the RRID portal: https://scicrunch.org/resources

Using RRIDs, here’s how one would cite samtools: RRID:SCR_002105. This is cram as an RRID: RRID:SCR_012975. The full page for that RRID is https://scicrunch.org/resources/data/record/nlx_144509-1/SCR_012975/resolver?q=cram%20format&l=cram%20format&i=rrid:scr_012975 There's even an infrastructure to log in and claim the resource.
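Because RRIDs in this form are plain text tokens, they are easy to pick out of a manuscript or a URL. The sketch below is only an illustration: `find_rrids` is a hypothetical helper, and it deliberately matches just the `SCR_` form quoted above, not every kind of RRID.

```python
import re

# Matches only the "RRID:SCR_<digits>" form shown above, in any letter case.
_RRID = re.compile(r"\bRRID:\s*(SCR_\d+)\b", re.IGNORECASE)

def find_rrids(text):
    """Return normalised 'RRID:SCR_nnnnnn' strings in order of first appearance."""
    seen = []
    for m in _RRID.finditer(text):
        rrid = "RRID:" + m.group(1).upper()
        if rrid not in seen:  # drop repeats, keep first-seen order
            seen.append(rrid)
    return seen
```

Note that this says nothing about what the identifier refers to; the format-versus-toolkit ambiguity still has to be resolved by looking at the registry record.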

Clearly some ambiguity for the cram example. Is it referencing the format or the toolkit? So probably less than a perfect solution out of the box. But how can we do a “yes, and” with it? I.e. build on it rather than starting from ground-zero ourselves.

Also, their idea of a "tool/resource" doesn't appear to have encompassed standards. Here are the search facets you get for tool/resource: [image: tool-resource facets]

Standards not explicitly there. But service resource is there which is close. That would probably need disambiguation of the service spec and an instance/implementation of the spec.

So does one walk away if they haven't addressed standards? Or, look at how what they have done might be applied to our problem. Or ask them about their thoughts and if they have looked at this.

Open questions: Has the uptake of RRIDs by authors and publishers been sustained? Is an approach where IDs are brought together in a curated database, like RRID, scalable? Or is it better to reference distributed resources, as identifiers.org does?

jkbonfield commented 1 year ago

When we submitted our Samtools-update, Bcftools and Htslib papers last year to GigaScience it was a hard requirement that we also submit the RRID numbers. It turned out these already existed, so we just used them, but I shared your concerns about the quality of the meta-data being listed.

I don't think we bothered to log in and claim ownership though as it was a bit of a can of worms we didn't have time to open.

ianfore commented 1 year ago

Great to hear a real experience with it. Thanks for sharing @jkbonfield . At one level this addresses "Has the uptake of RRIDs by authors and publishers been sustained?" A publisher sustained it, the author (samtools team) had to comply. Very organic! It will likely ever be thus.

GigaScience was always along for the ride with people involved in the FORCE11 and Research Data Alliance work on citation. We need these ground breakers, and to sustain and support one another in best practices like this.

michaelmhoffman commented 1 year ago

A standard is not a "tool" or "resource" like a piece of software or a cell line. A standard is a document.

There is a long history of citing standards as documents. There are even normative international and national standards that specify how a standard is to be cited as a document, in a bibliography. For example, ISO 690[^1] describes this in section 8.11.4, "Standards" which is a subdivision of 8.11, "Reports in series and similar information resources". National Information Standards Organization standards have an ISSN and an ISBN.

In my experience, RRIDs are not used for documents, because we have other ways to cite documents and other identifiers for them. In particular, when an RRID is used, it can replace a traditional citation to a document in the bibliography. Encouraging this would be an error here and would decrease the visibility of GA4GH standards.

GA4GH should not be introducing novelty to citing standards documents by RRID. We should instead use the established ways people identify documents.

[^1]: INTERNATIONAL ORGANIZATION FOR STANDARDIZATION (ISO). ISO 690:2010, Information and documentation — Guidelines for bibliographic references and citations to information resources.[^2]

[^2]: As an example, this is the bibliography entry for ISO 690, in ISO 690 format, copied from the example in the ISO 690:2010 document.

jkbonfield commented 1 year ago

I think the real issue is how to easily generate documents that can be cited in the traditional ways. I agree RRID may not be appropriate for documentation. Publishing in mainstream journals is seriously hard work, taking months of time up. Having something gnarly to periodically wade through like that can be a barrier to getting new people on board. There are however more "light-touch" journals that publish immediately and have more social-media style ongoing review. Some are even dedicated to minimalist work, such as getting a citeable DOI for a new software release. They may not be directly suitable to a standards body though.

If GA4GH has its own official mechanism or a collaboration with someone doing a similar job, for generating permanent citeable DOIs related to each specification version, then they could be officially cited as documents along with getting credit to the authors (and accounting for ongoing turnover of people involved). This doesn't remove the ability to publish in mainstream journals if people wish - it's just a choice people can make.