Grey literature section or instance

User Story (TODO)

As I want So that

Acceptance Criteria

Is your feature request related to a problem? Please describe.

"Grey literature" means things that are NOT reusable "data" and are NOT part of the "narrative", i.e. the material that doesn't really fit in either GigaByte or GigaDB. It may also include Peer Reviews.

Describe the solution you'd like

We want to get a consensus from the team on whether a GreyLit GigaDB instance should be created. Whichever solution is chosen needs to be fully specced up following the scrum-based guidance on story writing from Rija. The basic idea of the GreyLit GigaDB would be to host datasets of specific items that are not data but are also not appropriate as part of the narrative; this could include Reviews. The same database structure as GigaDB could easily be reused. There should be little or no requirement for the curator steps in the process, making it more like a FigShare bucket of "stuff", with a fairly small size limit to stop people putting data files in there that should be in GigaDB!

Describe alternatives you've considered

There are 5 options to be considered:

1 - Allow supplemental files in GigaByte, the same as most other journals
Pro - It's simple and commonplace, so everyone will understand it
Con - It's not possible to index or find any information hosted in supplemental files, and they are often not checked properly by reviewers, so this could allow a back door to publish less confident/stringent results/opinions
Con - GigaByte is designed to be different, to be the future, to do things properly and not just easily

2 - Allow any files in GigaDB datasets
Pro - It's simple to implement
Pro - No need to define GreyLit vs Data (all files that don't fit in GigaByte just go in GigaDB)
Pro - It keeps all the associated information in 1 place
Con - It bloats GigaDB and dilutes the well curated and tended data we currently host
Con - Reviews in particular are not "data" and cannot be included with the associated dataset due to authorship, so they would need to have their own datasets

3 - Include grey lit in GigaDB but create a separate section for it (see #78)
Pro - Allows us to keep only 1 instance of GigaDB, so simplifies the admin aspect
Pro - Allows users to include/exclude data/greyLit from their queries
Con - Bloats GigaDB
Con - Reduces curator control on what is appropriate in GigaDB (maybe this is a Pro too?!!)
Con - Requires thought and documentation on which files are greyLit vs Data

4 - Create an entirely new instance of GigaDB specifically for grey lit
Pro - Keeps the GreyLit separate from data
Pro - Allows changes to be made to the infrastructure to better suit no-curation datasets
Pro - Allows GigaByte to point to individual DOIs of supplemental objects such as large tables or interactive charts
Con - Requires thought and documentation on which files are greyLit vs Data
Con - Potentially means 1 paper has multiple associated datasets (1 GigaDB + >1 GreyLit)
Con - Means multiple (potentially diverging) platforms for developers to maintain

5 - Some combination of 3 & 4, where the underlying platform is united but the display is separated
Pro - Allows us to keep only 1 instance of GigaDB, so simplifies the admin aspect
Pro - Allows users to include/exclude data/greyLit from their queries
Pro - Keeps the GreyLit separate from data
Pro - Allows GigaByte to point to individual DOIs of supplemental objects such as large tables or interactive charts
Con - Requires thought and documentation on which files are greyLit vs Data
Con - Potentially means 1 paper has multiple associated datasets (1 GigaDB + >1 GreyLit)
Con - Bloats GigaDB (but mitigated by making the user perspective look like 2 separate things)
Con - Probably requires a lot of developer time to allow a dual workflow for 2 different sections

Some specific examples of GreyLit and the issues it causes:

Ex1. https://gigabytejournal.com/articles/9
Table 1 in the above manuscript has too many rows to present it nicely within GigaByte. The information in the table is just a summary of the various data that are hosted in GigaDB already so not required or wanted as a data table in GigaDB, but in terms of length, the manuscript is ~25% Table 1!

Ex2. https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100829/STORMS_Checklist_2020.09.15.report.pdf
The STORMS checklist is a useful bit of information but is directly related to the manuscript rather than the data, so it doesn't really fit in GigaDB; equally, it's not part of the narrative, so it doesn't fit in the manuscript itself. We have hosted it in GigaDB because we don't really have an alternative yet.

Ex3. https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100302/Documents/Approval_Human_Genetic_Resources_Administration_of_China_HGRAC_for_group1_samples.pdf
This PDF, showing the approval from the Human Genetic Resources Administration of China to share the data, is not appropriate in the manuscript but should be referenced directly from it. As it is directly related to the data, I think it is OK to have it in GigaDB, but it would equally fit nicely in a grey lit archive and be linked to from the data.

Ex4. https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100835/IRB_approval.pdf
Similar to the release approval above, this IRB approval document would be well suited to having its own DOI in a greyLit archive.

Ex5. https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100272/manual%20of%20drVM.pdf and https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100698/supplementary_package/data/segata_2017/get_curatedMetagenomicData_2018_08_16.html
Additional methodology like this is extremely useful information, but it is not data and is too in-depth for the narrative manuscript. It represents a fairly common sort of thing that is included in supplemental files. For informatics methodology, it is now more commonplace for this to be included in a readme.md file within a GitHub repository, so we see fewer of them as individual files now.

Ex6. https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100436/MS_supplemental_files/Additional_file_32.pdf
There are times when the author needs to create composite figures from image data and other information to enable interpretation. Normally these would be included in the manuscript as narrative, with the individual parts of the figure in GigaDB as data. This dataset includes MANY of these figures, too many to be included in a GigaByte manuscript, but they do offer better insights into the interpretation of the data than just the individual data points, so they have some value as supplemental files. They would therefore probably be a good candidate for a GreyLit archive?!

NB there was previous discussion on the topic of how we handle supplemental files in GigaByte recorded here

I like the concept of "Allow any files in GigaDB datasets" as it allows us to deliver the pitch, "Publish with us and we can offer one terabyte of storage".

If we do not allow authors to submit files of their choice, then alternative repositories such as FigShare, Dryad, or Zenodo may be more appealing.

This item needs to be prioritised due to an imposed hard deadline of Jan 1st 2022. By then we must have some option to host Reviews somewhere as Publons will no longer support our usage (without a large fee).

I believe option 3 is probably the best option: 3 - Include grey lit (reviews) in GigaDB but create a separate section for it

We can refine the story to be specific to reviews. The database schema should be able to handle this simply by adding a dataset_type value of "review" and then creating a slightly different dataset-page view for datasets of type=review.
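As a rough illustration of the dataset_type idea above, here is a minimal Python sketch. It does not reflect the actual GigaDB schema or codebase: the `Dataset` class, the template names, and the placeholder DOIs are all assumptions made for the example, and only the `dataset_type` value of "review" comes from the proposal.

```python
from dataclasses import dataclass


# Hypothetical in-memory stand-in for a dataset record. Only the
# dataset_type field (and its "review" value) comes from the proposal;
# every other name here is an assumption for illustration.
@dataclass
class Dataset:
    doi: str
    title: str
    dataset_type: str = "data"


def select_view(dataset):
    """Pick a (hypothetical) page template name based on the dataset type."""
    if dataset.dataset_type == "review":
        return "dataset_review_view"
    return "dataset_view"


def filter_by_type(datasets, include_types):
    """Keep only datasets whose type is in include_types."""
    wanted = set(include_types)
    return [d for d in datasets if d.dataset_type in wanted]


# Placeholder records; the DOIs are not real.
datasets = [
    Dataset("10.5524/PLACEHOLDER-1", "Example genome assembly"),
    Dataset("10.5524/PLACEHOLDER-2", "Example peer review", dataset_type="review"),
]

for d in datasets:
    print(d.doi, "->", select_view(d))
```

The same type field could also drive the include/exclude filtering mentioned under option 3, for example `filter_by_type(datasets, ["data"])` to hide reviews from a search.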

We would need to clearly specify that slightly altered dataset page view: what exactly needs to be shown, and how? Then we will need to clearly define how the review details are going to be added to GigaDB. From where? Do the authors submit them directly to GigaDB? How do we get the user's (reviewer's) personal information to create an account for them? Will there be any option of anonymous reviews?

Update: Peer Reviews are being dealt with separately, as there are fundamental differences in structure and metadata that mean it will be easier in a separate system; see project page.

I have also been giving it more thought and asked myself the question:

What is the main reason for not wanting the grey-lit WITHIN the dataset?

The simple answer is control of data quality. By allowing users to upload anything, we risk diluting the level of reproducibility and reusability of files. So instead of splitting out the grey-lit from GigaDB, maybe we can include it, BUT also include specific automated checks on things that are not considered "reusable" files, with a flag to curators to investigate them.

The automated checks would alert submitters to inappropriate use of Excel files instead of CSV; we can automate some checks on content to see if there are multiple tabs and what size the sheets are. PDF files should be flagged to authors with suggestions of alternatives, and flagged for curators to check. For methods documents, ideally we should have in place a means to assist submitters with creating protocols.io entries from their supplemental methods documents. For things like phylogenetic trees, we need an online method to display them within GigaDB so that authors can just provide the tree file and underlying alignments instead of the PDF image generated from those. For things like Ex1 above, large tables that need to be cited directly from within a manuscript, we need the ability to point directly to a file via a stable URI (or DOI).

I think all those things should be in place before we start accepting grey-lit in GigaDB. Otherwise, the precedent gets set and people will use the early examples of grey-lit files getting in as evidence that they can just dump whatever they like in GigaDB. Or maybe I'm just too controlling?!
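To make the Excel and PDF checks above a little more concrete, here is a minimal Python sketch using only the standard library. The function names, the flag wording, and the rules themselves are assumptions for illustration, not an agreed policy. It counts tabs by reading the workbook part of an .xlsx file (which is a zip archive); legacy .xls files and the sheet-size check are not covered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# XML namespace of the workbook part in an .xlsx (Office Open XML) file.
_NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def count_xlsx_sheets(data):
    """Count the worksheets declared in xl/workbook.xml of an .xlsx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return len(root.findall("m:sheets/m:sheet", _NS))


def check_file(name, data=b""):
    """Return a list of advisory flags for one uploaded file.

    The rules are hypothetical examples of the checks discussed above.
    """
    suffix = PurePosixPath(name).suffix.lower()
    flags = []
    if suffix == ".pdf":
        flags.append("pdf: suggest an alternative format; curator to check")
    elif suffix in {".xls", ".xlsx"}:
        flags.append("excel: consider CSV instead")
        # Only .xlsx is inspected here; .xls is a different binary format.
        if suffix == ".xlsx" and data:
            sheets = count_xlsx_sheets(data)
            if sheets > 1:
                flags.append(f"excel: {sheets} tabs; one CSV per tab may be needed")
    return flags
```

A submission pipeline could call `check_file` on each uploaded file, show the flags to the submitter, and copy them into a curator queue; how those flags are surfaced is left open here.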

gigascience / gigadb-website

Grey literature section or instance #565
