viralemergence / clover

Four-in-one value pack of host-virus association data

suggest to include NCBI taxon ids and more #4

Open jhpoelen opened 3 years ago

jhpoelen commented 3 years ago

Hi!

I was just made aware of your exciting mammal-virus association aggregate dataset sourced from existing datasets (see https://github.com/globalbioticinteractions/globalbioticinteractions/issues/585 ).

As I was reviewing your impressive work, the following thoughts/ideas came to mind:

  1. In https://github.com/viralemergence/clover/blob/main/output/Clover_v1.0_NBCIreconciled_20201218.csv , many of the virus and host taxa have been resolved against the NCBI taxonomy. However, the NCBI taxon ids for host and virus are not included. Did you consider adding these resolved taxon identifiers (e.g., NCBI:txid9606 for Homo sapiens) in separate columns like virusNameId and hostNameId? I think this would not only help downstream workflows, but would also be consistent with NCBI's citation guidelines.
  2. You cite the datasets you reused in the fields Database and DatabaseVersion; however, no full citation or DOI is provided. Did you consider adding a DatabaseDOI and/or DatabaseCitation field to help others retrace the provenance of your host-virus association claims?
  3. re: filename https://github.com/viralemergence/clover/blob/main/output/Clover_v1.0_NBCIreconciled_20201218.csv - you've included version information inside the filename (e.g., v1.0 and 20201218) even though you are using git for version control. If you leave out this information, others might have an easier time re-using your data in the future (e.g., no need to update R scripts when you release a new version). Also, I wonder whether _NBCIreconciled_ was meant to be _NCBIreconciled_ (notice NBCI -> NCBI).
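
To illustrate the first suggestion, here is a minimal Python sketch of what adding the identifier columns might look like. The input column names `Host` and `Virus`, the `NCBI_IDS` lookup table, and both helper functions are assumptions made for illustration; they are not taken from the Clover pipeline, and the real ids would come from your existing NCBI reconciliation step.

```python
# Hypothetical lookup from resolved taxon names to numeric NCBI taxon ids.
# In practice this would be produced by the NCBI reconciliation step.
NCBI_IDS = {
    "Homo sapiens": 9606,
}


def format_taxon_id(taxid):
    """Format a numeric id as e.g. 'NCBI:txid9606'; empty string if unresolved."""
    return f"NCBI:txid{taxid}" if taxid is not None else ""


def add_taxon_id_columns(rows, host_col="Host", virus_col="Virus"):
    """Yield copies of the rows with hostNameId and virusNameId columns added.

    The default column names are assumed, not the actual Clover headers.
    """
    for row in rows:
        out = dict(row)  # copy so the input rows are left untouched
        out["hostNameId"] = format_taxon_id(NCBI_IDS.get(row[host_col]))
        out["virusNameId"] = format_taxon_id(NCBI_IDS.get(row[virus_col]))
        yield out


# Usage with a placeholder record; the virus name is deliberately unresolved.
rows = [{"Host": "Homo sapiens", "Virus": "Example virus"}]
for record in add_taxon_id_columns(rows):
    print(record["hostNameId"], repr(record["virusNameId"]))
```

Leaving unresolved names with an empty id, rather than dropping the row, keeps the associations intact while making it easy to see which taxa still need reconciliation.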

Hope this helps and curious to hear your comments / thoughts.

cjcarlson commented 3 years ago

Thanks so much @jhpoelen - some of these are already on our to-do list and others will be clearer with the incoming preprint! I'll leave this open until the rest are addressed.

jhpoelen commented 3 years ago

@cjcarlson Thanks for responding and good luck with getting your publication out there!

PS. If you'd like to have your dataset indexed directly by GloBI, please let me know and I can prepare a pull-request with some index configuration (e.g., schema mapping, citation info). Note that Shaw et al. (https://github.com/liampshaw/Pathogen-host-range/pull/3) and Urban et al. (https://github.com/PHI-base/data/pull/2) accepted such pull requests in the past to keep their indexed data up-to-date.