Open jhnwllr opened 5 years ago
A new post on the GBIF data blog outlines some of the problems with this type of data:
https://data-blog.gbif.org/post/gbif-molecular-data-quality/
Thanks for the suggestions. Yes, these genomic data can be problematic. I am not sure if we should add a separate function for this, since the metadata are probably the best way to address this problem. For instance, the "IndividualCount" information provided with GBIF data can be very helpful! Are you aware of a list of all providers in GBIF that provide metagenomics data?
This issue is discussed more here: https://discourse.gbif.org/t/metagenomics-and-metacrap/1583/13
This issue has been partly solved on the GBIF side, but "the problem" will likely continue to get worse.
Background
GBIF has recently begun publishing records from a metagenomics publisher, MGnify: https://www.gbif.org/publisher/ab733144-7043-4e88-bd4f-fca7bf858880
Typically these records are bacteria or other microbes. Often, however, they are trace DNA of some plant, animal, insect or something else.
https://www.gbif.org/occurrence/taxonomy?publishing_org=ab733144-7043-4e88-bd4f-fca7bf858880
Problems
Solutions
A cc_metagenome() function that simply filters out datasets published by MGnify or other metagenomics publishers.
Use "Organism quantity" and "Sample size value" to judge the quality of the resulting taxon label (example), but this solution would probably need expert input.
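To make the two suggestions above more concrete, here is a rough sketch in plain Python rather than as a package function. It is not an existing implementation. The column names publishingOrgKey, organismQuantity and sampleSizeValue are assumed to be present in the records, the helper names relative_abundance() and flag_low_abundance() are hypothetical, and the 0.001 cut-off is an arbitrary placeholder that would need the expert input mentioned above. The only value taken from this thread is the MGnify publisher key from the publisher URL.

```python
# Publisher key copied from the MGnify publisher URL quoted in this issue.
MGNIFY_KEY = "ab733144-7043-4e88-bd4f-fca7bf858880"


def cc_metagenome(records, publisher_keys=(MGNIFY_KEY,),
                  key_field="publishingOrgKey"):
    """Drop records published by any of the listed metagenomics publishers.

    `records` is a list of dicts, one per occurrence record.
    `key_field` is an assumed column name for the publisher key.
    """
    blocked = set(publisher_keys)
    return [r for r in records if r.get(key_field) not in blocked]


def relative_abundance(record):
    """Return organismQuantity / sampleSizeValue, or None if not computable."""
    try:
        quantity = float(record["organismQuantity"])
        size = float(record["sampleSizeValue"])
    except (KeyError, TypeError, ValueError):
        return None  # field missing, empty, or not numeric
    if size <= 0:
        return None
    return quantity / size


def flag_low_abundance(records, min_fraction=0.001):
    """Return one boolean per record: True if the taxon label looks suspect.

    A record is flagged only when its relative abundance is known and falls
    below `min_fraction` (a placeholder threshold, not a recommendation).
    Records without usable quantity fields are left unflagged.
    """
    flags = []
    for record in records:
        fraction = relative_abundance(record)
        flags.append(fraction is not None and fraction < min_fraction)
    return flags


# Made-up records, for illustration only.
records = [
    {"species": "Escherichia coli", "publishingOrgKey": MGNIFY_KEY,
     "organismQuantity": "500", "sampleSizeValue": "1000"},
    {"species": "Panthera leo", "publishingOrgKey": MGNIFY_KEY,
     "organismQuantity": "1", "sampleSizeValue": "100000"},
    {"species": "Quercus robur", "publishingOrgKey": "another-publisher"},
]

kept = cc_metagenome(records)          # publisher-based filter
suspect = flag_low_abundance(records)  # quantity-based flag
```

The publisher filter is blunt but needs no judgement calls. The quantity-based flag keeps genuine microbial records from the same publisher, at the cost of having to choose a threshold.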