@ssarrafan, @emileyfadrosh, and others from NMDC went through milestones and priorities and wanted to follow up on something that's not a specific milestone but is a priority for Emiley's talk in June.
The request is for @turbomam to summarize the relational-structured NCBI Biosamples database, with the goal of showing how much data is there, how sparse the metadata is, and how much more valuable the data would be with more complete metadata.
[ ] How much data is in there
just a row count of Biosamples, or broken out by packages/checklists/whatever?
[ ] sparsity
Presumably we wouldn't want to count "missing", empty strings, etc. as useful values, and would therefore replace them with NULL before calculating sparsity. There are lots of strings that indicate missing values in there, so we should probably prioritize the most common ones.
[ ] what about metadata that isn't in a machine actionable format?
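For the sparsity item, here is a rough sketch of the "replace placeholders with NULL, then measure" idea. Everything in it is an assumption for illustration: the `MISSING_TOKENS` list is a guess (the real list should come from a frequency count of values in the database), the field names are just examples, and the rows are made up rather than pulled from the actual tables.

```python
from collections import Counter

# Assumed placeholder strings; the real list should be built from the
# most common "missing"-like values actually found in the database.
MISSING_TOKENS = {"", "missing", "not collected", "not applicable",
                  "na", "n/a", "none", "unknown"}


def normalize(value):
    """Return None for empty or placeholder values, else the stripped string."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_TOKENS else text


def sparsity(rows, fields):
    """Fraction of rows with no usable value, per field."""
    total = len(rows)
    missing = Counter()
    for row in rows:
        for field in fields:
            if normalize(row.get(field)) is None:
                missing[field] += 1
    return {f: (missing[f] / total if total else 0.0) for f in fields}


# Made-up rows, only to show the shape of the calculation.
rows = [
    {"lat_lon": "38.98 N 77.11 W", "depth": "missing", "ph": ""},
    {"lat_lon": "not collected", "depth": "10 m", "ph": None},
    {"lat_lon": "45.0 N 122.0 W", "depth": "N/A", "ph": "7.2"},
    {"lat_lon": "12.5 S 40.1 E", "depth": "", "ph": "Missing"},
]
result = sparsity(rows, ["lat_lon", "depth", "ph"])
print(result)
```

The same normalization could presumably be done in SQL instead; the point is only that placeholder strings get treated as NULL before any sparsity number is reported.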
Copying from @emileyfadrosh
Thanks, @turbomam! To add a few more details on some ideas I have had (please let me know what is/isn't feasible!):
For how much data, is it possible to have not just how many counts of biosamples, but also the total amount of sequence data in petabytes/petabases? Ideally we can say: across these hundreds of thousands of metagenomes that amount to 2 (?) petabases of sequence data....
For sparsity, is it possible to say: ____ biosamples (from the hundreds of thousands above) ONLY have latitude and longitude, ____ biosamples have depth, and ____ biosamples have pH? Or some other dramatic number of samples that do not have anything beyond latitude/longitude.
this could be a funny punchline: is there a particularly egregious example (sentimeters?!) or another non-machine actionable example that would be both silly and instructive?
Thanks! I am SUPER excited about this effort :) also if @cmungall has ideas, very welcome to input here!
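As a possible way to frame the "ONLY have latitude and longitude" number requested above, here is a minimal sketch. The helper names (`has_value`, `count_only`), the placeholder list, the field names, and the sample records are all hypothetical; the real query would run against the relational Biosamples tables.

```python
# Assumed placeholder strings that should not count as real values.
MISSING_TOKENS = {"", "missing", "not collected", "not applicable", "n/a"}


def has_value(value):
    """True when the value is present and not a known placeholder."""
    return value is not None and str(value).strip().lower() not in MISSING_TOKENS


def count_only(samples, only_field, other_fields):
    """Count samples where only_field is populated and no other field is."""
    return sum(
        1
        for s in samples
        if has_value(s.get(only_field))
        and not any(has_value(s.get(f)) for f in other_fields)
    )


# Made-up records for illustration only.
samples = [
    {"lat_lon": "38.98 N 77.11 W", "depth": "missing", "ph": ""},
    {"lat_lon": "45.0 N 122.0 W", "depth": "10 m"},
    {"lat_lon": "not collected", "depth": None},
    {"lat_lon": "12.5 S 40.1 E"},
]
only_lat_lon = count_only(samples, "lat_lon", ["depth", "ph"])
print(only_lat_lon)
```

Which fields go into `other_fields` changes the number a lot, so that choice would need to be stated alongside any count used in the talk.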
To move from https://github.com/turbomam/biosample-xmldb-sqldb/issues/42