sparsity analysis of NCBI Biosamples via relational database created by this repo

turbomam commented 6 months ago

@ssarrafan, @emileyfadrosh, and others from NMDC went through milestones and priorities and wanted to follow up on something that's not a specific milestone but is a priority for Emiley's talk in June.

The request is for @turbomam to summarize the relational-structured NCBI Biosamples database, with the goal of showing how much data is there, how sparse the metadata is, and how much more valuable the data would be with metadata.

[ ] How much data is in there
- just a row count of Biosamples, or broken out by packages/checklists/whatever?
[ ] sparsity
- presumably we wouldn't want to count "missing", empty strings, etc. as useful values, and would therefore replace them with NULL before calculating sparsity. There are lots of strings that indicate missing values in there, so we should probably prioritize the most common ones.
[ ] what about metadata that isn't in a machine-actionable format?
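
To make the first two checklist items concrete, here is a minimal sketch of counting Biosamples per package and computing per-column sparsity after mapping placeholder strings to NULL. Everything named here is an assumption for illustration: the column names (`package`, `lat_lon`, `depth`, `ph`), the toy rows, and the `MISSING_TOKENS` list are made up and are not the schema or contents of the database this repo builds. In practice the token list would come from a frequency count of the real values, and the rows would come from a query.

```python
from collections import Counter

# Hypothetical placeholder strings treated as missing. The real list should be
# built from the most common values actually found in the database.
MISSING_TOKENS = {"", "missing", "not collected", "not applicable", "na", "n/a", "none", "unknown"}


def normalize(value):
    """Return None for NULLs and placeholder strings, else the stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_TOKENS else text


def sparsity(rows, columns):
    """Fraction of rows with no usable value, per column."""
    total = len(rows)
    if total == 0:
        return {}
    missing = Counter()
    for row in rows:
        for col in columns:
            if normalize(row.get(col)) is None:
                missing[col] += 1
    return {col: missing[col] / total for col in columns}


# Toy rows standing in for a wide Biosample attribute table (made-up data).
rows = [
    {"package": "soil", "lat_lon": "38.5 N 121.7 W", "depth": "10 cm", "ph": "missing"},
    {"package": "soil", "lat_lon": "not collected", "depth": "", "ph": None},
    {"package": "water", "lat_lon": "12.0 S 77.1 W", "depth": "N/A", "ph": "7.2"},
    {"package": "generic", "lat_lon": "41.2 N 70.3 W", "depth": None, "ph": "Not Applicable"},
]

# "How much data": row counts broken out by package.
counts_by_package = Counter(row["package"] for row in rows)

# "Sparsity": share of rows lacking a usable value in each column.
result = sparsity(rows, ["lat_lon", "depth", "ph"])
```

Normalizing before counting matters: without it, a column filled with "not collected" would look fully populated.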

emileyfadrosh commented 6 months ago

Thanks, @turbomam! To add a few more details on some ideas I have had (please let me know what is/isn't feasible!):

For how much data, is it possible to have not just the count of biosamples, but also the total amount of sequence data in petabytes/petabases? Ideally we can say: across these hundreds of thousands of metagenomes that amount to 2 (?) petabases of sequence data....
For sparsity, is it possible to say: ____ biosamples (from the hundreds of thousands above) ONLY have latitude and longitude? And ____ biosamples have depth, or ____ biosamples have pH, or some other dramatic number of samples that do not have anything beyond latitude/longitude.
This could be a funny punchline: is there a particularly egregious example (sentimeters?!) or another non-machine-actionable example that would be both silly and instructive?
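
The second and third ideas could be sketched roughly as follows: count samples whose only usable contextual field is latitude/longitude, and flag depth values that do not parse as a number plus a recognized unit. The field names, sample IDs, placeholder tokens, and the `KNOWN_UNITS` set are all hypothetical choices for this example, not taken from the actual database.

```python
import re

# Hypothetical placeholder strings treated as missing values.
MISSING_TOKENS = {"", "missing", "not collected", "not applicable", "na", "n/a"}

# Hypothetical contextual fields to check; a real analysis would use many more.
CONTEXT_FIELDS = ["lat_lon", "depth", "ph"]

# Assumed set of acceptable depth units for this sketch.
KNOWN_UNITS = {"m", "cm", "mm"}
DEPTH_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def has_value(value):
    """True if the value is neither NULL nor a placeholder string."""
    return value is not None and str(value).strip().lower() not in MISSING_TOKENS


def only_lat_lon(sample):
    """True if lat_lon is usable and every other contextual field is not."""
    others = [f for f in CONTEXT_FIELDS if f != "lat_lon"]
    return has_value(sample.get("lat_lon")) and not any(
        has_value(sample.get(f)) for f in others
    )


def depth_is_machine_actionable(value):
    """True if the depth looks like '<number> <known unit>'."""
    match = DEPTH_PATTERN.match(value)
    return bool(match) and match.group(2).lower() in KNOWN_UNITS


# Made-up samples for illustration only.
samples = [
    {"id": "S1", "lat_lon": "38.5 N 121.7 W", "depth": "missing", "ph": None},
    {"id": "S2", "lat_lon": "12.0 S 77.1 W", "depth": "10 cm", "ph": "7.2"},
    {"id": "S3", "lat_lon": "41.2 N 70.3 W", "depth": "15 sentimeters", "ph": ""},
    {"id": "S4", "lat_lon": "not collected", "depth": None, "ph": None},
]

only_lat_lon_count = sum(only_lat_lon(s) for s in samples)

# Depth values that are present but not parseable as number + known unit.
flagged_depths = [
    s["depth"]
    for s in samples
    if has_value(s.get("depth")) and not depth_is_machine_actionable(s["depth"])
]
```

Sorting the flagged values by frequency would be one way to surface candidates for the "egregious example" slide.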

Thanks! I am SUPER excited about this effort :) Also, if @cmungall has ideas, input is very welcome here!

turbomam commented 6 months ago

Idea number 1 is great, but I'm probably not the right person to do it, beyond giving a list of Biosample and SRA IDs and/or accessions.

turbomam / biosample-xmldb-sqldb

sparsity analysis of NCBI Biosamples via relational database created by this repo #42