microbiomedata / issues

public repo for issues related to NMDC work

sparsity analysis of NCBI Biosamples via relational database created by this repo #690

Open turbomam opened 2 months ago

turbomam commented 2 months ago

To move from https://github.com/turbomam/biosample-xmldb-sqldb/issues/42

@ssarrafan, @emileyfadrosh, and others from NMDC went through milestones and priorities and wanted to follow up on something that is not a specific milestone but is a priority for Emiley's talk in June.

The request is for @turbomam to summarize the relational-structured NCBI Biosamples database, with the goal of showing how much data is there, how sparse the metadata is, and how much more valuable the data would be with metadata.

Copying from @emileyfadrosh

Thanks, @turbomam! To add a few more details on some ideas I have had (please let me know what is/isn't feasible!):

  1. For how much data, is it possible to have not just the number of biosamples, but also the total amount of sequence data in petabytes/petabases? Ideally we can say: across these hundreds of thousands of metagenomes that amount to 2 (?) petabases of sequence data....
  2. For sparsity, is it possible to say: ____ biosamples (from the hundreds of thousands above) ONLY have latitude and longitude, ____ biosamples have depth, and ____ biosamples have pH? Or some other dramatic number of samples that do not have anything beyond latitude/longitude.
  3. This could be a funny punchline: is there a particularly egregious example (sentimeters?!) or another non-machine-actionable example that would be both silly and instructive?

Thanks! I am SUPER excited about this effort :) Also, if @cmungall has ideas, input here is very welcome!
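For idea 2, here is a minimal sketch of how the sparsity counts could be pulled from a relational store. Everything about the schema is an assumption made for illustration: the table name `biosample_attribute` and its columns `biosample_id`, `harmonized_name`, and `value` are hypothetical and may not match the database built by biosample-xmldb-sqldb. SQLite is used only to keep the example self-contained, and the demo rows are made up.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a HYPOTHETICAL long-format table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE biosample_attribute ("
        "biosample_id TEXT NOT NULL, "
        "harmonized_name TEXT NOT NULL, "
        "value TEXT)"
    )
    rows = [  # made-up demo data, not real Biosample records
        ("S1", "lat_lon", "38.5 N 121.7 W"),
        ("S2", "lat_lon", "10.0 S 50.0 E"),
        ("S2", "depth", "5 sentimeters"),
        ("S3", "lat_lon", "45.1 N 93.2 W"),
        ("S3", "depth", "0.1 m"),
        ("S3", "ph", "6.8"),
        ("S4", "lat_lon", "0.0 N 0.0 E"),
        ("S5", "lat_lon", "12.3 N 45.6 E"),
        ("S5", "ph", ""),  # empty value, treated as missing below
    ]
    conn.executemany("INSERT INTO biosample_attribute VALUES (?, ?, ?)", rows)
    return conn


def sparsity_counts(conn):
    """Count biosamples by which non-empty attributes they carry."""
    # Note: comparisons evaluate to 0/1 in SQLite; other engines may need
    # a different idiom (for example a boolean aggregate) for the same logic.
    query = """
    WITH per_sample AS (
        SELECT
            biosample_id,
            MAX(harmonized_name = 'lat_lon') AS has_lat_lon,
            MAX(harmonized_name = 'depth')   AS has_depth,
            MAX(harmonized_name = 'ph')      AS has_ph,
            COUNT(DISTINCT harmonized_name)  AS n_attrs
        FROM biosample_attribute
        WHERE value IS NOT NULL AND TRIM(value) <> ''
        GROUP BY biosample_id
    )
    SELECT
        COUNT(*)                             AS total,
        SUM(has_lat_lon = 1 AND n_attrs = 1) AS only_lat_lon,
        SUM(has_depth)                       AS with_depth,
        SUM(has_ph)                          AS with_ph
    FROM per_sample
    """
    row = conn.execute(query).fetchone()
    return dict(zip(("total", "only_lat_lon", "with_depth", "with_ph"), row))


conn = make_demo_db()
counts = sparsity_counts(conn)
print(counts)
```

One caveat with this shape of query: a biosample with no attribute rows at all never shows up in `per_sample`, so `total` here means "biosamples with at least one non-empty attribute", not all biosamples. It also counts presence only; a value like "5 sentimeters" still counts as having depth, so unit checking would be a separate step.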

turbomam commented 2 months ago

Idea number 1 is great, but I'm probably not the right person to do it, beyond giving a list of Biosample and SRA IDs and/or accessions.