openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Create sandbox/development versions of datasets #2

Closed · scottgigante closed this issue 1 year ago

scottgigante commented 4 years ago

Datasets should be:

  1. In h5ad format
  2. Preprocessed
  3. Ground truth for evaluation hidden
  4. Documented in the README so that method developers can easily download them and test their methods (see the sketch below).
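
Something like this in the README, as a rough sketch (the filename is a placeholder, not a real figshare link):

```python
# Hypothetical usage snippet for method developers; the filename is a placeholder.
import anndata

# After downloading a sandbox dataset from figshare:
adata = anndata.read_h5ad("sandbox_dataset.h5ad")
print(adata)  # inspect cells x genes, obs/var metadata, and layers
```
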
LuckyMD commented 4 years ago

Should they indeed be preprocessed? I thought that was still up for debate, and that @M0hammadL and @dburkhardt were generally against it?

scottgigante commented 4 years ago

My personal feeling is that the data on figshare should take two forms.

  1. Raw counts and metadata only, for testing and evaluation by this repo. One file per dataset. Preprocessing should be done in the openproblems.data loaders (sketched below) and stored in adata.X, with raw counts in adata.layers["counts"] if people want to get fancy.
  2. Processed for method/metric development. One file per task per dataset. Contains raw and preprocessed data as above, but with any metadata associated with the ground truth (w.r.t. the task) hidden. We're not able to prevent developers from cheating, but we can make not cheating the easiest thing to do.
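
As a rough sketch of that loader convention (the normalization calls are illustrative, not the repo's actual pipeline, and the ground-truth column names are task-specific):

```python
import anndata
import numpy as np
import scanpy as sc

def preprocess(adata: anndata.AnnData) -> anndata.AnnData:
    """Sketch of the convention: raw counts kept in a layer, preprocessed
    values in adata.X. The normalization choice here is illustrative only."""
    adata.layers["counts"] = adata.X.copy()  # preserve raw counts
    sc.pp.normalize_total(adata)             # example normalization
    sc.pp.log1p(adata)
    # For the form-2 (development) file, the task's ground-truth columns
    # would also be dropped from adata.obs before upload.
    return adata

# Toy example: 100 cells x 50 genes of fake counts.
adata = preprocess(anndata.AnnData(np.random.poisson(1.0, (100, 50)).astype(np.float32)))
```
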
LuckyMD commented 4 years ago

Okay... that's possible.

scottgigante commented 4 years ago

We should automate this task in the deployment section of Travis.

openproblems-bio commented 4 years ago

Closes #1

scottgigante commented 4 years ago

Now that the normalization functions are working, I can create a script that builds these each time we release. The only question is where and how we upload them. If figshare only takes 5 GB, then including the normalized data might push us over the limit.
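
For instance, the build script could at least check file sizes before uploading; a minimal sketch, assuming the 5 GB figure is a per-file limit:

```python
import os

FIGSHARE_LIMIT_BYTES = 5 * 1024**3  # the 5 GB limit mentioned above

def check_upload_size(path: str) -> None:
    """Raise if a built .h5ad exceeds the assumed figshare per-file limit."""
    size = os.path.getsize(path)
    if size > FIGSHARE_LIMIT_BYTES:
        raise ValueError(
            f"{path} is {size / 1024**3:.1f} GB, over the 5 GB figshare limit; "
            "consider dropping normalized layers or subsampling."
        )
```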

LuckyMD commented 4 years ago

Great! Sorry it was so difficult to use them... scIB is still being developed to be more user-friendly ^^.

What is the plan for running all the methods for benchmarking? I assume we would do this in a script... Would it work for the script that runs the methods and metrics to just keep the object stored locally, wherever the run happens? Or is the plan to put it on AWS?

scottgigante commented 4 years ago

The evaluation currently takes place on Travis whenever we push a tagged commit, and the result (a markdown file) is committed to master. Since the output is so small, storing it on GitHub isn't a problem, but the data itself doesn't have that advantage.

LuckyMD commented 4 years ago

If it takes place on Travis, wouldn't the server need to be able to keep the object in memory anyway when we're running the methods on it? Then we could also normalize and keep that in memory before running all the methods for benchmarking, no?

scottgigante commented 4 years ago

The problem with size is not memory but storage: figshare has a 5 GB limit.

We could normalize in advance and keep it in memory for testing, but there's no reason to, given that normalization doesn't take long, and doing so would prevent us from testing the latest version of the code should any of the normalizations change.

LuckyMD commented 4 years ago

I mean that we could store the unnormalized data on figshare, then normalize after downloading every time we run the benchmarking script. No storage of normalized data is needed then (except for the sandbox datasets, I guess). Or have we been talking about the sandbox datasets the whole time?
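
i.e. the benchmarking run would look roughly like this; download_raw and normalize are stand-ins for whatever the repo's loaders and normalizers are actually called:

```python
def run_benchmark(dataset_name, methods, metrics):
    """Sketch only: fetch raw counts once, normalize in-process, never store
    normalized data. download_raw and normalize are hypothetical names."""
    adata = download_raw(dataset_name)  # raw counts pulled from figshare
    adata = normalize(adata)            # recomputed on every run, so it always
                                        # matches the current normalization code
    results = {}
    for method in methods:
        prediction = method(adata.copy())
        results[method.__name__] = {m.__name__: m(prediction) for m in metrics}
    return results
```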

scottgigante commented 4 years ago

Yes, I'm exclusively talking about the sandbox datasets here.

LuckyMD commented 4 years ago

Ah, okay. Then I'm less confused. Yes, we need to think more about subsampling then. We would remove the test data anyway, so it would be smaller... and then the question is how big another sparse matrix in adata.layers is.

scottgigante commented 4 years ago

It probably increases the file size by ~100% every time we add another layer. One thing we could consider is only storing the normalizations with complex dependencies (e.g. scran) and allowing folks to compute the others.
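
One way to sanity-check that per-layer estimate is to sum the in-memory bytes of adata.X and each layer in sparse form; a minimal sketch (the on-disk h5ad will be smaller if compressed):

```python
import scipy.sparse

def layer_nbytes(matrix) -> int:
    """Approximate in-memory size of one layer in CSR form, in bytes."""
    m = scipy.sparse.csr_matrix(matrix)
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

def estimate_adata_bytes(adata) -> int:
    """Rough total for adata.X plus all layers; a layer with the same sparsity
    pattern as .X adds about the same size again, hence the ~100% growth."""
    total = layer_nbytes(adata.X)
    for layer in adata.layers:
        total += layer_nbytes(adata.layers[layer])
    return total
```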

LuckyMD commented 4 years ago

I have a hard time believing that a dataset cleaned of .obsm and .obsp entries can be larger than 5 GB with 4 layers (if it's < 1 million cells). But I might be wrong...

scottgigante commented 4 years ago

Mohammed apparently had this issue with a single layer of Muris/Senis.

LuckyMD commented 4 years ago

I'm pretty sure that if you try hard, @M0hammadL, you will get it slimmer 😝