Closed · scottgigante closed this issue 1 year ago
Should they indeed be preprocessed? I thought that was still up for debate, and that @M0hammadL and @dburkhardt were generally against this?
My personal feeling is that the data on figshare should take two forms: the raw data as produced by the opensproblems.data loaders, and a normalized version stored in adata.X with the raw counts kept in adata.layers["counts"], if people want to get fancy.

Okay... that's possible.
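As a minimal sketch of the proposed layout (using a plain dict as a stand-in for an AnnData object, since the exact loader API isn't shown here): normalized values live in `X`, raw counts are preserved in `layers["counts"]`.

```python
import numpy as np
import scipy.sparse as sp

# Stand-in for an AnnData object: a dict mirroring the X / layers fields
counts = sp.random(100, 50, density=0.1, format="csr", random_state=0)
counts.data = np.ceil(counts.data * 10)  # integer-like raw counts

adata = {"X": counts.copy(), "layers": {"counts": counts.copy()}}

# Illustrative log-CPM normalization stored in X; the raw counts stay in
# layers["counts"] so methods that need counts can still recover them.
row_sums = np.asarray(adata["X"].sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0
adata["X"] = sp.diags(1e6 / row_sums) @ adata["X"]
adata["X"].data = np.log1p(adata["X"].data)
```

The same pattern translates directly to a real AnnData object via `adata.layers["counts"] = adata.X.copy()` before normalizing in place.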
We should automate this task in the deploy stage of the Travis configuration.
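A tag-triggered stage in Travis could be sketched roughly like this (the stage name and script path are hypothetical, not the repo's actual config):

```yaml
# hypothetical fragment of .travis.yml; script path is a placeholder
jobs:
  include:
    - stage: deploy
      if: tag IS present
      script: python scripts/build_normalized_datasets.py
```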
Closes #1
Now that the normalization functions are working, I can create a script that builds these each time we release. The only question is where and how we upload them. If figshare only allows 5GB, then including the normalized data might push us over the limit.
Great! Sorry that it was so difficult to use them... scIB is still under development to become more user-friendly ^^.
What is the plan for running all methods for benchmarking? I assume we would do this in a script... Could the script that runs the methods and metrics just keep the object stored locally wherever the run happens? Or is the plan to put it on AWS?
The evaluation currently takes place on Travis whenever we push a tagged commit. It commits the result (a markdown file) to master. Since the output is so small it's not a problem for storage on Github, but the data itself does not have that advantage.
If it takes place on Travis, wouldn't the server need to be able to keep the object in memory anyway when we're running the methods on it? Then we could also normalize and keep that in memory before running all the methods for benchmarking, no?
The problem with size is not memory but storage -- figshare has a 5GB limit.
We could normalize in advance and keep the result in memory for testing, but there's no reason to do so: normalization doesn't take long, and pre-normalizing would prevent us from testing the latest version of the code should any of the normalizations change.
I mean that we could store the unnormalized data on figshare, then normalize after downloading every time we run the benchmarking script. No storage of normalized data would be needed then (except for the sandbox datasets, I guess). Or have we been talking about the sandbox datasets the whole time?
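The "store raw, normalize on the fly" workflow described above can be sketched like this (`download_raw` and `log_cpm` are hypothetical stand-ins, not the actual opensproblems functions):

```python
import numpy as np

def download_raw():
    # placeholder for fetching the unnormalized counts from figshare
    rng = np.random.default_rng(0)
    return rng.poisson(2.0, size=(20, 10)).astype(float)

def log_cpm(X):
    # simple library-size normalization followed by log1p
    sums = X.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return np.log1p(X / sums * 1e6)

def run_benchmark():
    X = download_raw()   # only raw counts live in remote storage
    return log_cpm(X)    # normalization recomputed on every run

X_norm = run_benchmark()
```

Nothing normalized ever needs to be uploaded; the cost is recomputing the normalization on each benchmark run.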
Yes, I'm exclusively talking about the sandbox datasets here.
Ah, okay. Then I'm less confused. Yes, we need to think more about subsampling then. We would remove the test data anyway, so it would be smaller... and then the question is how big another sparse matrix in adata.layers is.
It probably increases the file size by ~100% every time we add another layer. One thing we could consider is storing only the normalizations with complex dependencies (e.g. scran) and letting folks compute the others themselves.
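A quick back-of-envelope check of the per-layer cost: a CSR layer stores `data`, `indices`, and `indptr` arrays, so each extra layer with the same sparsity pattern adds roughly the same number of bytes again (sizes below are for a toy matrix, not a real dataset):

```python
import numpy as np
import scipy.sparse as sp

def csr_nbytes(m):
    """Bytes held by a CSR matrix's data, indices, and indptr arrays."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# toy matrix; real datasets would have far more cells and genes
X = sp.random(1000, 2000, density=0.05, format="csr", dtype=np.float32)

per_layer = csr_nbytes(X)
# if each normalization yields a layer with the same sparsity pattern,
# in-memory (and roughly on-disk) size grows linearly with layer count
total_four_layers = 4 * per_layer
```

With float32 data and int32 indices that is about 8 bytes per nonzero per layer, which is why each added layer roughly doubles the single-layer file size.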
I have a hard time believing that a dataset cleaned of .obsm and .obsp entries can be larger than 5GB with 4 layers (if it's < 1 million cells). But I might be wrong...
Mohammed apparently had this issue with a single layer of Muris/Senis.
I'm pretty sure that if you try hard, @M0hammadL, you will get it slimmer 😝
Datasets should be