giovp / spatialdata-sandbox

Mcmicro #6

Closed: melonora closed this issue 1 year ago

melonora commented 2 years ago

If required, I can also just store the image data in the h5ad and then ensure automatic deletion of files that are no longer required.
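For illustration, a minimal sketch of what that could look like, assuming the image is already loaded as a NumPy array and using anndata; the paths and shapes below are placeholders:

```python
import os

import anndata as ad
import numpy as np

# Hypothetical paths/arrays for illustration only.
tmp_tif = "downloads/registration.ome.tif"  # large intermediate download
img = np.zeros((4, 512, 512), dtype=np.uint16)  # stand-in for the loaded image

# Store the image alongside the table in the .h5ad (uns holds unstructured data).
adata = ad.AnnData(X=np.zeros((10, 4)))
adata.uns["image"] = img
adata.write_h5ad("mcmicro.h5ad")

# Remove the intermediate file once its contents live in the .h5ad.
if os.path.exists(tmp_tif):
    os.remove(tmp_tif)
```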

giovp commented 2 years ago

@melonora thanks for contributing this dataset. I think it's way too big for our purpose. Is there a way to get just a couple of slides for testing purposes, instead of the full dataset?

melonora commented 2 years ago

The main files taking up storage are the registration ome.tif files, specifically for the CyCIF and mIHC data. If we go for just the CODEX data, this would still be an 11 GB file. It is a multi-channel ome.tif, so it would have to be downloaded first. If it is no problem to initially download the data, I can subset it or convert it to an image format that takes up less space.
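A minimal sketch of the download-then-subset step with tifffile, assuming a single-resolution multi-channel OME-TIFF; the channel and crop indices are placeholders:

```python
import tifffile
import zarr

src = "codex_registration.ome.tif"  # the downloaded multi-channel image (path illustrative)

# Open the OME-TIFF through its Zarr interface: pixel data is read lazily,
# so only the requested channels/region are pulled from disk.
store = tifffile.imread(src, aszarr=True)
arr = zarr.open(store, mode="r")

# Keep a few channels and a spatial crop (indices are placeholders).
subset = arr[0:4, 0:2048, 0:2048]

tifffile.imwrite("codex_subset.ome.tif", subset, photometric="minisblack")
store.close()
```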

giovp commented 2 years ago

That's a good point. I think it'd be optimal to share a resized/small chunk of the dataset, so pre-processing would be useful. Would you be able to share it via some cloud storage? Otherwise, what does @LucaMarconato think?

giovp commented 2 years ago

Or maybe worth starting with CODEX first? Saw your comment here, maybe it's smaller/easier to share? https://github.com/giovp/spatialdata-sandbox/issues/1#issuecomment-1166993195

LucaMarconato commented 2 years ago

I think 11 GB is a good size: not too big, but still providing a real use case for lazy loading and Dask processing. But I would also write a script that produces a smaller version of the data. If this gets complex, we can wait and write this script using the SpatialData APIs as soon as they are ready.

Re the storage: if you can download only the 11 GB dataset, I would not use cloud storage; otherwise yes (and in that case, if the code is not too verbose, I would directly upload a small version).
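A minimal sketch of what the lazy-loading/Dask route plus a small-version script could look like; the file name, downsampling stride, and reduction are all illustrative:

```python
import dask.array as da
import tifffile
import zarr

# Lazily wrap the on-disk image: nothing is loaded until it is sliced/computed.
store = tifffile.imread("codex_registration.ome.tif", aszarr=True)
arr = da.from_zarr(zarr.open(store, mode="r"))

# Example of out-of-core processing: a per-channel mean over the full image.
channel_means = arr.mean(axis=(1, 2)).compute()

# A "small version" script could simply persist a downsampled copy.
small = arr[:, ::4, ::4]  # 4x spatial downsampling by striding (illustrative)
tifffile.imwrite("codex_small.ome.tif", small.compute(), photometric="minisblack")
```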

melonora commented 2 years ago

Downloading just the CODEX dataset is possible. I will adjust the code.

melonora commented 2 years ago

Only the CODEX dataset is downloaded now when you run the CLI. I am currently checking the size of the TMA data, as I believe it is also fairly large but would be easier to subset.
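A minimal sketch of the CLI shape, with a placeholder URL standing in for the real CODEX file host:

```python
import argparse
import urllib.request

# Placeholder URL: the real CODEX registration file lives at the dataset's host.
CODEX_URL = "https://example.org/mcmicro/codex_registration.ome.tif"


def main() -> None:
    parser = argparse.ArgumentParser(description="Download the mcmicro CODEX subset")
    parser.add_argument("--out", default="codex_registration.ome.tif")
    args = parser.parse_args()
    # Only the CODEX dataset is fetched; the CyCIF/mIHC files are skipped.
    urllib.request.urlretrieve(CODEX_URL, args.out)


if __name__ == "__main__":
    main()
```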

giovp commented 2 years ago

@melonora what's the status on this? Also, can you please check any of the mibitof, merfish or nanostring datasets to see how the folder structure and Python scripts should look? It'd also be great if you could take a stab at writing the write_zarr.py file for writing the data according to the current ngff/spatialdata specs. Happy to help out with that.

Let me know!
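A minimal sketch of what write_zarr.py could look like using the generic ome-zarr-py writer, assuming a (c, y, x) image array; the exact layout required by the spatialdata spec may differ:

```python
import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

# Stand-in for the registered multi-channel image (c, y, x).
img = np.zeros((4, 1024, 1024), dtype=np.uint16)

# Create an NGFF-style Zarr store and write the image as a multiscale pyramid.
store = parse_url("mcmicro.zarr", mode="w").store
root = zarr.group(store=store)
write_image(image=img, group=root, axes="cyx")
```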

melonora commented 2 years ago

The TMA data does not provide any modality other than CODEX. It also does not provide the raw data, but rather data that is already dearrayed (one tif per array). Since I did not see extra value in this data over the data that is already there, I left it out.

I will get started on the zarr script.

melonora commented 2 years ago

Let me know if you agree. If not, I can still include the TMA data.

melonora commented 2 years ago

Regarding the Cellzome data, an MTA is being drafted that will be sent to Fabian, among others. Once signed, I should be able to share the ISS data and post-ISS mIF data.

giovp commented 2 years ago

> Since I did not see extra value in this data over the data that is already there, I left it out. I will get started on the zarr script.

I think it makes sense, thank you! I think the overall scope is to include as many techs as possible, even if they are redundant in terms of "diversity" (e.g. FISH-based methods are all alike, as are mIHC-based ones, etc.).

Another important thing (it didn't come across in my previous messages) is that the data should be storable/manageable on a laptop, so anything > 5 GB should be filtered/cropped for ease of use.

Thanks again!

giovp commented 1 year ago

Hi @melonora, we have an mcmicro example now merged in main. Is this still needed, or can it be closed? Not sure if it's the same example; we took it from the CI in the mcmicro GitHub repo.

melonora commented 1 year ago

This one can be closed.