Missing `scripts/data_preparation/`

theislab / ncem_benchmarks

BSD 3-Clause "New" or "Revised" License


Missing `scripts/data_preparation/` #1

Closed: hiraksarkar closed this issue 2 years ago

hiraksarkar commented 3 years ago

Hi,

I am not able to find the `data_preparation` directory described in the README. It would be really helpful if it could be added.

Thanks

AnnaChristina commented 3 years ago

Hi @hiraksarkar,

Thanks for mentioning this! Please find the directory for data preparation here.

I added the respective notebook for the MERFISH - brain dataset.

We moved data preparation to notebooks, as this is currently only required for the MERFISH - brain dataset. For the remaining datasets no preprocessing was applied. Happy to link you to the original datasets or assist further if needed.

hiraksarkar commented 3 years ago

I left a comment about the MERFISH dataset in the tutorial issue (unfortunately, it seems we cannot run NCEM without the CSV). For the other datasets, it's not entirely clear to me which files should be downloaded and how to provide them to NCEM. Apologies if there is already a tutorial that I have missed.

For example, for the "CODEX cancer" data, the original paper leads to a dataset directory that contains around 2 TB of processed images. Should I download that? Given that directory, I am not sure how to call NCEM on it. The MERFISH data should be more straightforward, provided I can obtain the metadata.csv file.

Thanks

AnnaChristina commented 3 years ago

Hi @hiraksarkar, I added more detailed instructions to the README of ncem_tutorials and ncem_benchmarks on how to access the public datasets.

For the CODEX cancer dataset, only the single-cell data is required to test ncem. It is stored in a separate directory of 213 MB, so there is no need to download the 2 TB of images.

To run the tutorials or data exploration notebooks, simply store the files in a directory of your choice and adjust `datadir` in the respective notebooks. We stored the datasets in folders named after the first author. If you follow a different convention, also adjust `data_path` wherever ncem's `get_data` function is called.
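As a rough illustration of the folder convention described above, here is a minimal sketch that builds a per-dataset path from a base data directory and a first-author folder name. The helper `dataset_dir` and the folder name `zheng` in the usage note are hypothetical and not part of ncem; the sketch only uses `pathlib` and does not call ncem itself.

```python
from pathlib import Path


def dataset_dir(datadir, first_author):
    """Return <datadir>/<first_author> and fail early if it does not exist.

    Hypothetical helper, not part of ncem. It only mirrors the
    "one folder per first author" convention mentioned in this thread.
    """
    path = Path(datadir).expanduser() / first_author
    if not path.is_dir():
        raise FileNotFoundError(f"expected a dataset folder at {path}")
    return path
```

For example, `dataset_dir("~/data", "zheng")` would point at `~/data/zheng` if that folder exists. How the resulting path has to be passed on (for instance as a string, or with a trailing separator) depends on the notebook, so check the `datadir` and `data_path` cells there.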

Thanks for your comments. We are still improving the usability of ncem, so any feedback is highly appreciated.

hiraksarkar commented 3 years ago

Hi @AnnaChristina, amazing, I really appreciate the help and am trying this now. I have some questions about the manuscript. Could I just email them to you, if that's possible? Again, thanks a lot for helping.

AnnaChristina commented 3 years ago

Sure! Please feel free to raise additional issues whenever needed. Yes, just email me and @davidsebfischer. Happy to discuss and answer!

hiraksarkar commented 3 years ago

Hi @AnnaChristina,

Just wanted to mention that the dataloader assumes scMEP_MIBI_singlecell.csv is inside a directory named scMEP_MIBI_singlecell, not inside the zipped file.

Apart from that, it is working fine, and I am using the same variable values as those given in the Zheng tutorial.
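For anyone hitting the same layout issue, here is a minimal sketch of unpacking a downloaded archive into a folder named after it, so that the CSV ends up inside a `scMEP_MIBI_singlecell` directory as described above. The archive name `scMEP_MIBI_singlecell.zip` and the helper `unpack_next_to_archive` are assumptions for illustration and not part of ncem. The sketch also flattens any folders inside the archive, which is only safe if file names inside it are unique.

```python
import zipfile
from pathlib import Path


def unpack_next_to_archive(zip_path):
    """Extract all files of a zip into a sibling folder named after the archive.

    Hypothetical helper, not part of ncem. Example (assumed archive name):
    data/scMEP_MIBI_singlecell.zip -> data/scMEP_MIBI_singlecell/<files>
    """
    zip_path = Path(zip_path)
    target = zip_path.with_suffix("")  # drop the ".zip" suffix
    target.mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # Flatten: keep only the file name, ignoring folders in the archive.
            file_name = Path(member.filename).name
            (target / file_name).write_bytes(archive.read(member))
    return target
```

After running this on the downloaded archive, the folder that contains the new `scMEP_MIBI_singlecell` directory is the one the notebook paths should resolve to.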

AnnaChristina commented 3 years ago

Hi @hiraksarkar,

Yes. The dataloader expects certain folder structures. We will enhance this in the future.

Great. You can basically run the tutorial for each dataset with similar parameters. For more explanation on how to set specific parameters (e.g. radius), you can check the data exploration notebooks and the manuscript.