kharchenkolab / Baysor

Bayesian Segmentation of Spatial Transcriptomics Data
https://kharchenkolab.github.io/Baysor/
MIT License

Memory usage with huge datasets #87

Open maximilian-heeg opened 1 year ago

maximilian-heeg commented 1 year ago

Hi,

Thank you so much for providing Baysor. I recently updated my installation to version 0.6.2, and it is running great with Julia 1.9.

In our lab, we have recently generated new (huge) spatial datasets with up to 250 million transcripts (using a 500-gene panel), and we are planning to use Baysor for cell segmentation. I expected this to require a lot of memory, so I did some benchmarking with smaller FOVs of the dataset (see below).

[Plot: memory usage vs. number of transcripts for the benchmarked FOVs]

It seems that memory usage scales linearly with the number of transcripts. Extrapolating from this, I would expect our dataset with 250 million transcripts to require approximately 5-6 TB of memory (which I unfortunately don't even have on our HPC).
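For reference, here is roughly how I did the extrapolation (a minimal Julia sketch; the benchmark numbers below are hypothetical placeholders, not my actual measurements):

```julia
using Statistics

# Hypothetical benchmark points: transcripts per FOV vs. peak memory in GB.
n_tx   = [1e6, 5e6, 10e6, 25e6]
mem_gb = [23.0, 118.0, 236.0, 590.0]

# Ordinary least-squares fit: mem_gb ≈ a * n_tx + b
a = cov(n_tx, mem_gb) / var(n_tx)
b = mean(mem_gb) - a * mean(n_tx)

# Extrapolate to the full dataset of 250 million transcripts.
est_tb = (a * 250e6 + b) / 1024
println("Estimated peak memory: ≈ $(round(est_tb; digits=1)) TB")
```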

Are there any solutions to this? Is there an easy way to create smaller tiles and stitch them back together? I think that, with the increasing panel sizes and imaging areas of commercial platforms, this might soon become an important limitation for many users.

Any help/ideas/suggestions are greatly appreciated.

Max

sebgoti commented 1 year ago

A somewhat unrelated question, but may I ask @maximilian-heeg how your lab runs Baysor on the HPC? I am trying to run it with Singularity to avoid installing things at the HPC level, so far without luck (even though the Docker container works). Thanks, and sorry for any spam on your issue!

VPetukhov commented 1 year ago

@maximilian-heeg , thank you for this test! It's indeed a problem. We're working on memory optimizations for v0.7.0, and if they work as expected, they should drastically reduce memory usage (roughly 10-fold).

As for tiling, we also plan to add the graph-cut idea, but it's not there yet. So the only thing you can do at the moment is to manually split the data by FOVs.
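If your FOVs are themselves too large, a hypothetical alternative is to cut overlapping spatial tiles yourself. A minimal sketch (the x/y column names, tile size, and margin are all assumptions about your data, not something Baysor provides):

```julia
using CSV, DataFrames  # assumes CSV.jl can read the gzipped file directly

df = CSV.read("transcripts.csv.gz", DataFrame)  # assumes x and y coordinate columns

tile   = 2000.0  # tile side length, in the same units as x/y (assumption)
margin = 50.0    # overlap of roughly one cell diameter, so border cells
                 # are fully contained in at least one tile (assumption)

for i in 0:floor(Int, maximum(df.x) / tile), j in 0:floor(Int, maximum(df.y) / tile)
    sub = df[(df.x .>= i * tile - margin) .& (df.x .< (i + 1) * tile + margin) .&
             (df.y .>= j * tile - margin) .& (df.y .< (j + 1) * tile + margin), :]
    isempty(sub) || CSV.write("tile_$(i)_$(j).csv", sub)
end
```

Note that stitching the per-tile segmentations back together (deduplicating cells in the overlap regions) would still have to be done manually.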

VPetukhov commented 1 year ago

@sebgoti , the short answer: I haven't tried Baysor with Singularity. We have our own lab servers, which are just big standalone machines, so no clusters. If you need input on your situation, I'd be happy to continue the discussion in a separate issue.

maximilian-heeg commented 1 year ago

@sebgoti I tried to run the Docker container with Singularity, but that did not work for me on the HPC. I ended up installing juliaup in a conda environment and then building Baysor as described in the README. Good luck!

@VPetukhov Thank you so much for the answer and for your work on this. For us, getting a good segmentation is currently the bottleneck in processing spatial data. I will try to split the data into multiple FOVs, probably along the lines of the sketch below.
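In case it's useful to others, this is a minimal sketch of what I have in mind (assuming the transcripts file has a fov_name column identifying the FOV of each molecule):

```julia
using CSV, DataFrames

# Split the transcripts table into one CSV per FOV; Baysor can then be
# run on each file separately.
df = CSV.read("transcripts.csv.gz", DataFrame)
for sub in groupby(df, :fov_name)
    CSV.write("transcripts_fov_$(sub.fov_name[1]).csv", sub)
end
```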

cbiagii commented 1 year ago

@VPetukhov, do you mean splitting the data by FOVs using the fov_name column of the transcripts.csv.gz file?

mjleone commented 8 months ago

@VPetukhov Hello, members of my lab and I are also very curious whether the new release is still in progress, and about the expected release date if you know it. We are working with datasets of 10 to 25 million transcripts, and it is hard to use the current version with our resources.