kharchenkolab / Baysor

Bayesian Segmentation of Spatial Transcriptomics Data
https://kharchenkolab.github.io/Baysor/
MIT License

About Multithreading / Parallelization #25

Closed · arundasan91 · closed 1 month ago

arundasan91 commented 2 years ago

Hello!

Thank you very much for open-sourcing Baysor! This will help the scientific community for years to come.

I was running Baysor on our in-house datasets and found that it takes a significant amount of time to run (hours). I also noticed that Baysor doesn't do parallel processing by default; it always uses a single thread. Is there a way to parallelize the job?

I saw that in an old commit you had JULIA_NUM_THREADS=10; baysor run -n $JULIA_NUM_THREADS as an option. I was wondering if this is still valid. Do you have plans to include parallelism in the pipeline, or am I missing something in the API? https://github.com/kharchenkolab/Baysor/tree/0f08daedfb1b909660df278eebe9de650954ff0c

Thank you, Arun

arundasan91 commented 2 years ago

I saw this from your Docker repo: https://hub.docker.com/r/vpetukhov/baysor

n-frames determines the parallelization level. To allow multi-threaded processing of the data, the whole space is split into n_frames parts, which contain approximately equal numbers of molecules. Results can be a bit inaccurate on the borders of the frames, so it's better to avoid very large values.
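
Conceptually, the split described there would look something like the sketch below (an illustration only, not Baysor's actual code):

```julia
# Illustration of the n-frames idea: cut the molecules into `n_frames`
# groups of roughly equal size along one coordinate, so that each frame
# can be processed independently.
function split_into_frames(xs::Vector{Float64}, n_frames::Int)
    order = sortperm(xs)  # molecule indices sorted by x-coordinate
    frame_size = ceil(Int, length(xs) / n_frames)
    return [order[i:min(i + frame_size - 1, length(order))]
            for i in 1:frame_size:length(order)]
end

frames = split_into_frames(rand(100_000), 10)  # ~10k molecules per frame
```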

VPetukhov commented 2 years ago

Hello Arun! That's a great question. I'd love to add parallelism, but at the moment I don't see a good way to do it. Previously I used Julia multiprocessing, which essentially split the dataset into rectangles and processed them independently. It created artifacts around the rectangle borders, but scaled really well. I removed it because you can still get the same effect by manually splitting your data, while it isn't possible to build transferable binaries with multiprocessing code in Julia. A better option would be Julia multithreading, but it is still experimental, and I don't use it because garbage collection is still single-threaded, which killed scalability in my tests. It should be possible to make the code 2-3 times faster that way, but probably not more, so I didn't bother.
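
To make the constraint concrete, the multithreaded variant would look roughly like the sketch below, where `segment_frame` stands in for a hypothetical per-frame worker; it is exactly this kind of allocation-heavy loop that the single-threaded GC throttles:

```julia
using Base.Threads  # start Julia with, e.g., JULIA_NUM_THREADS=8

# Sketch of the multithreaded option: process each frame on its own thread.
# `segment_frame` is a hypothetical per-frame segmentation routine.
function segment_all(frames)
    results = Vector{Any}(undef, length(frames))
    @threads for i in eachindex(frames)
        # Allocation-heavy work here hits the single-threaded GC,
        # which is what caps the speedup at ~2-3x in practice.
        results[i] = segment_frame(frames[i])
    end
    return results
end
```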

As a result, I'm waiting for a Julia release with better multithreading, or for some inspiration on how to make it work within the current constraints. Maybe at some point I'll give up and rewrite the most performance-sensitive parts in C++, but I'd rather avoid that. :) If this issue gets a lot of attention, I'll reconsider the priorities and focus on this part, but so far I can't promise fixing it soon.

arundasan91 commented 2 years ago

Thank you Viktor for the reply!

Data parallelism, especially in segmentation projects, does create artifacts around the patch borders. That is something I found troublesome as well. What do you think about a sliding window approach with overlap? Do you think something like that would reduce the artifacts at least a little?

Julia is completely new to me; I am actually struggling with it. C++ or Python would have been amazing to see, for speed and integrations with deep learning frameworks. But I do understand that this is a huge undertaking. Thanks a lot for considering parallelism if enough interest shows up, and I wish you the very best on your PhD!

Cheers!

VPetukhov commented 2 years ago

Thanks for all the kind words!

What do you think about a sliding window approach with overlap?

I guess it could. Do you have any papers in mind on how to do the aggregation? Or I could just try something naive; it should be better than nothing anyway.

C++ or Python would have been amazing to see, for speed and integrations with deep learning frameworks

Can you please elaborate on the deep learning frameworks part? So far my assumption was that all such integrations can be done simply by saving the required information after the algorithm is done. If there is something like that which you need, I can just add it to the Julia code. Or are you thinking about embedding new methods into Baysor intermediates? That would be very helpful to know. And in any case, it should be straightforward to call Julia functions from Python, and I'm going to add a demo notebook for that at some point.

arundasan91 commented 2 years ago

I guess it could. Do you have any papers in mind on how to do the aggregation? Or I could just try something naive; it should be better than nothing anyway.

A naive approach would be to stride by half of the patch size, then combine the results via some sort of voting. There is a paper in PLOS ONE which improves patch-based image segmentation: https://doi.org/10.1371/journal.pone.0229839
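
As a minimal sketch of that half-stride voting (assuming labels are comparable across windows; real per-window cell IDs would first need to be matched between overlapping windows, and `label_window` is a hypothetical segmenter):

```julia
using StatsBase: mode  # most frequent element

# Naive half-stride voting sketch. `label_window(win)` is a hypothetical
# segmenter returning one label per molecule index in `win`.
function vote_labels(n_molecules::Int, w::Int)
    # Windows of size `w` advancing by `w ÷ 2`, so interior molecules
    # are labelled twice and can be resolved by voting.
    windows = [i:min(i + w - 1, n_molecules) for i in 1:(w ÷ 2):n_molecules]
    votes = [Int[] for _ in 1:n_molecules]
    for win in windows
        labels = label_window(win)
        for (j, i) in enumerate(win)
            push!(votes[i], labels[j])
        end
    end
    return [mode(v) for v in votes]  # majority vote per molecule
end
```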

Can you please elaborate on the deep learning frameworks part? So far my assumption was that all such integrations can be done simply by saving the required information after the algorithm is done. If there is something like that which you need, I can just add it to the Julia code.

Maybe it was my lack of knowledge of Julia that prompted that comment. I was thinking about an integration with, say, PyTorch, to improve Baysor or compare it with other deep learning algorithms. You're right, most of those comparisons can be done after Baysor runs, so no need for hard integrations.

Or are you thinking about embedding new methods into Baysor intermediates? That would be very helpful to know. And in any case, it should be straightforward to call Julia functions from Python, and I'm going to add a demo notebook for that at some point.

Any new notebooks on working with custom datasets and integrating with Python would be really great. I look forward to it.

VPetukhov commented 2 years ago

Thanks for the answers! I'll keep you updated here.

pander21 commented 2 years ago

Hello @VPetukhov,

I was wondering if there's any progress on parallelization, or on making the algorithm computationally more efficient? We now have datasets with tens to hundreds of millions of transcripts, for which Baysor takes a prohibitively long time on the full dataset.

Thank you! Peter.

quentinblampey commented 1 year ago

Hello @VPetukhov, do you have any update on this? I'm also running on very large images (about 1000M transcripts per image), and, as said above, the slicing workaround creates artifacts.

What do you think about the following steps:

Do you think it's a good idea? If so, do you think it can be implemented? Unfortunately, I don't know Julia very well, so for now I'm working on the graph-cut part in Python.

VPetukhov commented 1 year ago

Hello @quentinblampey, I'm currently working on a big update inspired by the 10x Xenium release. There will be a bunch of improvements, including a lot of memory and performance optimizations. Among other things, I'm almost ready with a Python wrapper, so you'd be able to split the data any way you want. As for the graph-cut method you suggest, I thought about this approach, and if nothing else works I will implement it. The problem is that it doesn't provide any guarantees: it will work on sparser cell populations, but for dense datasets it will have approximately as many artifacts as rectangular cuts.

Can I also ask for your feedback on an unrelated topic? I'm currently puzzled about how to do plotting for huge datasets. We clearly aren't able to show all molecules in an HTML report the way we do now. Do you even need that, and if you do, what would be the best way for you? I can see a few options:

  1. Show only a small region of the dataset, ~1k-10k cells
  2. Create a huge tiff and suggest using ImageJ
  3. Reference tools like vitessce and provide a quick example allowing users to make these plots themselves
  4. Keep plotting functionality only within a Python wrapper (i.e., leave it to the user)

quentinblampey commented 1 year ago

Hello @VPetukhov, thanks for your answer, I'm very excited to try the new release!

To answer your question: actually, we have software dedicated to visualisation, so I'm not using the plots from Baysor. I prefer having a faster script that generates no plots at the end, because I can check the segmentation in that software anyway.

But if I wanted this, my opinion is that it would be better to run the script with no plots, and then add the possibility to use an API to look at small regions we are interested in (without having to re-run the script), with some optional parameters like a gene list to focus on (sometimes a few genes like EPCAM / PTPRC can provide good insights). Or maybe vitessce, but I've never tried it before.

VPetukhov commented 1 year ago

Hi @quentinblampey, thank you for your feedback! The new release is out. It fixes a problem with memory consumption on large datasets, and also introduces some basic parallelism, though it's mostly multi-threading in utility functions. And I'm still working on Python wrappers; the results will be in baysorpy.

quentinblampey commented 1 year ago

Thanks for the update and the release @VPetukhov, I'll try that out!

quentinblampey commented 9 months ago

Hello @VPetukhov, I recently released a pipeline (https://github.com/gustaveroussy/sopa) for all image-based spatial omics, whose segmentation can be based on Baysor.

It is very memory efficient and scales to large datasets (see here), because I split the data into overlapping tiles and then merge the results. The merging process is quite simple, but it appears to work quite well. Don't hesitate to let me know if it seems interesting to you!

(I also mention @arundasan91 and @pander21, who were interested in this)

VPetukhov commented 1 month ago

v0.7.0 improves parallelization. Now, one of the main remaining bottlenecks is Julia's single-threaded garbage collector, which is also improving with new Julia versions. So I'm closing the issue: the algorithm is now parallel, and it will gradually improve.
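
For anyone running it: the parallel code paths only help if Julia itself starts with multiple threads (the exact Baysor CLI flags may vary between versions):

```julia
using Base.Threads

# Julia threads are controlled by the JULIA_NUM_THREADS environment
# variable or the `julia -t N` flag. A quick sanity check:
println("running with $(nthreads()) threads")
```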

Also, thanks to @quentinblampey for his work on scaling the pipeline!