soft-matter / trackpy

Python particle tracking toolkit
http://soft-matter.github.io/trackpy

Parallel batch #304

Closed: apiszcz closed this issue 6 years ago

apiszcz commented 8 years ago

The only reference I could find for a parallel batch mode is the following (title only, no document). Is there a reason why multiprocessing or a jobs option is not part of this capability?

nkeim commented 8 years ago

What kind of API would this entail? Do-it-yourself parallelization is already pretty easy, as shown at the end of the walkthrough notebook.

I can see the value in having something like this baked into batch(), as long as it is cross-platform and requires no additional setup.

There is already some movement toward making parallelization support more official. See #286.
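
A minimal do-it-yourself sketch in that spirit, using only the standard-library multiprocessing module (the file pattern and locate parameters below are placeholders, not values from this thread):

```python
import multiprocessing

import pandas as pd
import pims
import trackpy as tp


def locate_frame(args):
    # Unpack the frame index and image so the results can be labeled.
    frame_no, image = args
    features = tp.locate(image, diameter=11)  # placeholder parameters
    features['frame'] = frame_no
    return features


if __name__ == '__main__':
    frames = pims.open('images/*.png')  # placeholder path
    with multiprocessing.Pool() as pool:
        per_frame = pool.map(locate_frame, enumerate(frames))
    features = pd.concat(per_frame, ignore_index=True)
```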

apiszcz commented 8 years ago

The easiest might be multiprocessing, and then allow the user to set ncpus if they choose. This seems like a relatively painless way to achieve some speedup. It should be cross-platform, and the chunksize default could be images/ncpus. On #286, I'll stand by and use/test it when ready.
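
A hypothetical sketch of that proposal (the name `parallel_batch`, its signature, and the parameters are illustrative only, not part of trackpy):

```python
import functools
import multiprocessing

import pandas as pd
import trackpy as tp


def parallel_batch(frames, diameter, ncpus=None, **kwargs):
    # Hypothetical batch() variant: farm locate() out over ncpus workers.
    frames = list(frames)
    ncpus = ncpus or multiprocessing.cpu_count()
    # Default chunksize: number of images divided by ncpus, as proposed above.
    chunksize = max(1, len(frames) // ncpus)
    locate = functools.partial(tp.locate, diameter=diameter, **kwargs)
    with multiprocessing.Pool(ncpus) as pool:
        results = pool.map(locate, frames, chunksize=chunksize)
    for frame_no, f in enumerate(results):
        f['frame'] = frame_no
    return pd.concat(results, ignore_index=True)
```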

danielballan commented 8 years ago

I would rather see us adopt more modern tooling. I have limited experience with IPython.parallel -- no sophisticated use cases under my belt -- but my general understanding is that it is essentially multiprocessing with a better API.

apiszcz commented 8 years ago

As long as whatever is chosen works without a lot of 'extra' installation/setup. We really only need basic dispatch/map and merge, which should be doable a few ways; multiprocessing is always there. Ideally it would support SMP on the same host and possibly a network of hosts. I am concerned about having a lot of extra requirements for this capability. https://wiki.python.org/moin/ParallelProcessing

apiszcz commented 8 years ago

So the reason I'm interested in this may be my lack of understanding of how to squeeze the most out of the parameter combinations for locate/batch (still learning). I have a black background and 2 to 3 white objects, and batch appears to process a few ~900x500-pixel images per second. Is this reasonable? I thought it would take less time with fewer objects to detect.

df = tp.batch(frames[0:500], diameter=19, minmass=100, noise_size=9, smoothing_size=21, threshold=1)

nkeim commented 8 years ago

@apiszcz That does not seem unusually slow.

The nice thing about multiprocessing is that it "just works" on modern systems. So we really could add an ncpu option to batch(). That's very appealing! It just makes it faster.

IPython gives you higher performance and a ton of other features (including network support) for free. In that case, the optional parameter would be an IPython load-balanced view. It's extra work for the user. But it's how scientists should be doing most of their parallelization. (OTOH, as you show, rolling your own in the notebook is trivial once you understand what you're doing.)

I'm torn. This may be an issue decided by the first good PR that solves it.
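
A rough sketch of the second option, assuming a running IPython cluster (e.g. started with `ipcluster start`) with trackpy importable on every engine; the diameter value is a placeholder:

```python
import functools

import pandas as pd
import trackpy as tp
from ipyparallel import Client  # lived in IPython.parallel in older releases

rc = Client()                    # connect to the running cluster
view = rc.load_balanced_view()   # the load-balanced view mentioned above

locate = functools.partial(tp.locate, diameter=11)   # placeholder parameters
results = view.map(locate, frames, block=True)       # frames: existing sequence
features = pd.concat(list(results), ignore_index=True)
```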

nkeim commented 8 years ago

@apiszcz To give a more detailed response: you should run a profiler.

Now that I think about it, when you have that few features the default value of percentile is probably inappropriate.
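
A quick sketch of both suggestions: profile a single locate() call, then try a higher percentile (the parameter values just echo the batch() call quoted earlier):

```python
import cProfile

import trackpy as tp

image = frames[0]  # one representative frame from an existing image sequence

# Where does the time go for a single frame?
cProfile.run("tp.locate(image, diameter=19, minmass=100, noise_size=9, "
             "smoothing_size=21, threshold=1)", sort='cumtime')

# With only a handful of bright features, a higher percentile discards more
# candidate local maxima before the (comparatively expensive) refinement step.
f = tp.locate(image, diameter=19, minmass=100, noise_size=9,
              smoothing_size=21, threshold=1, percentile=95)
```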

tacaswell commented 8 years ago

Another path to take here is to get hold of an MKL-compiled version of numpy/scipy, which (I think) will automatically parallelize the numpy/scipy operations under the hood.

We should probably also look into using dask, either for the low-level image-processing operations or at the higher frame level.
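
At the frame level, a dask sketch could be as simple as wrapping locate() in dask.delayed (the parameters and scheduler choice are placeholders, and frame numbers would still need to be attached, as batch() does):

```python
import dask
import pandas as pd
import trackpy as tp

# frames: an existing pims sequence; build one lazy locate() task per frame.
tasks = [dask.delayed(tp.locate)(frame, diameter=11) for frame in frames]
results = dask.compute(*tasks, scheduler='processes')  # or 'threads'
features = pd.concat(results, ignore_index=True)
```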

The bottleneck in this operation changes depending on both the image size and the number of features. If you only have a few features, the bottleneck is almost certainly in the bandpass and dilation steps (if you have 10k particles, the bottleneck moves to the localization/refinement).

If you only have a few features, it might be worth writing a version of batch that chops out sub-regions of your images to contain only the neighborhood around the particles.

I would also suggest making sure you actually zero out your black background; that will help with local-maximum finding.
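
A sketch of the sub-region idea above: crop a window around each approximately known position before calling locate(), then shift the coordinates back (the function name, window size, and parameters are illustrative):

```python
import pandas as pd
import trackpy as tp


def locate_near(image, approx_positions, half_width=50, **kwargs):
    # Run locate() only on windows around approximate (y, x) positions.
    pieces = []
    for y0, x0 in approx_positions:
        ylo, xlo = max(0, int(y0) - half_width), max(0, int(x0) - half_width)
        crop = image[ylo:int(y0) + half_width, xlo:int(x0) + half_width]
        f = tp.locate(crop, **kwargs)
        # Shift coordinates back into the full-image frame of reference.
        f['y'] += ylo
        f['x'] += xlo
        pieces.append(f)
    return pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()


# e.g. seeded from the previous frame's results:
# features = locate_near(frames[1], prev[['y', 'x']].values, diameter=19)
```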

apiszcz commented 8 years ago
  1. Dask: I have peeked at it; it looks good for certain approaches, e.g. with castra.
  2. Low-level: then it might include CUDA and its dependencies.
  3. I will double-check my 0 value; the images are grayscale.

apiszcz commented 8 years ago

IPython setup on Windows is a bit more work than multiprocessing.

danielballan commented 8 years ago

Yes, very excited about dask. We may work on integrating pims with it better. Not sure yet what shape that would take.

apiszcz commented 8 years ago

I hope to spend more time with PIMS, including getting it to read mp4 files instead of a pile of PNGs (I expect it can use FFmpeg?). Very nice API! Thanks.

danielballan commented 8 years ago

Yep, it uses FFmpeg.

Resonanz commented 8 years ago

If I understand correctly, there are two ways we could improve batch processing. The first is to use multiple CPU cores, which most recent PCs would have. The second is to use CUDA:

https://developer.nvidia.com/how-to-cuda-python

I wonder: has anyone rewritten the Crocker-Grier algorithm for CUDA, and if so, could that be called directly from trackpy? See the start of this paper:

http://iopscience.iop.org/1367-2630/16/7/075010/media/NJP497609suppdata.pdf

There is also mention of CUDA on Grier's page:

http://physics.nyu.edu/grierlab/software.html

nkeim commented 8 years ago

BeadTracker sounds very fast! But it uses CUDA only for the convolution operations in the bandpass step.

If CUDA implementations are widely available and are callable from Python by mortals, it would be an excellent first step, as these operations account for roughly half of the feature-finding time. If no good implementations exist, numbapro provides a reasonable method that is free for academic use.
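
For a sense of what that first step would touch, here is a rough stand-in for the convolution-heavy part of a Crocker-Grier-style bandpass (a small Gaussian blur minus a boxcar average), written so a drop-in GPU array library could take it over. This is not trackpy's actual implementation, it uses cupy rather than numbapro purely for illustration, and the availability of the cupyx.scipy.ndimage filters is an assumption:

```python
import numpy as np
from scipy import ndimage

try:
    # Assumption: a CUDA-capable GPU plus cupy, whose cupyx.scipy.ndimage
    # exposes ndimage-style filters.
    import cupy as xp
    from cupyx.scipy import ndimage as xndimage
except ImportError:
    xp, xndimage = np, ndimage


def rough_bandpass(image, noise_size=1, smoothing_size=11):
    # Small Gaussian blur minus boxcar background, clipped at zero.
    img = xp.asarray(image, dtype=xp.float32)
    blurred = xndimage.gaussian_filter(img, noise_size)
    background = xndimage.uniform_filter(img, smoothing_size)
    return xp.maximum(blurred - background, 0)
```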

Doing the rest of feature-finding on a GPU would be a good idea, but challenging. Doing linking on a GPU would probably require a wholesale rethinking of the algorithm, if it made sense at all.

Parallelizing feature-finding of a large movie presents a small scheduling/concurrency problem. The feature-finding tasks are producers, and the consumer is link_df_iter() or PandasHDFStore.put(). In my experience the feature-finding needs to be throttled or you will get a large backlog in memory. So some cleverness is required.
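
One hedged sketch of such throttling: keep only a bounded window of feature-finding futures in flight, and let the linker pull results one frame at a time (the window size and the locate/link parameters are placeholders):

```python
import collections
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
import trackpy as tp


def throttled_locate(frames, max_in_flight=8, **kwargs):
    # Yield per-frame locate() results, never running more than
    # max_in_flight frames ahead of the consumer.
    with ProcessPoolExecutor() as pool:
        pending = collections.deque()
        for frame_no, image in enumerate(frames):
            pending.append((frame_no, pool.submit(tp.locate, image, **kwargs)))
            if len(pending) >= max_in_flight:
                done_no, fut = pending.popleft()
                f = fut.result()
                f['frame'] = done_no
                yield f
        while pending:
            done_no, fut = pending.popleft()
            f = fut.result()
            f['frame'] = done_no
            yield f


# The consumer (the linker) then sets the pace:
# linked = pd.concat(tp.link_df_iter(throttled_locate(frames, diameter=19),
#                                    search_range=5))
```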

While we're at it, another HPC-friendly feature would be the ability to pickle the linker and resume where you left off. This way a large movie could be split into chunks, to overcome the run-time limitations of some cluster environments.

danielballan commented 8 years ago

I read that at least some of numbapro, including GPU code, has been open-sourced as of a couple of days ago.

apiszcz commented 8 years ago

I need to have it working on Windows. It'll be a few weeks before I need to use mp4 vs. the current .png approach.

danielballan commented 8 years ago

There are some people who have it working on Windows. Check out the PyAV GitHub issues. As soon as I can replicate their success, I will make conda packages and it will be as easy as pie.

apiszcz commented 8 years ago

Thank you for the feedback.

  1. I am using the MKL-compiled version of numpy/scipy (for Windows) from http://www.lfd.uci.edu/~gohlke/pythonlibs/
  2. Dask: reading some of the other parallel-thinking posts, I see the coupling between tracking and detecting. I was hoping those could be separate stages/phases when needed.
  3. There are only a few features; I'm attempting to blur them into the circular, blob-like objects that trackpy's locate appears optimized for. The initial blurring attempt has resulted in a much more consistent track.
  4. Sub-regions (ROI): the number of objects will depend on the input data, which I'm not controlling, so it will not be consistent.
  5. The background pixel is black in the grayscale image, with a value of 0.

apiszcz commented 8 years ago

I have it (PyAV) and the other libraries ready; you will hear about it either way :). Thanks.

apiszcz commented 8 years ago

The percentile default in the version of trackpy I have is 64; setting it to 95 did not improve performance over the default. It is currently processing 500 frames in 139 seconds. ~95% of the frames have one object, 4% have 2 or 3 objects, and the remaining 1% have no objects.

hadim commented 8 years ago

You might also want to consider joblib (https://pythonhosted.org/joblib/). It is a very light wrapper around the multiprocessing module.

The API is very clean and simple to work with. I have used it for a few years, and it makes the code more readable and easier to maintain than using the multiprocessing module directly.
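
A small illustration with feature finding (the diameter is a placeholder; `frames` is assumed to be an existing image sequence):

```python
import pandas as pd
import trackpy as tp
from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores.
results = Parallel(n_jobs=-1)(
    delayed(tp.locate)(frame, diameter=11) for frame in frames)
for frame_no, f in enumerate(results):
    f['frame'] = frame_no
features = pd.concat(results, ignore_index=True)
```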

nkeim commented 8 years ago

Follow-up on one of my earlier comments: I've implemented a "throttled parallel map" function that avoids excessive accumulation of results in memory: https://gist.github.com/nkeim/24428112a24a264b6d75

This thread is becoming a trove of resources for whoever eventually tackles this problem…

lagru commented 6 years ago

I haven't taken the time to read this full thread, but with #499 being merged this issue may be a candidate for closing.

nkeim commented 6 years ago

@lagru I agree, it's time to close this issue. #499 solves the problem for most users without introducing new dependencies. To take this further, we'd need to better identify what are the obstacles to using e.g. dask with common tracking workflows, and tell users about possible solutions. That's a conversation I'd want to be a part of—I use dask for my very large data sets, but because I want to feed the locate results directly to the linker (which is the bottleneck), I can't use dask's higher-level APIs without running out of memory. I do make heavy use of a modified version of the "throttled map" function I posted above.