soft-matter / trackpy

Python particle tracking toolkit
http://soft-matter.github.io/trackpy

memory usage in locate and linking #373

Open zoeith opened 8 years ago

zoeith commented 8 years ago

Hello, we've been trying to "process an unlimited number of frames" but are hitting a hiccup with memory usage. Running in a Jupyter notebook, the following eventually gobbles up all 128 GB of RAM on our workstation while processing 200k frames with ~14k features per frame. Unfortunately, the scale needed to reproduce this also makes it difficult to provide an MWE. Any hints on what might be going on, or where to start with debugging?

import time

import pims
import trackpy as tp

frames = pims.ImageSequence('/path/to/my/data/*.tif', as_grey=True)

# Feature finding: locate features in every frame and stream them into an HDF5 store.
start = time.time()
with tp.PandasHDFStore('/path/to/my/store.h5') as s:
    tp.batch(frames, 5, minmass=2, percentile=10, invert=False, characterize=False, output=s)
end = time.time()
print(end - start)

# Linking: iterate over the stored frames and write the linked results back to the store.
start = time.time()
with tp.PandasHDFStore('/path/to/my/store.h5') as s:
    for linked in tp.link_df_iter(s, 0.4, memory=5):
        s.put(linked)
end = time.time()
print(end - start)

The code made it through the feature finding OK, and then memory usage built up during linking. We're on Anaconda (Python 2.7), updated to the latest pandas, numba, trackpy, etc.

nkeim commented 8 years ago

It's great to learn about someone else scaling trackpy! I would start by testing two things:

  1. Linking without storing: run the linker over your feature store but don't write the linked results back, to see whether memory still grows during linking alone (a sketch follows below).
  2. Linking the features of the same frame over and over, to check whether the growth depends on the actual data or only on the number of frames processed.

The other major memory leak we had in the past was when there were large numbers of dropped particles over many frames. That was fixed in 0.3.0, and I routinely track 30k particles for 20k frames with low memory consumption, but in general we have never had a formal "stress test" for trackpy. It would be nice to have one.
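
Concretely, the first test could look something like the following (untested sketch; it is just your snippet with the s.put call removed, so nothing is written back to HDF5):

import trackpy as tp

# Link straight from the existing feature store but discard the output.
# If memory stays flat here, the growth comes from storage, not from linking itself.
with tp.PandasHDFStore('/path/to/my/store.h5') as s:
    for linked in tp.link_df_iter(s, 0.4, memory=5):
        pass  # intentionally not calling s.put(linked)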

I'm very curious about what you find and I hope we can solve this quickly!

zoeith commented 8 years ago

@nkeim just a brief update: I kicked off your first suggestion, i.e. linking without storing, and while I'm only ~100k frames in, the memory usage of the process seems reasonable and has stayed constant for a while now. So I guess testing the second version (using the same frame repeatedly) isn't necessary.

Does this mean there's a problem in pandas hdfstore somewhere? I don't see anything obvious in framewise_data.py that would be an issue. I've run into other problems with pandas storing large amounts of data (in table format in particular).

By the way, is there a reason why there isn't a particle_wise accessor class? I'm pretty sure we'd like this functionality in my group. An hdfstore indexed both ways would be great. I guess it would eat a lot of space but it's kind of necessary for speedy operations on datasets this big.

nkeim commented 8 years ago

I was afraid of this! Beneath PandasHDFStore are pandas, pytables, and HDF5, and all have limitations. I have a few suggestions, and others can weigh in:

  1. Try PandasHDFStoreSingleNode, which literally stores all of your data in a single on-disk table. This may actually scale better than storing frames separately, since HDF5 is optimized for tens of nodes in a file, not 10^5 (trackpy.PandasHDFStoreBig is a partial workaround for this limitation). It is a relatively thin wrapper around pandas.HDFStore in "table" mode, which offers database-like indexing and querying, so in principle it would also help with particle-wise queries (see the sketch after this list). A thorough reading of the relevant pandas (and maybe pytables) documentation should get you decent performance.
  2. Go one level lower and use pytables directly. There is some slippage between pytables and pandas that may be leading to problems. Again, this should make particle-wise queries possible.
  3. Partition your data set into tens of "shard" files, each with a more manageable number of frames. Write a framewise_data class that comprises many PandasHDFStoreBig objects.
  4. Roll up your sleeves and use sqlite, which is an industrial-strength relational database. I suspect that would be a pretty steep learning curve unless you have prior SQL experience.
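
To make option 1 concrete, here is a rough sketch at the pandas level (untested; the file name, key, and data_columns choice are illustrative, not trackpy's actual layout):

import pandas as pd

# A toy linked-features table; in practice each chunk would come from tp.link_df_iter.
linked_chunk = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0],
                             'frame': [0, 0], 'particle': [0, 1]})

with pd.HDFStore('linked_single_node.h5') as store:
    # Appending in 'table' mode with data_columns makes frame and particle queryable on disk.
    store.append('linked', linked_chunk, data_columns=['frame', 'particle'])

    # Frame-wise read:
    one_frame = store.select('linked', where='frame == 0')
    # Particle-wise read, i.e. the accessor you asked about:
    one_track = store.select('linked', where='particle == 1')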

…and one more totally wild idea: If you have few dropped/added particles over the course of your movie, that means you have a reasonable upper bound M on the particle ID. You could then use pytables or h5py to store your data in an M x N x c array, where N is the number of frames and c is the number of data columns. Even though many entries would be blank, with appropriate compression this would not use a ton of extra space. pytables and especially h5py have strong support for this; you would probably want to use the "chunked" layout for your HDF5 file so that framewise writing and reading would have tolerable performance. This is arguably the simplest solution of all, though it is the most radical departure from the others.
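
A rough h5py version of that dense-array idea (untested sketch; M, N, and the number of columns are made-up, and the chunk shape is just one reasonable choice):

import h5py
import numpy as np

M, N, c = 15000, 200000, 4  # upper bound on particle IDs, frames, data columns

with h5py.File('tracks_dense.h5', 'w') as f:
    tracks = f.create_dataset('tracks', shape=(M, N, c), dtype='float32',
                              chunks=(M, 1, c),      # one chunk per frame
                              compression='gzip',    # blank entries compress away
                              fillvalue=np.nan)
    # Frame-wise write: all particles observed in frame j (random data as a stand-in).
    j = 0
    tracks[:, j, :] = np.random.rand(M, c).astype('float32')
    # Particle-wise read: the full trajectory of particle i (crosses many chunks, so slower).
    i = 42
    trajectory = tracks[i, :, :]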

caspervdw commented 8 years ago

I am totally not experienced with datasets of this scale, but it might be interesting to try out appending the data to a dask DataFrame: http://dask.pydata.org/
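
For instance (untested sketch, assuming the linked results already sit in a pandas "table"-format HDF5 store; the file name and key are illustrative), dask can scan such a table lazily in chunks:

import dask.dataframe as dd

# Nothing is read until .compute(); the table is processed chunk by chunk.
ddf = dd.read_hdf('linked_single_node.h5', key='linked', chunksize=1000000)
longest_tracks = ddf.groupby('particle').size().nlargest(10).compute()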

nkeim commented 8 years ago

@zoeith I forgot that I already implemented my 2nd proposed solution back when I was at Penn. You can find the code here: https://github.com/nkeim/pantracks/ It may be a little rusty at this point and I cannot vouch for how well it scales to your application. But it worked well when I was using it. Let me know if you have any questions.

And if you find something that works, or almost works, please post about it! It could turn into a welcome addition to trackpy.

tacaswell commented 8 years ago

I second number 2, and would go further and use h5py directly, but keep the total number of groups/datasets low. It is probably not the source of your memory consumption, but with large numbers of small datasets you can start to see performance degradation when reading data out.

The results of linking used to be track-wise, but that had its own performance issues. I think there is a way to inject a Track class that will provide that functionality, but I have been away from this code for too long :disappointed:.

Keeping both per-frame and per-particle information handy drove me to do strange things in https://github.com/tacaswell/tracking , both in the files and in the in-memory data structures.

zoeith commented 8 years ago

Thanks all. I think we're willing to sacrifice disk space to have efficient access to both frame-wise and track-wise representations, so I'll work toward building a data structure and access class that encapsulates that.

I will report back when I've got some real progress...

caspervdw commented 7 years ago

Any progress @zoeith?

zoeith commented 7 years ago

Ah, not really. We went with the simple fix: we divided the data into manageable chunks and treated them separately. That turned out to be OK for our purposes.

Unfortunately, I am unlikely to make much more progress on this anytime soon, though I recall looking into Dask and Blaze, as you suggested, and thinking they seemed promising.