Index-handling in filter functions and other feedback

lagru commented 6 years ago

Hello devs, first of all I'd like to thank you for your work on this small but promising library. I'm currently wrapping parts of it for a command-line tool intended for particle tracking at my university. Thanks to your great example section and generally well-documented API, that process is very easy and satisfying.

However, I also encountered some minor hiccups and annoyances. Maybe you will find my feedback on those useful:

Using pandas.DataFrame as the primary data container in your library is a choice I'm not sure about, and it is certainly not something I'm used to from other scientific libraries.
- I don't understand the index handling in the functions in filtering.py. During grouping the index is dropped (why not move the index back into a column?) and later "frame" is explicitly set as the index. In that last step the keyword drop=False is used, which doesn't make sense to me because any would-be index was already dropped during grouping. Is there a reason for this? To expand on this, wouldn't it be clearer if all functions treated dataframes and indexes the same way, e.g. left the index untouched, or were at least explicit about it in the docstring? I was confused several times about why this or that function chose to set an index. To me that often seemed arbitrary and unintuitive.
- I also think that in many cases the docstrings could be more explicit about the required columns and especially about the structure of the returned dataframes. I see this as a disadvantage of using pandas.DataFrame as the primary data container: it obscures the inputs and outputs compared to simple arrays passed through multiple well-documented arguments. Maybe a type declaration like df : pandas.DataFrame[frame, particle, x, y] could be used within the docstrings to make this clearer.
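
For illustration, here is a minimal pandas sketch of the index handling described in the first bullet. The function names and the sample data are hypothetical. The first function only mimics the pattern as described above (drop the index before grouping, then set "frame" as the index with drop=False) and is not copied from filtering.py. The second shows one possible way to leave the caller's index untouched.

```python
import pandas as pd

# Hypothetical linked features with a custom, non-default index.
tracks = pd.DataFrame(
    {
        "frame": [0, 1, 2, 0, 1],
        "particle": [0, 0, 0, 1, 1],
        "x": [1.0, 1.5, 2.0, 5.0, 5.1],
        "y": [1.0, 1.0, 1.1, 3.0, 3.2],
    },
    index=["a", "b", "c", "d", "e"],
)


def filter_short_tracks(df, threshold):
    """Mimic the described pattern: drop the index, group, re-index by frame."""
    grouped = df.reset_index(drop=True).groupby("particle")
    kept = grouped.filter(lambda g: g["frame"].count() >= threshold)
    # drop=False keeps "frame" as a regular column in addition to the index.
    return kept.set_index("frame", drop=False)


def filter_short_tracks_keep_index(df, threshold):
    """Alternative sketch: select rows with a mask so the index is untouched."""
    counts = df.groupby("particle")["frame"].transform("count")
    return df[counts >= threshold]


indexed_by_frame = filter_short_tracks(tracks, threshold=3)
index_untouched = filter_short_tracks_keep_index(tracks, threshold=3)
```

Note that in pandas the drop argument of set_index controls whether the "frame" column is removed from the columns once it becomes the index. It does not refer to the previous index, which is always replaced.
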
Your example on parallelized feature location got me wondering why you don't support something equivalent in trackpy.batch. Even multithreading can speed up this processing step considerably for large sets and is easy to implement without depending on ipyparallel. For example:

from functools import partial
from multiprocessing.pool import ThreadPool

import pandas as pd
import trackpy

def batch_threaded(frames, locate_kw, threads=4, report_hook=None):
    """Locate features in each frame using a pool of worker threads."""
    if report_hook is None:
        def report_hook(): pass  # no-op progress callback
    func = partial(trackpy.locate, **locate_kw)
    with ThreadPool(threads) as pool:
        stats = []
        # imap yields results in input order, so i is the position in frames
        for i, frame_stats in enumerate(pool.imap(func, frames)):
            frame_stats["frame"] = i
            stats.append(frame_stats)
            report_hook()
    return pd.concat(stats, ignore_index=True)
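
The same pattern can be tried without trackpy installed. The sketch below is only an illustration: fake_locate and make_frame are hypothetical stand-ins (they are not trackpy functions), and the synthetic frames contain one more bright pixel each so that the frame numbering is easy to check.

```python
from functools import partial
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def fake_locate(image, threshold):
    """Hypothetical stand-in for a feature finder: one row per bright pixel."""
    ys, xs = np.nonzero(image > threshold)
    return pd.DataFrame({"x": xs, "y": ys})


def make_frame(n_bright):
    """Build a 4x4 test image whose first n_bright pixels are set to 1."""
    image = np.zeros((4, 4))
    image.flat[:n_bright] = 1.0
    return image


frames = [make_frame(i + 1) for i in range(4)]

func = partial(fake_locate, threshold=0.5)
with ThreadPool(2) as pool:
    per_frame = []
    # pool.imap returns results in the order of the inputs, even if a later
    # frame finishes first, so the enumerate index matches the frame position.
    for i, found in enumerate(pool.imap(func, frames)):
        found["frame"] = i
        per_frame.append(found)

features = pd.concat(per_frame, ignore_index=True)
features_per_frame = features.groupby("frame").size().tolist()
```

Because the results arrive in input order, features_per_frame lists the counts frame by frame. Whether threads actually speed things up depends on how much of the per-frame work releases the GIL, which this sketch does not measure.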

Although calculating the displacement of linked particles between consecutive frames isn't hard, I think a function doing that might be a useful addition to this library. Would you agree?
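
To make the idea concrete, here is a minimal pandas sketch of such a helper. The name frame_displacements and its signature are hypothetical, and it assumes a dataframe with "frame" and "particle" columns plus the position columns. It is not an existing trackpy function.

```python
import pandas as pd


def frame_displacements(tracks, pos_columns=("x", "y")):
    """Displacement of each particle relative to the directly preceding frame.

    Hypothetical helper: rows are returned only where the same particle was
    also present in the frame immediately before.
    """
    pos_columns = list(pos_columns)
    # Reset the index first so that an index named "frame" cannot clash
    # with the "frame" column when sorting.
    ordered = tracks.reset_index(drop=True).sort_values(["particle", "frame"])
    grouped = ordered.groupby("particle")
    deltas = grouped[pos_columns].diff().add_prefix("d")
    frame_gap = grouped["frame"].diff()
    result = pd.concat([ordered[["frame", "particle"]], deltas], axis=1)
    # Keep only steps between consecutive frames (no gaps in the track).
    return result[frame_gap == 1].reset_index(drop=True)


# Deliberately unsorted sample data; particle 1 is missing in frame 1.
tracks = pd.DataFrame(
    {
        "frame": [2, 0, 1, 0, 2],
        "particle": [0, 0, 0, 1, 1],
        "x": [3.0, 0.0, 1.0, 5.0, 6.0],
        "y": [1.0, 0.0, 0.0, 5.0, 5.0],
    }
)
displacements = frame_displacements(tracks)
```

Each returned row carries the frame in which the step ends, so the dx and dy values describe the motion from the previous frame into that frame.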

I'm aware that this criticism is subjective and that these are only minor points, so please take this as well-meant feedback. Of course I would be happy to work on these points myself if help is desired! :slightly_smiling_face:

caspervdw commented 6 years ago

Hi @lagru, thanks for your feedback, it is always welcome! Per point:

I think @danielballan wrote the code of filtering.py in 2014/15, so he could best comment on that. In my opinion, your points are valid and a PR addressing those is most welcome. Also, the pandas 0.12 workaround function filter is no longer necessary.
Using dataframes as the primary data container confused me as well when I started using trackpy. More explicit docstrings would indeed help. As a guide: https://python-sprints.github.io/pandas/guide/pandas_docstring.html
Parallel batch has been on our wishlist for some time (#304). Your example would be welcome as a PR!
Maybe the function motion.relate_frames is what you are looking for?
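
As a rough illustration of the more explicit docstrings discussed in the second point, a numpydoc-style docstring could spell out the required columns and the structure of the return value. The function track_lengths below is hypothetical and only serves as an example. It is not part of trackpy.

```python
import pandas as pd


def track_lengths(tracks):
    """Count the number of distinct frames in which each particle appears.

    Parameters
    ----------
    tracks : pandas.DataFrame
        Linked features. Required columns: ``frame`` (int) and
        ``particle`` (int). Other columns and the index are ignored.

    Returns
    -------
    pandas.Series
        Number of distinct frames per particle, named ``length`` and
        indexed by ``particle``.

    Raises
    ------
    ValueError
        If a required column is missing.
    """
    missing = {"frame", "particle"} - set(tracks.columns)
    if missing:
        raise ValueError(f"tracks is missing required columns: {sorted(missing)}")
    return tracks.groupby("particle")["frame"].nunique().rename("length")


tracks = pd.DataFrame({"frame": [0, 1, 2, 0, 1], "particle": [0, 0, 0, 1, 1]})
lengths = track_lengths(tracks)
```

Checking the required columns up front also turns a confusing KeyError deep inside pandas into an error message that names the missing columns.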

nkeim commented 6 years ago

Addressing the broader point of pandas: it does have a learning curve, and at first I wasn't sure what was so great about it. But it has made trackpy easier to write and, in some places, both faster and more versatile than we would have been willing to make it ourselves. It is a very natural fit for tracks data, so I encourage you to learn more about it. I think that @danielballan would agree with me that more scientists should be using it!

That said, docstrings, tutorials, and (maybe) helper functions could make it easier to use trackpy if you are unfamiliar with pandas. To echo @caspervdw, please make a note of the sticking points you encounter and please do suggest changes!

soft-matter / trackpy

Index-handling in filter functions and other feedback #498