Particles with the same distance

snilsn commented 1 year ago

Hello everybody! During experiments with artificial datasets, I have encountered a situation where I am not able to achieve reproducible behavior of trackpy. This happens when there are two particles in a frame, but only one particle in the next frame, with equal distance to both particles.

There are two options for linking in such a case and both are equally correct (or incorrect) without using any predictors, but it seems that it is random which option is chosen by trackpy. Is there any way to achieve reproducibility here, e.g. by setting a random seed?

I'm using trackpy v0.5.0. A minimal example:

import trackpy as tp
import pandas as pd
import matplotlib.pyplot as plt

d = {'frame': [1, 1, 2], 'x': [0, 2, 1], 'y': [0, 0, 1]}
df = pd.DataFrame(data = d)

fig, ax = plt.subplots(ncols = 3, nrows = 3, figsize = (10, 10), sharex=True, sharey=True)

for axes in ax.flatten():
    track = tp.link(df, 10)
    tp.plot_traj(track, 
                 ax = axes, 
                 plot_style={'marker':'x'}
                )
plt.show()

[Image: variants]

nkeim commented 1 year ago

Thanks for this minimal example! This is an edge case that might have to do with the performance improvement in #597 that was discussed further in #601.

Is the output of your script consistent from run to run? If not, we identified there that setting the PYTHONHASHSEED environment variable before Python starts could make the behavior stable. Is that something you can try?
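
As a rough sketch of what "before Python starts" means in practice: the variable is read once at interpreter startup, so it has to be present in the environment of the process that launches Python. The snippet below starts a child interpreter with PYTHONHASHSEED fixed. The helper name `run_with_hash_seed` and the seed value 0 are illustrative choices, not part of trackpy.

```python
import os
import subprocess
import sys


def run_with_hash_seed(code, seed="0"):
    """Run `code` in a fresh interpreter with PYTHONHASHSEED fixed.

    Hypothetical helper for illustration; not part of trackpy.
    """
    # Copy the current environment and add the seed for the child process.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # String hashes match across interpreter launches only when the seed is fixed.
    snippet = "print(hash('trackpy'))"
    print(run_with_hash_seed(snippet), run_with_hash_seed(snippet))
```

In a POSIX shell, the equivalent is `PYTHONHASHSEED=0 python script.py`.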

We never quite settled the question of whether to make this configurable, as proposed in #657, and so I ultimately decided to follow "YAGNI". But your own thoughts would be appreciated!

nkeim commented 1 year ago

(The other possibility is that this is an instability in how cKDTree handles degeneracy. That might be much harder to overcome. An easy thing to try would be to run the linker with the argument neighbor_strategy='BTree'.)

snilsn commented 1 year ago

The output of the script differs from run to run, but setting PYTHONHASHSEED beforehand indeed provides stability (both options are still present, but in the same order every run). So thanks for this suggestion!

I encountered this while designing tutorials for another Python module that uses some trackpy functionality, and it caused some headaches, but it is obviously a pretty rare case in natural datasets. If it occurs, it could lead to small but hard-to-detect reproducibility problems, especially if the linking is part of a larger analysis.

Additionally, there seems to be no comfortable way to set PYTHONHASHSEED in Jupyter.
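
One possible workaround, sketched here under assumptions: Jupyter kernel specs accept an `env` entry in `kernel.json`, so a dedicated kernel can be started with the variable already set. The directory name, display name, and seed below are hypothetical, and the resulting spec would still need to be registered with Jupyter.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_seeded_kernelspec(target_dir, seed="0"):
    """Write a kernel.json that launches ipykernel with PYTHONHASHSEED set.

    Names and paths here are illustrative assumptions.
    """
    spec = {
        "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": f"Python (PYTHONHASHSEED={seed})",
        "language": "python",
        # Environment variables applied when the kernel process is started.
        "env": {"PYTHONHASHSEED": str(seed)},
    }
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "kernel.json"
    path.write_text(json.dumps(spec, indent=2))
    return path


if __name__ == "__main__":
    # Write into a temporary directory just to show the generated file.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_seeded_kernelspec(Path(tmp) / "python-hashseed0")
        print(path.read_text())
```

Registering the directory (for example with `jupyter kernelspec install --user <dir>`) should make the kernel selectable in the notebook UI. Setting `os.environ['PYTHONHASHSEED']` inside an already running kernel has no effect on that kernel, because the seed is read at interpreter startup.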

nkeim commented 1 year ago

Good! I suspect that trackpy v0.4.x would have had the same inconsistent behavior for degenerate candidates… it previously sorted candidates by distance only, which for degenerate candidates would have changed nothing.
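
To illustrate the point about sorting by distance only (a generic sketch, not trackpy's actual linking code): Python's sort is stable, so two candidates at exactly the same distance keep whatever order they arrived in, and the "winner" is decided upstream of the sort. The coordinates are taken from the minimal example above; `nearest_candidate` is a hypothetical helper.

```python
import math


def nearest_candidate(target, candidates):
    """Return the candidate closest to `target`, sorting by distance only."""
    return sorted(candidates, key=lambda p: math.dist(p, target))[0]


target = (1, 1)          # the single particle in frame 2
a, b = (0, 0), (2, 0)    # the two particles in frame 1

# Both candidates are at exactly the same distance, so the sort cannot separate them.
assert math.dist(a, target) == math.dist(b, target)

print(nearest_candidate(target, [a, b]))  # first in input order wins
print(nearest_candidate(target, [b, a]))  # reversed input, the other candidate wins
```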

In any case, I don't think we even considered degeneracy in our earlier discussion. I can see now how it might arise even in real datasets, if positions can only be determined to the nearest pixel. It seems like the most correct behavior would be to issue a warning—from a scientific perspective, it's bad for trackpy to silently inject an arbitrary choice into your results, whether that choice is consistent or not. However, properly checking for degeneracy has to be done during linking, and it would certainly hinder performance in every other case, so it would have to be optional.
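
A minimal sketch of what such an optional check could look like, run as a separate pass outside the linker: for each particle in the later frame, it looks for two or more earlier particles tied for nearest within the search range and warns about them. The function name, the tolerance, and the brute-force loop are all assumptions for illustration. This is not trackpy API, and it only catches nearest-neighbor ties, not every degenerate subnetwork.

```python
import math
import warnings


def find_degenerate_links(prev_points, next_points, search_range, tol=1e-9):
    """Return [(next_index, [prev_indices])] where the nearest neighbor is tied.

    Hypothetical helper, O(N*M) brute force; for illustration only.
    """
    ties = []
    for j, q in enumerate(next_points):
        # Distances from every earlier particle to this later particle.
        dists = [(math.dist(p, q), i) for i, p in enumerate(prev_points)]
        in_range = [(d, i) for d, i in dists if d <= search_range]
        if len(in_range) < 2:
            continue
        d_min = min(d for d, _ in in_range)
        # Earlier particles whose distance matches the minimum within `tol`.
        tied = [i for d, i in in_range
                if math.isclose(d, d_min, rel_tol=0.0, abs_tol=tol)]
        if len(tied) > 1:
            ties.append((j, tied))
            warnings.warn(
                f"particle {j} in the later frame is equidistant from "
                f"earlier particles {tied}; the link is arbitrary"
            )
    return ties


if __name__ == "__main__":
    # Frames from the minimal example: two particles, then one.
    print(find_degenerate_links([(0, 0), (2, 0)], [(1, 1)], search_range=10))
```

Because the check is a separate pass, it costs nothing unless it is called, which matches the requirement that it be optional.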

I'm going to close this and leave it as a reference (or warning!) for future users, unless there's a simpler solution I'm not seeing. Thanks again for so perfectly documenting this behavior, @snilsn!

snilsn commented 1 year ago

After a bit of consideration, I think I have to bother you again, @nkeim.

There are two more questions I have:

Why isn't this happening in the opposite case (i.e. going from 1 particle to 2 with the same distance)? This example shows only 9 attempts, but I tried far more. There seems to be only one valid option for trackpy in that case, despite both being equally correct again.

import trackpy as tp
import pandas as pd
import matplotlib.pyplot as plt

d = {'frame': [2, 2, 1], 'x': [0, 2, 1], 'y': [0, 0, 1]}
df = pd.DataFrame(data = d)

fig, ax = plt.subplots(ncols = 3, nrows = 3, figsize = (10, 10), sharex=True, sharey=True)

for axes in ax.flatten():
    track = tp.link(df, 10)
    tp.plot_traj(track, 
                 ax = axes, 
                 plot_style={'marker':'x'}
                )
plt.show()

[Image: split]

snilsn commented 1 year ago

From our conversation, I assumed that in the original case both linking options would be equally probable. But after some experiments, I found that one option is far more probable than the other. Is there any explanation for that?

import trackpy as tp
import pandas as pd
import matplotlib.pyplot as plt

d = {'frame': [1, 1, 2], 'x': [0, 2, 1], 'y': [0, 0, 1]}
df = pd.DataFrame(data = d)

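lengths = []
for _ in range(5000):  # repeat the linking to sample which option gets chosen
    track = tp.link(df, 10)
    # number of frames in which particle 0 appears (1 or 2)
    lengths.append(len(track.where(track['particle']==0).dropna()))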

plt.hist(lengths, bins=[0.8, 1.2, 1.8, 2.2])
plt.xticks([1, 2])
plt.xlabel('lifetime in frames')
plt.ylabel('count')
plt.title('particle 0')

[Image: hist]

soft-matter / trackpy

Particles with the same distance #707