talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Fail gracefully when encountering seeking issues during inference #1711

Open roomrys opened 6 months ago

roomrys commented 6 months ago

The random access seeking issue has been longstanding and a major pain point.

We often tell our users to re-encode their videos, but this is a pain: it increases disk footprint and requires an extra processing step, among other things. It's also buried deep in the docs, so most people don't find it. Finally, it's a terrible user experience when you run inference on an entire video (which may take hours!) only to have it crash on the very last frame...

In some cases, the same video file is seekable on one platform but not on another because of differences in the OS, ffmpeg, and other platform-dependent layers.

See #932 and #945 for an in-depth analysis of the root problem.

Since there doesn't seem to be a very good universal solution, one thing we could do is to add a try/except in the inference block (something like we do in this gist).

(@roomrys: This is a good dataset that when git cloned seems to always throw the KeyError. -- @talmo: I can't reproduce on my end :()
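
To make the try/except idea concrete, here is a minimal sketch of one way an inference loop could fail gracefully. It is not the code from the gist or from SLEAP itself: `run_inference_gracefully`, `read_frame`, and `predict` are hypothetical names, and the choice to skip unreadable frames (rather than stop at the first failure) is an assumption about the desired behavior.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def run_inference_gracefully(
    frame_indices: Sequence[int],
    read_frame: Callable[[int], object],  # hypothetical: loads one frame by index
    predict: Callable[[object], object],  # hypothetical: runs the model on one frame
    seek_errors: Tuple[type, ...] = (KeyError, IndexError, OSError),
) -> Tuple[Dict[int, object], List[int]]:
    """Run inference frame by frame, recording frames that could not be read.

    Returns the predictions gathered so far plus the list of frame indices
    whose read failed, so hours of work are not lost to one bad seek.
    """
    predictions: Dict[int, object] = {}
    failed: List[int] = []
    for idx in frame_indices:
        try:
            frame = read_frame(idx)
        except seek_errors:
            # Assumed policy: note the failure and move on to the next frame.
            failed.append(idx)
            continue
        predictions[idx] = predict(frame)
    return predictions, failed
```

The caller can then save `predictions` as usual and warn about the indices in `failed`, which is also a natural place to point users at the re-encoding docs.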


Other relevant issues/discussions:

talmo commented 6 months ago

#1712 implements the try-except version of this solution.

There are still some problems that we might need to address moving forward:

The last point is something we were investigating to get at the root cause. The main suspect is tf.data.Dataset and how it wraps the VideoReader provider, which makes the calls to the actual sleap.Video and its backends (e.g., OpenCV).

Here's a small exploratory Colab comparing different ways to access videos with and without tf.data.Dataset.

And here's a Gist implementing a standalone sequential inference script. It uses threading to read frames asynchronously from the inference thread, but in general it works very similarly to the sleap-track CLI (it's intended to be a nearly drop-in replacement). A couple of interesting observations from this experiment:


In any case, the fix in #1712 should move us forward and we can revisit to address the above concerns as they come up, or punt it to sleap-io.
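
For reference, the threaded sequential reading mentioned above could look roughly like the sketch below. This is not the Gist's code: `prefetch_frames`, `read_frame`, and `max_prefetch` are hypothetical names, and the sketch assumes frames are read strictly in order by a single background thread while the main thread runs inference.

```python
import queue
import threading
from typing import Callable, Iterator, Sequence, Tuple

_DONE = object()  # sentinel marking the end of the stream


def prefetch_frames(
    frame_indices: Sequence[int],
    read_frame: Callable[[int], object],  # hypothetical: loads one frame by index
    max_prefetch: int = 8,  # assumed buffer size; bounds memory use
) -> Iterator[Tuple[int, object]]:
    """Yield (index, frame) pairs read sequentially on a background thread."""
    buffer: queue.Queue = queue.Queue(maxsize=max_prefetch)

    def reader() -> None:
        try:
            for idx in frame_indices:
                buffer.put((idx, read_frame(idx)))
        except Exception as exc:
            # Hand the error to the consumer instead of dying silently.
            buffer.put(exc)
        finally:
            buffer.put(_DONE)

    # Daemon thread so an abandoned generator cannot block interpreter exit.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item  # re-raised in the inference thread
        yield item
```

Because read errors are re-raised in the consuming thread, a loop such as `for idx, frame in prefetch_frames(...)` can wrap its body in the same kind of try/except used for graceful failure and still keep the predictions made before the error.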