Random-access seeking in videos has been a longstanding and major pain point.

We often tell our users to re-encode their videos, but this is a hassle: it increases the disk footprint and requires an extra processing step. The advice is also buried deep in the docs, so most people never find it. Finally, it's a terrible user experience to run inference on an entire video (which may take hours!) only to have it crash on the very last frame...
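For context, "re-encoding" usually means one ffmpeg pass per video. The minimal sketch below only builds the argument list for such a pass; the flags (H.264 via `libx264`, `yuv420p` pixel format, `superfast` preset, CRF 23) are an assumed common choice and not necessarily the exact command the SLEAP docs recommend, and `build_reencode_command` is a hypothetical helper.

```python
import shlex


def build_reencode_command(src: str, dst: str, crf: int = 23) -> list:
    """Build (but do not run) an ffmpeg command that re-encodes src to H.264.

    Assumption: these flags are a typical re-encode, not an official recipe.
    """
    return [
        "ffmpeg",
        "-y",                   # overwrite dst if it already exists
        "-i", src,              # input video
        "-c:v", "libx264",      # re-encode the video stream as H.264
        "-pix_fmt", "yuv420p",  # widely supported pixel format
        "-preset", "superfast", # favor encoding speed over compression
        "-crf", str(crf),       # quality: lower means better and larger
        dst,
    ]


if __name__ == "__main__":
    cmd = build_reencode_command("input.mp4", "input.reencoded.mp4")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

This is exactly the extra step (and the second copy of every video on disk) that the paragraph above describes as a burden on users.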

In some cases, the same video file can be seeked on one platform but not on another, due to platform-dependent implementation differences in the OS, ffmpeg, and other layers.

See #932 and #945 for an in-depth analysis of the root problem.

Since there doesn't seem to be a good universal solution, one thing we could do is add a try/except around the inference block (something like what we do in this gist).
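A minimal sketch of the try/except idea, assuming the frame reader raises an exception such as `KeyError` when a seek fails. `read_frame` and `predict` are hypothetical stand-ins for the video backend and the model, not SLEAP APIs. Inference stops at the first failed read and returns everything predicted so far, plus the indices that were never processed, instead of crashing and losing hours of work.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger(__name__)


def predict_frames_gracefully(
    read_frame: Callable[[int], object],
    predict: Callable[[object], object],
    frame_inds: Sequence[int],
) -> Tuple[List[Tuple[int, object]], List[int]]:
    """Run predict() over frames, stopping cleanly at the first read failure.

    read_frame and predict are hypothetical placeholders; only the
    error-handling pattern is the point here.
    Returns (predictions made so far, frame indices left unprocessed).
    """
    results = []
    for i, idx in enumerate(frame_inds):
        try:
            frame = read_frame(idx)
        except (KeyError, IndexError, OSError) as err:
            # Keep what we already have instead of discarding the whole run.
            logger.warning("Failed to read frame %d (%r); stopping early.", idx, err)
            return results, list(frame_inds[i:])
        results.append((idx, predict(frame)))
    return results, []


if __name__ == "__main__":
    def flaky_reader(idx):
        if idx >= 4:
            raise KeyError(idx)  # simulate a seek failure near the end
        return idx * 10

    preds, skipped = predict_frames_gracefully(flaky_reader, lambda f: f + 1, range(6))
    print(preds, skipped)
```

Stopping at the first error (rather than skipping the bad frame and continuing) mirrors the truncation behavior discussed further down; skipping would be a one-line change but risks silently pairing predictions with the wrong frames if the decoder position has drifted.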

(@roomrys: This is a good dataset that, when git cloned, seems to always throw the KeyError. -- @talmo: I can't reproduce this on my end :()

Other relevant issues/discussions:

- #366
- #531
- #566
- #630
- #649
- #765
- #767
- #1095
- #1153
- #1508
- #1703
- #1707

#1712 implements the try/except version of this solution.

There are still some problems that we might need to address moving forward:

- Are there other places in the code that this affects?
- Do we get seeking issues earlier in videos? In that case, the output would be truncated as soon as the error happens.
- Do we get misaligned poses and video frames? This PR might mask the underlying problem in those cases.
- Why does this not happen during training, or in the GUI when seeking to the same frame?
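One way to probe the misalignment question is to decode a video front-to-back (treated as ground truth) and compare each frame against what a random-access seek returns for the same index. The sketch below assumes two hypothetical interfaces: an iterable of sequentially decoded frames and a `read_at(idx)` seeking reader; neither is a SLEAP API. A failed seek is also recorded as a mismatch.

```python
from typing import Callable, Iterable, List, Optional


def find_seek_mismatches(
    sequential_frames: Iterable[object],
    read_at: Callable[[int], object],
    same: Callable[[object, object], bool] = lambda a, b: a == b,
    max_frames: Optional[int] = None,
) -> List[int]:
    """Return frame indices where seeking disagrees with in-order decoding.

    sequential_frames: frames decoded front-to-back (hypothetical source).
    read_at: hypothetical random-access reader that seeks to an index.
    same: equality test; for NumPy arrays pass numpy.array_equal.
    """
    mismatches = []
    for idx, expected in enumerate(sequential_frames):
        if max_frames is not None and idx >= max_frames:
            break
        try:
            actual = read_at(idx)
        except Exception:
            mismatches.append(idx)  # seek failed outright
            continue
        if not same(expected, actual):
            mismatches.append(idx)  # seek landed on the wrong frame
    return mismatches


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3"]
    # Simulated buggy seek: off by one from index 2 onward.
    buggy_seek = lambda i: frames[i] if i < 2 else frames[i - 1]
    print(find_seek_mismatches(frames, buggy_seek))
```

An empty result on a given platform would suggest the try/except is not hiding silent misalignment there; a non-empty result before the final frames would also answer whether seeking breaks earlier in the video.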

We were trying to address the last point in order to get at the root cause. The main suspect is tf.data.Dataset and how it wraps the VideoReader provider, which makes the calls to the actual sleap.Video and its backends (e.g., OpenCV).

Here's a small exploratory Colab comparing different ways of accessing videos with and without tf.data.Dataset.

And here's a Gist implementing a standalone sequential inference script. It uses threading to read frames asynchronously from the inference thread, but otherwise it works much like the sleap-track CLI (it's intended to be a nearly drop-in replacement). A couple of interesting observations from this experiment:

- We still seem to get the seeking error in some cases, even without tf.data.Dataset.
- Using multiprocessing instead of threading results in strange errors from OpenCV. Maybe the combination of out-of-process reading and OpenCV is the root culprit?
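The threaded reading pattern can be sketched as a background producer thread that fills a bounded queue while the inference loop consumes from it. This is a generic illustration, not the Gist's actual code: `read_frame` is a hypothetical single-frame reader, and errors raised while reading are forwarded to and re-raised in the consuming thread so they can be handled there (e.g., with the try/except approach).

```python
import queue
import threading
from typing import Callable, Iterator, Sequence, Tuple

_DONE = object()  # sentinel marking the end of the stream


def prefetch_frames(
    read_frame: Callable[[int], object],
    frame_inds: Sequence[int],
    maxsize: int = 8,
) -> Iterator[Tuple[int, object]]:
    """Yield (idx, frame) pairs that are read in a background thread.

    read_frame is a hypothetical reader. The bounded queue caps how many
    decoded frames can be buffered ahead of the consumer.
    """
    q: "queue.Queue" = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            for idx in frame_inds:
                q.put((idx, read_frame(idx)))
        except BaseException as err:
            q.put(err)  # hand the failure to the consumer thread
        finally:
            q.put(_DONE)

    # Daemon thread: if the consumer stops early, a blocked worker
    # will not keep the interpreter alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        if isinstance(item, BaseException):
            raise item
        yield item
    thread.join()


if __name__ == "__main__":
    for idx, frame in prefetch_frames(lambda i: i * i, range(5)):
        print(idx, frame)  # inference on `frame` would go here
```

A thread keeps the video decoder in the same process as the inference loop, which sidesteps the out-of-process setup that produced the OpenCV errors noted above; whether that setup is really the root cause remains an open question.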

In any case, the fix in #1712 should move us forward; we can revisit the concerns above as they come up, or punt them to sleap-io.

Fail gracefully when encountering seeking issues during inference #1711