pupil-labs / pupil

Open source eye tracking
https://pupil-labs.com
GNU Lesser General Public License v3.0
1.49k stars 679 forks source link

Fix receiving high frequency notifications #1275

Closed mattem86 closed 6 years ago

mattem86 commented 6 years ago

Sometimes the mp4 files are not readable (even with VLC) – I understand, that is due to corrupted headers (although I do not understand why it is happening so often) .... what I missed until today, is that the eye*.timestamps file for corrupted videos is missing!

we use windows, 1.8.26.... issue is apparent with both codecs (big and small file) ... world video seems to be always fine. we're using pupil capture, not mobile.

In rare cases only one eye is unreadable, but the other is fine.

ffmpeg says "moov atom not found" for the video files.

So far no clue inside the logs – both eye videos in this recording are damaged and the timestamps are missing.

capture.log

mkassner commented 6 years ago

Locking at the log it seems that you are writing video files into some db?

Can you confirm that this is not a harddrive or PC issue?

mattem86 commented 6 years ago

No, the path is a bit lengthy, but the files are written on SSD. We experienced this on two different notebooks, either on SSD and HDD. Both running in Windows, maybe this is a hint.

If I recall it correctly, the problem did not show up in v1.6... since v1.7 there are some of the videos corrupted.

mkassner commented 6 years ago

Did any process crash (eye) during the recording. I have not seen this before.

mattem86 commented 6 years ago

No, everything looked normal. It can be that the signal for stopping the recording was sent twice, but this should not be a problem. The duration of the recording is about 5 min, we record in VGA resolution. One eye file is about 1.5 gig large. In a hexeditor it is visible that the writing of the mp4 file abruptly stops. One „hotfix“ to this problem could be writing of „web-optimized“ videos, where the header is written in the beginning. Additional to that I would highly recommend to either write the timestamp file for the video more robustly or think about of storing some way of frame index in pupil data. We have 20+ recordings with corrupted files.

mattem86 commented 6 years ago

Tomorrow I will try to roll back to v1.6 and make a few recordings to see if they are good.

mattem86 commented 6 years ago

So... I feel I got a little further. The problem could be, that the recording (Pupil Capture, respectively) is shut down ungracefully. It is totally understandable that the recording is not finished correctly, if the process is terminated. BUT – it seems that sometimes the recording will continue to record, even if the (R) in capture is not red anymore. Is this possible? We have videos which are 1 minute longer than expected where we already put our headset off. I do not supervise every measurement we take, but I'm pretty sure that the recording officially stopped. Maybe (!) something changed from v1.6 to v1.7 that the recording silently continued for several seconds?

mattem86 commented 6 years ago

(In the meantime I will find the person or code snippet which terminates pupil capture so barbaric)

mattem86 commented 6 years ago

Ok, that's it, at least a clue: If you look at log above and this one, you can see that in the above log the rows "eye* - INFO launchables.eye: Done Recording" are missing. If you look in the log here closely, you can see, that these rows are 83 seconds later than "world - INFO Saved recording"! In the screenshot you can see, that the R is not red anymore.

Second: We use the frame publisher plugin to show the eyes in VR. We turn it off during recording (via network) and on again, if the signal "recording_stopped" is received. We noticed that it takes a long time that the eyes in VR are coming back – today we saw that they come back, when the eye video is written (83 secs late).

bildschirmfoto 2018-08-24 um 10 09 49

capture (1).log

mattem86 commented 6 years ago

As I said, we never saw this behavior in 1.6 – so the question is: Why do the eye processes continue to record so long after receiving the stop recording signal?

mkassner commented 6 years ago

Ok. It looks like the eye process did for some obscure reason not receive the recording_stopped notification. Maybe the eye process was terminated early? (it would still save on close but maybe it crashed?)

mattem86 commented 6 years ago

no, the processes did run until the eye videos were written and stopped correctly. maybe it is a problem of getting two "stop recording" notifications shortly after another? we noticed that we have to use a delay of 2 seconds between starting the two eye processes via network because otherwise only one eye process is started.

mkassner commented 6 years ago

This could be a timing issue, but I dont know for sure. We would have to make some tests and look at the implementation closely.

mattem86 commented 6 years ago

Update: This is not an issue due to subsequent recording_stop signals. I have reduced it to a single call and it still happens regularly. Sometimes 5 seconds, sometimes 2 minutes. It's not for sure, but I have the feeling it's worse when using the "small file, more cpu" option.

mattem86 commented 6 years ago

Since it is so random, maybe it could be a issue with the garbage collection? Is everything flushed etc?

mattem86 commented 6 years ago

Update: I can confirm, this issue only exists in v1.8 .... v1.6 and v1.7 are working fine (Videos are written instantly when the signal for recording.should_stop arrives). I tested it under the exact same conditions. AND: It is only an issue under Windows – I ported my Unity project to Mac and ran it completely on Mac, and the videos are written instantly in v1.8.

papr commented 6 years ago

Thank you for the update and your debugging work! This helps us a lot!

mattem86 commented 6 years ago

I'm not sure about this, but it seems – don't ask me why, it looks very weird – that the issue only appears, if I do some Main-Thread Gameobject stuff in Unity within a running Coroutine. If the Coroutine runs without UI-related stuff (like yield return new Waitforseconds()), it seems to work. But of course this is essentiell for my application. As I said, I'm not sure about this, because I cannot think of a way these things are related.

mattem86 commented 6 years ago

It's weird, because the world video is written instantly, only the eye videos are delayed. So I thought the signal for terminating the recording is received correctly by pupil capture.

mattem86 commented 6 years ago

UPDATE: I GOT IT! It's not the activation of the gameobject itself, it's what is happening when this particular gameobject is active. Every iteration of the Update() method sends the position of this gameobject via NetMQ as an annotation ("notify.") ... in previous Pupillabs versions this was not a problem – it seems, that the v1.8 will block execution at some point as long as every annotation is received.

If the annotation capture plugin is running or not is not important, even if the "recording-flag" of the annotation is true or false: The writing of the world video timestamps is immediately after sending the recording stop – the eye videos timestamps are delayed in dependency of the duration of the simulation (the number of annotations so to speak). There are two gameobjects in my simulation which do this, so I end up sending 180 notifications per second (each with a <20 character string). Not much data, but somehow the eye processes are delayed on receiving the recording stop signal.

I hope this is understandable – if you want, I can fork the hmd-eyes so you can reproduce this.

I'm testing right now to use a different prefix than "notify" so maybe pupil will ignore these.

papr commented 6 years ago

Aha! So this line is at fault: https://github.com/pupil-labs/pupil/blob/master/pupil_src/launchables/eye.py#L463

It should be while instead of if!

mattem86 commented 6 years ago

Maybe, but this wasn't changed since v1.7. What was changed is this:
https://github.com/pupil-labs/pupil/blob/ca1a69cbda22c82d046a90863dd1d4e0d600bcd3/pupil_src/shared_modules/zmq_tools.py#L117

The zmq.POLLIN is new. The line you mentioned calls this!

mattem86 commented 6 years ago

If I did everything correctly, then sending the high frequency annotations through a separate publisherSocket with another prefix ("other.") does NOT fix the problem.

papr commented 6 years ago

I will fix the plugin and create a new example in the beginning of next week.

ckbaumann commented 6 years ago

To just copy notes by @papr for documentation:

Rewrite Annotation plugins such that

Goal: Same behavior as pupil_data_relay