ndrplz / dreyeve

[TPAMI 2018] Predicting the Driver’s Focus of Attention: the DR(eye)VE Project. A deep neural network learnt to reproduce the human driver's focus of attention (FoA) in a variety of real-world driving scenarios.
https://arxiv.org/pdf/1705.03854.pdf
MIT License

Differences between provided dataset and repo code #7

Closed: Amakri1020 closed this issue 5 years ago

Amakri1020 commented 5 years ago

Hi, I'm trying to train this model from scratch using the provided dataset. However, it seems the dataset doesn't quite match what the code expects, e.g. it contains .avi files instead of frame JPEGs. When I try to run:

python2 train.py --which_branch image

I end up with an error like:

ValueError: Provided path "/home/amakri/DREYEVE_DATA/23/frames/004465.jpg" does NOT exist.

Is there code somewhere in this repo that I've missed which does this sort of preprocessing and sets up the dataset to be run by the code? Or are these things I will just have to do myself?

DavideA commented 5 years ago

Hi @Amakri1020, and thank you for your interest.

Yes, in order to retrain the network, you should unroll all the .avi sequences (video_garmin.avi and video_saliency.avi) into frames. The code assumes the following structure:

&lt;sequence&gt;/frames/&lt;frame&gt;.jpg --> for input frames
&lt;sequence&gt;/saliency_fix/&lt;frame&gt;.png --> for fixation maps

As mentioned in [this issue](https://github.com/ndrplz/dreyeve/issues/6), you should comment out the lines of code that try to load annotations from the `saliency` subfolder (`saliency_fix` is the one to use).

Let me know if this helps,
Best,
D

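For reference, a minimal sketch (not part of the repo) of how the .avi sequences might be unrolled into the layout above. It assumes OpenCV is available, that frames are named with 6-digit zero-padded indices as suggested by the error path (`004465.jpg`), and that indexing starts at 0; the data root is copied from the error message and is just an example.

```python
import os
import cv2  # OpenCV, used here to read the .avi files


def unroll_video(video_path, out_dir, ext):
    """Dump every frame of video_path into out_dir as a zero-padded image file."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    cap = cv2.VideoCapture(video_path)
    idx = 0  # assumption: indices start at 0; check against what the data loader expects
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        cv2.imwrite(os.path.join(out_dir, '{:06d}.{}'.format(idx, ext)), frame)
        idx += 1
    cap.release()


data_root = '/home/amakri/DREYEVE_DATA'  # example root taken from the error message
for seq in sorted(os.listdir(data_root)):  # sequence folders, e.g. '23'
    seq_dir = os.path.join(data_root, seq)
    if not os.path.isdir(seq_dir):
        continue
    unroll_video(os.path.join(seq_dir, 'video_garmin.avi'),
                 os.path.join(seq_dir, 'frames'), 'jpg')
    unroll_video(os.path.join(seq_dir, 'video_saliency.avi'),
                 os.path.join(seq_dir, 'saliency_fix'), 'png')
```
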
Amakri1020 commented 5 years ago

Thanks for the quick response, that is helpful!

I am also curious whether there were instances where the driver looked at something outside the camera's FoV and, if so, how you dealt with those cases.

DavideA commented 5 years ago

That's an interesting question :)

It is likely that during the recording a driver took quick peeks outside the FoV (e.g., looking at the side mirrors). Anyway, the effect of such rapid shifts in attention is mitigated by the fixation map construction procedure. Indeed, as mentioned in the journal paper, that procedure temporally aggregates fixation points to build a single fixation map.
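To illustrate the idea, a minimal sketch of temporally aggregating fixation points into a single map. This is not the project's actual code: the window size and smoothing sigma are arbitrary assumptions, and the procedure described in the paper also registers points across neighboring frames, which is omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def build_fixation_map(fixations_per_frame, t, h, w, window=12, sigma=20):
    """Aggregate fixation points from frames [t-window, t+window] into one map.

    fixations_per_frame: list where entry i is an iterable of (x, y) fixations
    recorded at frame i. Returns an (h, w) map normalized to a peak of 1.
    """
    fmap = np.zeros((h, w), dtype=np.float32)
    for i in range(max(0, t - window), min(len(fixations_per_frame), t + window + 1)):
        for x, y in fixations_per_frame[i]:
            if 0 <= int(y) < h and 0 <= int(x) < w:
                fmap[int(y), int(x)] += 1.0  # accumulate fixation hits
    fmap = gaussian_filter(fmap, sigma=sigma)  # spread each point into a blob
    if fmap.max() > 0:
        fmap /= fmap.max()
    return fmap
```
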

Short answer: we don't deal with such cases. I don't think these situations are encoded in fixation maps in the first place. You could still get them by looking at the ETG videos and the raw fixation recordings. But I'm not sure.

D