udacity / sdc-issue-reports


Video suggests shuffling time-series data -- this has unintended consequences #403

Closed giladgressel closed 7 years ago

giladgressel commented 7 years ago

https://classroom.udacity.com/nanodegrees/nd013/parts/fbf77062-5703-404e-b60c-95b78b2f3f9e/modules/2b62a1c3-e151-4a0e-b6b6-e424fa46ceab/lessons/fd66c083-4ccb-4fe3-bda1-c29db76f50a0/concepts/9a67cfac-5392-4fa9-8ef9-5ac60b22c8f0

This video suggests that we should shuffle the images for the train/test split. All subsequent practice quizzes use train_test_split to randomly create training/testing data from the images.

While this is generally good practice for most ML problems, with time-series images (which is essentially what this data is) it can have unintended consequences.

Assuming the images are taken from video feeds, this creates a situation where the exact same physical car appears in both the training and testing sets, just in slightly different positions. This causes the classifier to overfit to the test data, because very similar images exist in both the training and test sets.

see this post for an example: https://carnd-forums.udacity.com/questions/36061571/how-to-reduce-false-positives

This is also the same problem we had with the P2 data -- see here: https://carnd-forums.udacity.com/questions/26217085/major-gap-between-test-acc-and-validation-acc-in-keras-traffic-sign-notebook

In general, I believe it is best practice to manually create testing/validation sets from video images by splitting based on time (when the image was taken), keeping frames from the same video together in training and never overlapping them with the test set. Since video frames are time-series data, we cannot simply shuffle randomly to avoid "ordering issues"; here, shuffling is what introduces the ordering issue.
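As a rough sketch of what I mean (the `vehicles/*.png` path and the assumption that sorted filenames reflect capture order are illustrative, not the actual dataset layout), the split could hold out a contiguous block of the sequence instead of using train_test_split:

```python
import glob

# Collect frames and keep them in capture order rather than shuffling.
# Assumption: sorting filenames reproduces the original frame order.
frames = sorted(glob.glob('vehicles/*.png'))

# Hold out the last 20% of the sequence as the test set, so no test frame
# is adjacent in time to a training frame.
split_idx = int(len(frames) * 0.8)
train_files = frames[:split_idx]
test_files = frames[split_idx:]
```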

@dhruvp @swwelch @napratin

ryan-keenan commented 7 years ago

Hi Gilad, Thanks for reporting this! In this case, I've played with the data a lot and I'm fairly certain that the issue reported in the forum post for P5 is due to image scaling issues. Many people have seen this issue already and the solution is just to be sure images are scaled consistently when extracting feature vectors.
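For anyone hitting that scaling issue, it usually comes down to different image formats being read on different scales. A minimal sketch of one way to normalize them before feature extraction, assuming matplotlib is used to read the images as in the lessons:

```python
import matplotlib.image as mpimg
import numpy as np

def read_image_scaled(path):
    """Read an image and return it as float32 in [0, 1], regardless of format.

    mpimg.imread returns float32 in [0, 1] for PNGs but uint8 in [0, 255]
    for JPEGs, so features extracted from mixed formats end up on different
    scales unless the images are normalized first.
    """
    img = mpimg.imread(path)
    if img.dtype == np.uint8:
        img = img.astype(np.float32) / 255.0
    return img
```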

giladgressel commented 7 years ago

I understand that scaling the images might fix the particular problem posted for p5.

But you are ignoring the larger question, which is: "should we be using random splits when the data is gathered sequentially?"

This course is fundamentally teaching a bad practice. Should I open a new issue that specifically targets only the bad practice of train_test_split on this type of video data?

indradenbakker commented 7 years ago

I totally agree with Gilad. Even though it may look like it doesn't affect anything, this is really bad practice in ML (it can be seen as data leakage: test data leaks into the training data, so it looks like your model is able to generalise when it actually isn't).

mxbi commented 7 years ago

I think Gilad raises a very good point here - as a rule in ML your train/validation split should mimic the train/test split as much as possible, since the purpose of a validation set is to estimate how well a model will perform on the test set.

If the validation set is testing something different to the test set (i.e. measuring generalisation to different frames in the same time series vs measuring generalisation to a future time series), then IMO the entire point of having a validation set in the first place has been lost. It would be good to split the data based on time instead of randomly, to avoid the sort of problems people were having in P2. I have personally been bitten by bad validation splits many times outside of Udacity, so I think it's a good idea to teach good practice in the course.
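As a concrete sketch of one way to do this (the `track_ids` array here is hypothetical, standing in for whatever identifies which video sequence each frame came from), sklearn's GroupShuffleSplit keeps all frames of a sequence on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: X = feature vectors, y = labels, track_ids = one id per
# image identifying which video sequence (time-series track) the frame is from.
X = np.random.rand(1000, 32)
y = np.random.randint(0, 2, 1000)
track_ids = np.repeat(np.arange(50), 20)  # 50 tracks of 20 frames each

# Unlike train_test_split, GroupShuffleSplit assigns every frame of a given
# track entirely to train or entirely to validation, so near-duplicate frames
# cannot leak across the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=track_ids))

X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
```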

ryan-keenan commented 7 years ago

Hi guys, thanks for all the insightful comments. I definitely understand your point. All the vehicle and non-vehicle images in the dataset provided for the quizzes in the classroom and for the project are drawn from frames of video, so the issue of sequences of images that are very similar to one another is a concern.

However, if you take a look at the subset of "vehicle" images used for the classroom quizzes, you'll find they are just a random couple thousand or so taken from the KITTI dataset and, while you might find the same vehicle pictured in there several times, the angles / lighting etc. are different enough to make the time-series issue a minimal concern. I did take a closer look at the "non-vehicle" images in the classroom quiz subset and found some fraction (~5%) of images that were close enough to identical to be a concern for overfitting. I've removed these and this update will appear in the classroom soon.

As for the project dataset, the GTI data is a collection of time-series images where your concerns are certainly valid. However, GTI only makes up 30% of the total dataset. The remaining 70% of "vehicle" images are taken from KITTI, where the time-series issue is not a serious concern. Non-vehicle images, GTI or otherwise, are also time-series from video, but because they are not centered on any particular object of interest, the concern about near-identical training images is minimal.

I have added the following warning/suggestion to the "Tips and Tricks for the Project" lesson regarding time-series images for this project (as well as a note in bold before the first classifier quiz):

Random shuffling of data

When dealing with image data that was extracted from video, you may be dealing with sequences of images where your target object (vehicles in this case) appear almost identical in a whole series of images. In such a case, even a randomized train-test split will be subject to overfitting because images in the training set may be nearly identical to images in the test set.

For the project vehicles dataset, the GTI* folders contain time-series data. In the KITTI folder, you may see the same vehicle appear more than once, but typically under significantly different lighting/angle from other instances.

While it is possible to achieve a sufficiently good result on the project without worrying about time-series issues, if you really want to optimize your classifier, you should devise a train/test split that avoids having nearly identical images in both your training and test sets. This means extracting the time-series tracks from the GTI data and separating the images manually to make sure train and test images are sufficiently different from one another.
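As a rough illustration of that manual separation (the frame-number grouping heuristic, the gap threshold, and the 20% test fraction here are illustrative assumptions, not a prescribed solution):

```python
import glob
import os
import random
import re

# Assumption: GTI filenames contain an integer frame number, and frames whose
# numbers are close together belong to the same time-series track.
files = sorted(glob.glob('vehicles/GTI_Far/*.png'))

tracks, current, prev_num = [], [], None
for f in files:
    num = int(re.search(r'(\d+)', os.path.basename(f)).group(1))
    if prev_num is not None and num - prev_num > 10:  # gap => new track
        tracks.append(current)
        current = []
    current.append(f)
    prev_num = num
if current:
    tracks.append(current)

# Assign whole tracks to train or test, so near-identical frames never
# straddle the split.
random.seed(0)
random.shuffle(tracks)
n_test = int(len(tracks) * 0.2)
test_files = [f for t in tracks[:n_test] for f in t]
train_files = [f for t in tracks[n_test:] for f in t]
```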

giladgressel commented 7 years ago

Hi Ryan, I think that's a really well articulated warning/suggestion.

Can you add something similar for the P2 traffic signs data? Most people are using train_test_split there without understanding the implications. See this thread https://carnd-forums.udacity.com/questions/26217085/major-gap-between-test-acc-and-validation-acc-in-keras-traffic-sign-notebook

ryan-keenan commented 7 years ago

Hi Gilad, in fact, Brok is tackling that issue and is working today on constructing a better validation set for P2. I'll suggest also that he add a similar warning / suggestion in the lesson and project description there.