Closed kobinabrandon closed 1 month ago
Hey Brandon,
I came across your issue recently. I have a few suggestions and questions for you.
How did you conclude that it was successful only for 8000-9000 departures?
Do you think converting existing Divvy addresses->geocodes
and geocodes->Common API addresses
would work? I know that Nominatim can handle 30 million queries per day per user. If you can point me to those 400,000 datapoints, I can work on fixing that.
I'd much appreciate it if you could provide more info on the datasets you used for training.
@iamshan794 Thanks for reaching out. I wasn't expecting anyone to take notice of my little repo here :). Sorry I'm only responding now; I just saw your message.
First of all, that number (between 8,000 and 9,000 departures) was first reported by line #276 of the station indexing script (see the feature pipeline) during an earlier run. With the current data (the company updates it monthly), it will be a completely different number, for both departures and arrivals. The reason you haven't seen it is a fault of mine: I accidentally deleted a line of code at line #376, so the current version of the code never runs the matching function, and you never get the message reporting how many stations had their details recovered. It was a small but serious oversight. I've already made the correction and will push the code soon.
Do you think converting existing Divvy addresses->geocodes and geocodes->Common API addresses would work?
Yes, it will. That's essentially what I meant by "reverse geocoding" in the second paragraph of my original post, though it won't work quite the way you describe it. Reverse geocoding takes coordinates and converts them into addresses. I once attempted to do this with Photon, but it was quite slow. I'll try Nominatim and see how it goes, as I haven't tested its performance.
During implementation of the second custom indexer, I discovered that 400,000 of the 2.4M arrivals and departures (400,000 of each) had been deleted because the original data provided neither station IDs nor station names. These trips did, however, come with station coordinates. The procedure I initially designed to prevent this used coordinate proximity to match the missing details of these trips against the corresponding known details in the remaining 2M trips. It was largely a failure: it succeeded for only between 8,000 and 9,000 departures, and it was laughably unsuccessful with arrivals.
Fixing this problem would not be difficult on a technical level, because reverse geocoding could be used to obtain new addresses for the origins and destinations of these problematic trips. However, it will definitely be a very time-consuming process (and that's assuming the geocoding API provider doesn't block me). Once that's done, assigning IDs to each address will be easy. Unfortunately, implementing this workaround will cost some uniformity in the style of station names, because of differences between the names Divvy provided and those obtained from the reverse-geocoding procedure (an aesthetic inconvenience, but one nonetheless).
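The "assigning IDs will be easy" step really is trivial; a sketch, with hypothetical names, is just deduplicating the recovered addresses and handing each unique address a sequential ID:

```python
def assign_station_ids(addresses: list[str], start_id: int = 0) -> dict[str, int]:
    """Map each unique address to a sequential integer ID,
    in first-seen order. Duplicate addresses reuse their ID."""
    ids: dict[str, int] = {}
    next_id = start_id
    for address in addresses:
        if address not in ids:
            ids[address] = next_id
            next_id += 1
    return ids
```

The naming-style mismatch mentioned above is orthogonal to this step: the IDs are stable either way, but the display names from the geocoder won't match Divvy's conventions without extra normalisation.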
I would like to see this solved, however, as I have reason to believe (based on previous training runs) that in each dataset, these 400,000 trips are different enough from the rest to add useful variation to the training data and reduce model variance.