Improved fuzzy matching with warning

theOehrly / Fast-F1

FastF1 is a python package for accessing and analyzing Formula 1 results, schedules, timing data and telemetry

https://docs.fastf1.dev

MIT License

2.29k stars 239 forks source link

Improved fuzzy matching with warning #579

Closed Casper-Guo closed 2 months ago

Casper-Guo commented 2 months ago

Added test cases to the fuzzy version of get_event_by_name and reimplemented some logic to avoid false positive warnings.

I recommend considering using RapidFuzz over thefuzz since it seems to be more widely adopted and actively maintained. The performance difference is not important for our use case. The API is identical so this only requires changing the requirements. I will put it in a separate PR if you like the idea.

theOehrly commented 2 months ago

I recommend considering using RapidFuzz over thefuzz since it seems to be more widely adopted and actively maintained. The performance difference is not important for our use case. The API is identical so this only requires changing the requirements. I will put it in a separate PR if you like the idea.

I'm fine with switching to rapidfuzz if the (apparently minor) API differences aren't relevant here.

theOehrly commented 2 months ago

Also, the test failure on 3.8 minimum versions is completely unrelated. That's a CI problem. I'm fixed that just now. If you want, you can rebase onto master. But it's not really necessary, I think.

Casper-Guo commented 2 months ago

I'm fine with switching to rapidfuzz if the (apparently minor) API differences aren't relevant here.

I will PR for that once this is merged

theOehrly commented 2 months ago

Test failures are caused by the changes here. The reference_laps_data fixture that is defined in conftest.py now returns data for a different event. Somehow fastf1.get_session(2020, 'Italy', 'R') now returns the "Eifel Grand Prix". I probably shouldn't have used "Italy" as identifier in a season with three Italian grand prix. But the result should at least be one of those three. The desired result in the fixture is the "Italian Grand Prix" (Round 8). We can switch to an explicit get_session(2020, 8, 'R') there. But the matching problem still needs to be fixed.

Casper-Guo commented 2 months ago

That's super old. Italy should be 100% matched to all the Italian races that season and we tiebreak between those. I'll look into why we get a German race instead.

Casper-Guo commented 2 months ago

I see the EventSchedule dataframe's index doesn't start at 0, probably to accommodate testing.

This is inside the fuzzy_match function

print(self.iloc[[0, 1]])

Output:

RoundNumber  Country   Location                                  OfficialEventName  EventDate            EventName   EventFormat  ...    Session4               Session4Date     Session4DateUtc Session5               Session5Date     Session5DateUtc F1ApiSupport
2            1  Austria  Spielberg  FORMULA 1 ROLEX GROSSER PREIS VON ÖSTERREICH 2020 2020-07-05  Austrian Grand Prix  conventional  ...  Qualifying  2020-07-04 15:00:00+02:00 2020-07-04 13:00:00     Race  2020-07-05 15:10:00+02:00 2020-07-05 13:10:00         True
3            2  Austria  Spielberg  FORMULA 1 PIRELLI GROSSER PREIS DER STEIERMARK... 2020-07-12   Styrian Grand Prix  conventional  ...  Qualifying  2020-07-11 15:00:00+02:00 2020-07-11 13:00:00     Race  2020-07-12 15:10:00+02:00 2020-07-12 13:10:00         True

Is this expected? I will assume yes and push a fix based on that for now.

theOehrly commented 2 months ago

I see the EventSchedule dataframe's index doesn't start at 0, probably to accommodate testing.

This isn't to "accommodate" testing. But it is an artifact of the fact that testing exists in the schedule. include_testing=False filters the testing events out, but slicing in Pandas doesn't reindex the DataFrame. So the index starts at a non-zero value then. I thought a bit about whether this behaviour is desired, or whether .get_event_schedule should explicitly reindex the DataFrame before returning it. There certainly are arguments for both sides. But I think in terms of compatibility of the data with itself and for consistency within FastF1, filtering out the data and retaining the original index is better. For example, the .pick_* methods on Laps don't reindex the data either. Additionally, this would potentially be a breaking change for some users. Therefore, I don't think we really want to change that behaviour.

If you don't have any further objections, I'll merge.

Casper-Guo commented 2 months ago

No more changes needed on my front. Please merge if it looks good to you

theOehrly commented 2 months ago

Great, this is overall a pretty nice improvement 👍

Casper-Guo commented 2 months ago

I will open another one to switch the fuzzing library