Closed Casper-Guo closed 2 months ago
I recommend considering using RapidFuzz over thefuzz since it seems to be more widely adopted and actively maintained. The performance difference is not important for our use case. The API is identical so this only requires changing the requirements. I will put it in a separate PR if you like the idea.
I'm fine with switching to rapidfuzz if the (apparently minor) API differences aren't relevant here.
Also, the test failure on 3.8 minimum versions is completely unrelated. That's a CI problem. I fixed that just now. If you want, you can rebase onto master. But it's not really necessary, I think.
I will open a PR for the rapidfuzz switch once this is merged.
Test failures are caused by the changes here. The reference_laps_data fixture that is defined in conftest.py now returns data for a different event.
Somehow fastf1.get_session(2020, 'Italy', 'R') now returns the "Eifel Grand Prix". I probably shouldn't have used "Italy" as the identifier in a season with three Italian Grands Prix. But the result should at least be one of those three.
The desired result in the fixture is the "Italian Grand Prix" (Round 8). We can switch to an explicit get_session(2020, 8, 'R') there. But the matching problem still needs to be fixed.
That's super odd. "Italy" should be a 100% match for all the Italian races that season, and we tiebreak between those. I'll look into why we get a German race instead.
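For illustration only, the intended behaviour (take the best-scoring events, then tiebreak among them) could be sketched as below. This is not FastF1's implementation: the schedule is a hypothetical four-event excerpt, match_event and similarity are made-up names, the standard library's difflib stands in for the fuzzy-matching library, and breaking ties on the event name is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of the 2020 schedule:
# (round number, country, location, event name)
EVENTS = [
    (8, "Italy", "Monza", "Italian Grand Prix"),
    (9, "Italy", "Mugello", "Tuscan Grand Prix"),
    (11, "Germany", "Nürburgring", "Eifel Grand Prix"),
    (13, "Italy", "Imola", "Emilia Romagna Grand Prix"),
]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib as a stand-in)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_event(query, events=EVENTS):
    """Return the event that best matches the query string."""
    # Step 1: score each event by its best-matching text field.
    scored = [
        (max(similarity(query, field) for field in event[1:]), event)
        for event in events
    ]
    best_score = max(score for score, _ in scored)
    candidates = [event for score, event in scored if score == best_score]

    # Step 2 (assumed tiebreak): among equally good candidates, prefer
    # the one whose event name is most similar to the query.
    return max(candidates, key=lambda event: similarity(query, event[3]))


result = match_event("Italy")
print(result)
```

With this toy data, "Italy" matches the Country field of Rounds 8, 9 and 13 exactly, and the event-name tiebreak then selects Round 8, the "Italian Grand Prix". The German event never enters the candidate set.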
I see the EventSchedule dataframe's index doesn't start at 0, probably to accommodate testing.
This is inside the fuzzy_match function:
print(self.iloc[[0, 1]])
Output:
RoundNumber Country Location OfficialEventName EventDate EventName EventFormat ... Session4 Session4Date Session4DateUtc Session5 Session5Date Session5DateUtc F1ApiSupport
2 1 Austria Spielberg FORMULA 1 ROLEX GROSSER PREIS VON ÖSTERREICH 2020 2020-07-05 Austrian Grand Prix conventional ... Qualifying 2020-07-04 15:00:00+02:00 2020-07-04 13:00:00 Race 2020-07-05 15:10:00+02:00 2020-07-05 13:10:00 True
3 2 Austria Spielberg FORMULA 1 PIRELLI GROSSER PREIS DER STEIERMARK... 2020-07-12 Styrian Grand Prix conventional ... Qualifying 2020-07-11 15:00:00+02:00 2020-07-11 13:00:00 Race 2020-07-12 15:10:00+02:00 2020-07-12 13:10:00 True
Is this expected? I will assume yes and push a fix based on that for now.
This isn't to "accommodate" testing. But it is an artifact of the fact that testing exists in the schedule. include_testing=False filters the testing events out, but slicing in pandas doesn't reindex the DataFrame. So the index starts at a non-zero value then.
I thought a bit about whether this behaviour is desired, or whether .get_event_schedule should explicitly reindex the DataFrame before returning it. There certainly are arguments for both sides. But I think in terms of compatibility of the data with itself and for consistency within FastF1, filtering out the data and retaining the original index is better. For example, the .pick_* methods on Laps don't reindex the data either. Additionally, this would potentially be a breaking change for some users. Therefore, I don't think we really want to change that behaviour.
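A small pandas sketch of the effect described above, using a hypothetical and heavily reduced schedule (the real EventSchedule has many more columns, and the filter here is only an assumed stand-in for include_testing=False):

```python
import pandas as pd

# Hypothetical schedule: two testing events followed by two race weekends.
schedule = pd.DataFrame({
    "RoundNumber": [0, 0, 1, 2],
    "EventName": [
        "Pre-Season Test 1",
        "Pre-Season Test 2",
        "Austrian Grand Prix",
        "Styrian Grand Prix",
    ],
})

# Filtering keeps the original index labels, so the index no longer starts at 0.
races = schedule[schedule["RoundNumber"] > 0]
index_labels = list(races.index)

# Positional access still works from 0; label-based access needs the old labels.
first_by_position = races.iloc[0]["EventName"]
first_by_label = races.loc[2]["EventName"]

# Reindexing is an explicit, separate step that discards the original labels.
reindexed = races.reset_index(drop=True)

print(index_labels, first_by_position, list(reindexed.index))
```

Code that works on the filtered schedule therefore has to choose deliberately between .iloc (position) and .loc (original label). Mixing the two is an easy way to end up with the wrong event.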
If you don't have any further objections, I'll merge.
No more changes needed on my front. Please merge if it looks good to you
Great, this is overall a pretty nice improvement 👍
I will open another one to switch the fuzzing library
Added test cases to the fuzzy version of get_event_by_name and reimplemented some logic to avoid false positive warnings.