jolpica / jolpica-f1

Apache License 2.0

Incorrect lap times in 2023, Round 9, Austrian GP #13

Open theOehrly opened 6 months ago

theOehrly commented 6 months ago

Some lap times in the 2023 Austrian GP are incorrect. Specifically, the following combinations of driver/lap number have incorrect lap times:

alonso: 48
bottas: 46
de_vries: 46
gasly: 57
hamilton: 49
kevin_magnussen: 45
norris: 50
ocon: 46
piastri: 49
russell: 48
sainz: 54
stroll: 48
tsunoda: 37
zhou: 50

Notably, all these lap times are the personal best lap times of some other driver, and they were set on that exact lap.

Example

Alonso is listed with a lap time of "1:08.739" on lap 48. The correct time would be "1:09.634". But "1:08.739" was Norris' fastest lap, and Norris set that time on lap 48. Norris' lap time on lap 48 is correct, so the lap time is duplicated, not swapped.
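Duplicates of this kind can be flagged mechanically. The following is a minimal sketch, assuming lap times are available as (driver, lap, time) tuples. The function name `find_shared_lap_times` and the filler row are hypothetical; only the Alonso/Norris values come from this report. Two drivers setting identical times to the millisecond on the same lap is possible but rare, so every hit should be treated as a candidate for manual review, not as a confirmed error.

```python
from collections import defaultdict


def find_shared_lap_times(laps):
    """Return {(lap, time): [drivers]} for every time shared by 2+ drivers on one lap."""
    drivers_by_key = defaultdict(list)
    for driver, lap, time in laps:
        drivers_by_key[(lap, time)].append(driver)
    return {
        key: sorted(drivers)
        for key, drivers in drivers_by_key.items()
        if len(drivers) > 1
    }


laps = [
    ("norris", 48, "1:08.739"),  # correct value, from this report
    ("alonso", 48, "1:08.739"),  # duplicated value, from this report
    ("alonso", 47, "1:09.700"),  # made-up filler row
]

suspicious = find_shared_lap_times(laps)
print(suspicious)
```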

Guess

The fastest lap times are specially highlighted in the source PDF. I assume this may have caused an error when the PDF was parsed for old Ergast, and that the incorrect data was then imported from there via an old database dump.

[screenshot of the source PDF]

What speaks against this theory is that the data is correct in current production Ergast, yet it was seemingly never reported as an error there.

harningle commented 5 months ago

I haven't looked at the Ergast code very carefully. There seem to be two sources for lap times: "Race History Chart" and "Race Lap Analysis", which is what your screenshot shows.

My current parsing uses the Race History Chart, and the results are the same as in the Ergast CSV database.

You can check my code at https://github.com/harningle/fia-doc/blob/main/parse_race_history_chart.py and https://github.com/harningle/fia-doc/blob/main/notebook/cross_validate.ipynb

harningle commented 5 months ago

As a side note, we can do a lot of automated sanity checks. E.g., pit stops appear as "P" in "Race Lap Analysis" and also in "Pit Stop Summary". The parsing results from both PDFs should be the same. I feel like this could be a data quality test once we put the code in production.
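A minimal sketch of such a cross-document check, assuming both PDFs have already been parsed into (driver, lap) pairs. The function name `compare_pit_stops` and all sample values are made up for illustration. It also assumes that the "P" marker and the pit stop summary refer to the same lap number, which would need to be verified against the actual documents.

```python
def compare_pit_stops(lap_analysis_stops, pit_summary_stops):
    """Return the (driver, lap) pairs that appear in only one of the two sources."""
    from_laps = set(lap_analysis_stops)
    from_summary = set(pit_summary_stops)
    return {
        "only_in_lap_analysis": sorted(from_laps - from_summary),
        "only_in_pit_summary": sorted(from_summary - from_laps),
    }


# Made-up sample data: the two sources disagree on one stop.
lap_analysis_stops = [("alonso", 15), ("norris", 14), ("sainz", 16)]
pit_summary_stops = [("alonso", 15), ("norris", 14), ("sainz", 17)]

diff = compare_pit_stops(lap_analysis_stops, pit_summary_stops)
print(diff)
```

An empty list under both keys means the two documents agree; anything else points at a parsing problem in one of them.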

theOehrly commented 5 months ago

@harningle the data on current Ergast is correct. My speculation is that this was imported in an old database dump where it was incorrect. And given how PDFs are structured internally (or not really structured at all) and how the old Ergast PDF parser works, I think it may be possible that this was originally parsed incorrectly and then manually corrected at some point.

The double/sanity checks would certainly be great to have.

jolpica commented 5 months ago

I've found that this is caused by incorrect data in the Ergast results table. In the new database schema, we no longer duplicate lap times and fastest lap times, so we choose the fastest lap time instead of the laps table time when it's available.

This query to the results endpoint for Fernando Alonso, https://ergast.com/api/f1/2023/9/drivers/alonso/results.json, returns:

...
"FastestLap": {
  "rank": "4",
  "lap": "48",
  "Time": {
    "time": "1:08.739"
  },
...

This is the time that is listed as Alonso's lap 48.
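To illustrate the mechanism, here is a minimal sketch of how preferring the fastest lap from the results table can overwrite a correct entry from the laps table. This is not the actual import code; the function name `merge_lap_times` and the dictionary layout are hypothetical. The times are the ones quoted in this thread.

```python
def merge_lap_times(laps_table, fastest_laps):
    """Merge lap times, letting a fastest-lap entry win over the laps table.

    laps_table:   {(driver, lap): time}
    fastest_laps: {driver: (lap, time)}
    """
    merged = dict(laps_table)
    for driver, (lap, time) in fastest_laps.items():
        # The fastest-lap value replaces whatever the laps table had.
        merged[(driver, lap)] = time
    return merged


laps_table = {
    ("alonso", 48): "1:09.634",  # correct time
    ("norris", 48): "1:08.739",  # correct time
}
fastest_laps = {
    "alonso": (48, "1:08.739"),  # incorrect entry from the results table
    "norris": (48, "1:08.739"),
}

merged = merge_lap_times(laps_table, fastest_laps)
print(merged[("alonso", 48)])
```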

I'm not sure there's much that can be done about this until we are live and can perform corrections of the Ergast data. Could you check whether you can confirm these findings?

theOehrly commented 5 months ago

@jolpica I agree, that seems to be the problem.

And to correct my previous statement, this is NOT fixed in Ergast currently. The lap times are correct on the /laps endpoint. But the fastest lap returned by the /results endpoint is incorrect.
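The discrepancy between the two endpoints can be detected automatically. The sketch below assumes already-fetched JSON responses that follow the Ergast layout (`MRData` → `RaceTable` → `Races`), trimmed here to the fields that are used. The function name `fastest_lap_mismatches` is hypothetical, and the sample payloads contain only the Alonso values quoted in this thread.

```python
def fastest_lap_mismatches(results_payload, laps_payload):
    """Return (driver, lap, results_time, laps_time) where the endpoints disagree."""
    # Index the laps payload as {(driverId, lap number): time}.
    lap_times = {}
    for race in laps_payload["MRData"]["RaceTable"]["Races"]:
        for lap in race["Laps"]:
            for timing in lap["Timings"]:
                lap_times[(timing["driverId"], lap["number"])] = timing["time"]

    mismatches = []
    for race in results_payload["MRData"]["RaceTable"]["Races"]:
        for result in race["Results"]:
            fastest = result.get("FastestLap")
            if fastest is None:
                continue  # no fastest lap recorded for this driver
            driver = result["Driver"]["driverId"]
            laps_time = lap_times.get((driver, fastest["lap"]))
            results_time = fastest["Time"]["time"]
            if laps_time is not None and laps_time != results_time:
                mismatches.append((driver, fastest["lap"], results_time, laps_time))
    return mismatches


# Trimmed sample payloads; only the fields used above are included.
results_payload = {"MRData": {"RaceTable": {"Races": [{"Results": [
    {"Driver": {"driverId": "alonso"},
     "FastestLap": {"rank": "4", "lap": "48", "Time": {"time": "1:08.739"}}},
]}]}}}
laps_payload = {"MRData": {"RaceTable": {"Races": [{"Laps": [
    {"number": "48", "Timings": [{"driverId": "alonso", "time": "1:09.634"}]},
]}]}}}

print(fastest_lap_mismatches(results_payload, laps_payload))
```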

I also agree that it is probably best to hold off on fixing this until we are independent of Ergast and import our own data.