llui85 / olympics-tokyo2020

A scraper for the Tokyo 2020 Olympics designed to download "complete" data.
Mozilla Public License 2.0
9 stars 0 forks

There seems to be an issue in the format of the value of some sports #1

Open reallyyy opened 10 months ago

reallyyy commented 10 months ago

For example, here is the data for "Women's 100m Breaststroke". Some participants' times are as low as about 30 seconds; for context, the times according to Google are around 1:04 - 1:06, depending on the participant in question.

| eventTile | value | participantName |
| --- | --- | --- |
| Heat 2 | 31.77 | Dalma Sebestyen |
| Heat 2 | 31.86 | Remedy Rule |
| Heat 3 | 32.42 | Erin Gallagher |
| Heat 4 | 36.08 | Benedetta Pilato |
| Heat 2 | 40.94 | Claudia Verdino |

The values for "Men's Marathon" are also questionable. Here are some examples:

| eventTile | value | participantName |
| --- | --- | --- |
| Men's Marathon Final | 10:09 | Cameron Levins |
| Men's Marathon Final | 10:52 | Ivan Zarco Alvarez |
| Men's Marathon Final | 11:28 | Yuma Hattori |
| Men's Marathon Final | 12:07 | Christian Pacheco |
| Men's Marathon Final | 12:07 | Hassan Chahdi |
| Men's Marathon Final | 15:36 | Stephen Scullion |
| Men's Marathon Final | 15:44 | Mykola Nyzhnyk |
| Men's Marathon Final | 15:48 | Lemawork Ketema |
| Men's Marathon Final | 16:12 | Oleksandr Sitkovskiy |

The names of the participants are right, but the values are wrong. You can't run a marathon in 10 minutes, and 10 hours is too long. The average time is about 2 - 3 hours or so.

llui85 commented 10 months ago

Hi @reallyyy

All the data is what was returned from the OBS server at the time of scraping; no processing or changes were made to the data at all. There may well be inaccuracies, but that's concerning if there are. I have not audited the data or carried out sanity checks like you seem to have done.

Where are you finding this information? The way events are laid out is a little confusing; from memory, there are many different data types for different sections of the event: SubEventUnit, Stage, Result, Phase, Event, and EventUnit. Is it possible that the data you're finding is only partial, perhaps for one section of the measurements taken? (e.g. the first 50 metres in a swimming race could be timed separately, as one of several legs?)

The USDF messages may also be useful for troubleshooting.

reallyyy commented 10 months ago

> Where are you finding this information?

```python
import json

# olympicsData is assumed to be the dataset already loaded into a dict.
disciplines = olympicsData["Discipline"]
events = olympicsData["Event"]

SportsData = []
for itemId, item in events.items():
    # Follow the event's relationship to its discipline record.
    disciplineId = item["relationships"]["discipline"]["data"]["id"]
    discipline = disciplines[disciplineId]
    if discipline["attributes"]["name"] in ["Athletics", "Swimming", "Weightlifting"]:
        SportsData.append({
            "name": item["attributes"]["name"],
            "id": item["attributes"]["externalId"],
            "disciplineName": discipline["attributes"]["name"],
        })

print(json.dumps(SportsData, indent=4))
```

- I use most of your code and keep the same format, for the most part.
- The problem is that, using the same code, only a fraction of the sports are wrong; in my case they are all sports related to timing. Not all of the timed sports are wrong: there are running events with the right data, for example `Men's 1500m running` or `Men's 5000m running`. And there are swimming events where only a few rows are wrong, for example:

Men's 200m Backstroke

| | eventTile | value | participantName |
| --- | --- | --- | --- |
| 57 | Final | 1:51.25 | Kristof Milak |
| 813 | Final | 1:53.27 | Evgeny Rylov |
| 60 | Final | 1:53.73 | Tomoru Honda |
| 790 | Final | 1:54.15 | Ryan Murphy |
| 539 | Heat 4 | 1:54.44 | Kuan-Hung Wang |
| .. | ... | ... | ... |
| 842 | Heat 1 | 2:17.40 | Izaak Bastian |
| 849 | Heat 1 | 2:17.51 | Julio Horrego |
| 833 | Heat 1 | 2:20.09 | Arnoldo Herrera |
| 822 | Heat 1 | 2:23.22 | Abdulaziz Al-Obaidly |
| 684 | Heat 5 | 32.11 | Haiyang Qin |


As you can see, only the last row is wrong.

llui85 commented 10 months ago

Ah, I see what's happening.

For the men's marathon (event unit ID f0a359cc-d859-3865-a7e3-ab3b6f68eddf), there are 106 Competitor records, but 1014 Results. Side note: Don't match on externalId, use the relationships that already exist instead.
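
A rough way to see this kind of mismatch is to count Result records per related competitor. The sketch below is only illustrative: it assumes Result records carry a `competitor` relationship shaped like the `discipline` relationship in the snippet above, and both that relationship name and the sample records are made up for the example rather than taken from the dataset.

```python
from collections import Counter


def results_per_competitor(results):
    """Count Result records per related competitor id.

    `results` maps a result id to its record. The "competitor"
    relationship name is an assumption, not a confirmed field.
    """
    counts = Counter()
    for record in results.values():
        related = record.get("relationships", {}).get("competitor", {}).get("data")
        if related:
            counts[related["id"]] += 1
    return counts


# Made-up sample: competitor c1 has two Result frames, c2 has one.
sample = {
    "r1": {"relationships": {"competitor": {"data": {"id": "c1"}}}},
    "r2": {"relationships": {"competitor": {"data": {"id": "c1"}}}},
    "r3": {"relationships": {"competitor": {"data": {"id": "c2"}}}},
}
counts = results_per_competitor(sample)
print(counts)
```

Any competitor with a count above one has more than a single Result attached, so a one-row-per-competitor assumption would not hold for that unit.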

I've attached a CSV of this subset that should help you understand what's going wrong here immediately. mensmarathon.csv

The thing I think is being missed here is that a Result is not final: there are unofficial, partial, and final official results. To quote from the ODF spec (the data from this repo is a parsed form of the ODF spec for the most part, although not always identical in structure):

The ‘Results’ message, DT_RESULT is the key message for all competition information and is available for every unit. This message is:

  • used to provide the start list before the start of the unit;
  • updated continuously throughout the unit with results; and
  • sent with the unofficial and official results when the unit is over.

So in this case:

For a marathon, there would be 10 intermediate frame results sent at different checkpoints. For 100m swimming, a frame would be sent for each lap, which matches up with the data that you were seeing.
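
One possible workaround, if the intermediate frames are cumulative split times, is to keep only the largest time per event and participant. This is a minimal sketch under that assumption, not something the dataset guarantees. It reuses the `eventTile`, `value` and `participantName` column names from the tables in this thread; the helper names and the sample rows are invented for illustration.

```python
def to_seconds(value):
    """Convert "31.77", "1:07.50" or "2:08:38" into seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def final_times(rows):
    """Keep the row with the largest time per (eventTile, participantName).

    Assumes intermediate frames are cumulative, so the largest
    value is the finishing time.
    """
    best = {}
    for row in rows:
        key = (row["eventTile"], row["participantName"])
        if key not in best or to_seconds(row["value"]) > to_seconds(best[key]["value"]):
            best[key] = row
    return list(best.values())


# Made-up rows: Swimmer A has a 50m split and a finishing time.
rows = [
    {"eventTile": "Heat 2", "value": "31.77", "participantName": "Swimmer A"},
    {"eventTile": "Heat 2", "value": "1:07.50", "participantName": "Swimmer A"},
    {"eventTile": "Heat 2", "value": "1:06.90", "participantName": "Swimmer B"},
]
finals = final_times(rows)
for row in finals:
    print(row["participantName"], row["value"])
```

If the Result records expose an explicit official or unofficial status, filtering on that would be more reliable than comparing times.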

llui85 commented 10 months ago

Also, I have data from the 2020 Paralympics & 2022 Beijing Olympics/Paralympics in the same format that I never got around to uploading to Kaggle, if you'd like it.

reallyyy commented 10 months ago

Wow, thank you so much for the fast reply and for spending time exploring the issue. I am so grateful for the support. This is my first time working with such a big dataset, so I am somewhat lost. Now that you have said it, I went back and checked the code, and it's true, as you said, that I made the mistake of assuming that for any sport the number of records and Results should be the same. When I run my code for Women's 4 x 100m Medley Relay, for example, the Sweden team in the final has about 10 different data points.

> Also, I have data from the 2020 Paralympics & 2022 Beijing Olympics/Paralympics in the same format that I never got around to uploading to Kaggle, if you'd like it.

I would like it, if uploading doesn't take too much of your time. With all the data I have, I already have roughly what I need, but more data points will make the point more concrete. Otherwise, the support you have already given me is kind enough.