Open jabadia opened 6 years ago
Hi,
The split times are not available as far as I can tell. The API is fairly well hidden and not documented at all, though. Please let me know if you find anything.
Also, here's an analysis I made of the data: http://stappit.github.io/posts/berlin_marathon/age.html
Cheers
On Thu, 6 Sep 2018, 00:00 Javier Abadía, notifications@github.com wrote:
hi,
great job here! have you tried getting the 5km split times? any luck with that?
thanks!
so cool!
Well, I poked around the web page a bit and discovered that it shows the split times when you click on any runner. So there is an API to get the splits, but unfortunately it only gives you the split times for one runner at a time.
I wrote the function below to scrape them. The problem is that there are tens of thousands of runners, so I restricted my scraping to the subset of runners I'm interested in (age_class=40) for only one year (I modified your scrape.py to get only the 2017 results).
Also, I used grequests, a wrapper around requests that allows sending many requests in parallel. I tried different concurrency values and found that you can go up to around 100 concurrent requests, which speeds up the process a lot. In my development environment it took approximately 12 minutes to scrape 6.5K runners.
import time

import grequests
import pandas as pd
from lxml import etree


def download_split_times(dirty_filename, splits_file):
    df = pd.read_csv(dirty_filename).sort_values(['year', 'id']).reset_index(drop=True)
    # convert time in HH:MM:SS to a float representing minutes
    df['net_minutes'] = pd.to_timedelta(df['net_time']) / pd.Timedelta(minutes=1)
    df['clock_minutes'] = pd.to_timedelta(df['clock_time']) / pd.Timedelta(minutes=1)
    # print(df.head())

    participants_by_age_class = df.age_class.value_counts().sort_index()
    print(participants_by_age_class)

    # today, I'm interested only in participants in my same age class
    same_age_participants = df[df['age_class'] == '40'].reset_index()
    print(same_age_participants.shape)

    # we need to send one request per participant :-(
    # like this one https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php?t=BM_2017&m=d&pid=11995
    url = "https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php"
    params = {
        't': 'BM_2017',  # common params
        'm': 'd'
    }
    rs = [
        grequests.get(url, params=dict(params, pid=participant_id))  # combine with participant id
        for participant_id in same_age_participants.id
    ]

    t0 = time.time()
    results = grequests.imap(rs, size=100)  # size = max parallel requests, don't put too many
    count = 0
    for result in results:
        doc = etree.HTML(result.text)  # unfortunately the response is an html fragment we need to parse
        split_time_headers = doc.xpath("//div[@class='gridResultsDetailHead']")
        split_headers = [element.text for element in split_time_headers]  # ['5 km', '10 km', '15 km', '20 km', '21,1 km' ... '40 km']
        split_time_divs = doc.xpath("//div[@class='gridResultsDetailBody']")
        split_times = [element.text for element in split_time_divs]  # ['00:14:29', '00:29:04', ...]
        participant_id = int(result.request.url.split('&pid=')[-1])
        count += 1
        print(count, participant_id, ' '.join(split_times))
        for distance, split_time in zip(split_headers, split_times):
            same_age_participants.loc[same_age_participants.id == participant_id, distance] = \
                pd.Timedelta(split_time) / pd.Timedelta(minutes=1)  # store a float with minutes
    t1 = time.time()
    print("time taken: %.1f sec" % (t1 - t0,))

    same_age_participants.to_csv(splits_file, index=False)


if __name__ == '__main__':
    download_split_times('data/berlin_marathon_times_dirty.csv', 'data/with_split.csv')
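Two steps in that function look a bit fragile to me: recovering the participant id by splitting the URL on '&pid=' (which relies on pid being the last query parameter), and converting a split string to minutes when a runner has no time at a checkpoint. Here is a minimal stdlib-only sketch of both. The helper names pid_from_url and split_to_minutes are hypothetical, and I am only assuming that a missing split shows up as None, an empty string, or some placeholder text; I have not checked what the site actually returns in that case.

```python
from urllib.parse import urlparse, parse_qs


def pid_from_url(url):
    """Return the `pid` query parameter as an int, wherever it sits in the URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["pid"][0])


def split_to_minutes(text):
    """Convert 'HH:MM:SS' to float minutes; return None for anything else.

    Assumption: a missing split is None, empty, or a placeholder string,
    none of which look like HH:MM:SS.
    """
    if not text:
        return None
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    hours, minutes, seconds = (int(p) for p in parts)
    return hours * 60 + minutes + seconds / 60


url = ("https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/"
       "ajax.results.php?t=BM_2017&m=d&pid=11995")
print(pid_from_url(url))             # the pid from the example request above
print(split_to_minutes("00:14:29"))  # roughly 14.48 minutes
print(split_to_minutes("-"))         # placeholder text is treated as missing
```

Returning None for a missing split keeps the column numeric once it is stored in the DataFrame, since pandas will hold it as NaN.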