stappit / berlin-marathon

Some scripts to scrape and clean Berlin Marathon data.

split data #1

Open jabadia opened 6 years ago

jabadia commented 6 years ago

Hi,

Great job here! Have you tried getting the 5 km split times? Any luck with that?

Thanks!

stappit commented 6 years ago

Hi,

The split times are not available as far as I can tell. The API is fairly hidden and not documented at all, though. Please let me know if you find anything.

Also, here's an analysis I made of the data: http://stappit.github.io/posts/berlin_marathon/age.html

Cheers

jabadia commented 6 years ago

So cool!

Well, I poked around the web page a bit and discovered that the split times are shown when you click on any runner. So there is an API to get the splits, but unfortunately it only returns the split times for one runner at a time.
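
For reference, fetching a single runner's splits with plain requests looks roughly like this sketch (same endpoint, parameters, and div class names as in the full function below; pid 11995 is just the example id from the sample URL in the comments):

import requests
from lxml import etree

# endpoint and parameters as used in the full function below
url = "https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php"
resp = requests.get(url, params={'t': 'BM_2017', 'm': 'd', 'pid': 11995})

# the response is an HTML fragment; pull out the split headers and times
doc = etree.HTML(resp.text)
headers = [e.text for e in doc.xpath("//div[@class='gridResultsDetailHead']")]
times = [e.text for e in doc.xpath("//div[@class='gridResultsDetailBody']")]
print(dict(zip(headers, times)))  # e.g. {'5 km': '00:14:29', ...}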

I wrote this function to scrape them. However, the problem is that there are tens of thousands of runners, so I restricted my scraping to the subset of runners I'm interested in (age_class=40) and to a single year (I modified your scrape.py to fetch only the 2017 results).

Also, I used grequests, a wrapper around requests that allows sending many requests in parallel. I tried different concurrency values and found that you can go up to around 100 concurrent requests, which speeds up the process a lot. In my development environment it took approximately 12 minutes to scrape 6.5K runners.

import time

import grequests
import pandas as pd
from lxml import etree

def download_split_times(dirty_filename, splits_file):
    df = pd.read_csv(dirty_filename).sort_values(['year', 'id']).reset_index(drop=True)

    df['net_minutes'] = pd.to_timedelta(df['net_time']) / pd.Timedelta(minutes=1)  # convert time in HH:MM:SS to a float representing minutes
    df['clock_minutes'] = pd.to_timedelta(df['clock_time']) / pd.Timedelta(minutes=1)

    # print(df.head())

    participants_by_age_class = df.age_class.value_counts().sort_index()
    print(participants_by_age_class)

    # today, I'm interested only in participants in my same age class
    same_age_participants = df[df['age_class'] == '40'].reset_index()
    print(same_age_participants.shape)

    # we need to send one request per participant :-(
    # like this one https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php?t=BM_2017&m=d&pid=11995
    url = "https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php"
    params = {
        't': 'BM_2017',     # common params
        'm': 'd'
    }
    rs = [
        grequests.get(url, params=dict(params, pid=participant_id))  # combine with participant id
        for participant_id in same_age_participants.id
    ]

    t0 = time.time()
    results = grequests.imap(rs, size=100)  # size = max parallel requests, don't put too many
    count = 0
    for result in results:
        doc = etree.HTML(result.text)  # unfortunately the response is an html fragment we need to parse
        split_time_headers = doc.xpath("//div[@class='gridResultsDetailHead']")
        split_headers = [element.text for element in split_time_headers]  # ['5 km', '10 km', '15 km', '20 km', '21,1 km' ... '40 km']
        split_time_divs = doc.xpath("//div[@class='gridResultsDetailBody']")
        split_times = [element.text for element in split_time_divs]  # ['00:14:29', '00:29:04', ...]

        participant_id = int(result.request.url.split('&pid=')[-1])
        count += 1
        print(count, participant_id, ' '.join(split_times))
        for distance, split_time in zip(split_headers, split_times):
            same_age_participants.loc[same_age_participants.id == participant_id, distance] = \
                pd.Timedelta(split_time) / pd.Timedelta(minutes=1)  # store a float with minutes

    t1 = time.time()
    print("time taken: %.1f sec" % (t1-t0,))
    same_age_participants.to_csv(splits_file, index=False)

if __name__ == '__main__':
    download_split_times('data/berlin_marathon_times_dirty.csv', 'data/with_split.csv')
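
And a quick sanity check of the output, assuming the column names produced by the function above (the split columns such as '5 km' ... '40 km' hold cumulative minutes as floats):

import pandas as pd

# load the CSV written by download_split_times above
splits = pd.read_csv('data/with_split.csv')
print(splits[['id', 'net_minutes', '5 km', '40 km']].head())

# the cumulative 40 km split should always be below the net finish time
print((splits['40 km'] < splits['net_minutes']).mean())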