johnwalley / bumps-results

Cambridge and Oxford Bumps results
Creative Commons Attribution 4.0 International

Are individual crew results across multiple years a more complete record? #5

Open johnwalley opened 4 years ago

johnwalley commented 4 years ago

See http://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/ball/ball_m1t.txt for an example. In particular it contains data on whether a crew raced on a particular day.

johnwalley commented 11 months ago

See https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/linc/linc_m1e.txt and image

Lincoln I do not race for a few days (and drop down one place each day), but the fact they didn't race is represented in the chart.

johnwalley commented 11 months ago

For the same chart https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/sjoh/sjoh_m1e.txt shows St John's I dropping out by a 999 entry.

Brasenose II (https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/bras/bras_m2e.txt) don't race for five days but have a position each day. They then race for one day before dropping out!

johnwalley commented 11 months ago

Which suggests there's merit in capturing on a per day and per crew basis whether the crew raced or dropped out?

mcshane-fire commented 11 months ago

I agree.

Some crews clearly drop out, never to race again. I'd assume the next day all crews behind them effectively start one station up, so there isn't a gap left in the start order. Do you think this is correct? Either that division has one fewer racing crew, or one crew gets moved up from each division so the bottom division has one fewer crew. I think this happened relatively recently in Cambridge, with a crew getting a penalty that resulted in them not racing again, but I'd need to find out when that was (or if I imagined it). In the Eights results, as soon as they have more than one division (1874), I can't see any higher division crews not racing each day.

If a crew doesn't race that day, but does race later on, I'd assume the same happens on the days they didn't race: all crews behind start one station up, so there isn't a gap left. The Eights rules changed in 1841: from then on you lose one place each day you don't race, whereas previously you went to the bottom. This is shown differently from when a crew never races again: we see where the crew would have been in the start order, but if the crew withdraws we end their line with a dot. I'm assuming that when a crew doesn't race that day, we don't know whether they'll be coming back later; it's only when racing is completed that we would know whether we should show their virtual position in a bumps chart. In Anu's text results, this means that each day a crew doesn't race we don't know whether to put 999 position results, or track +1 each day with a -1 in the flags column, until the end of the set.

I'm not sure whether this is a problem or not. I'd almost prefer to represent the two cases the same way, so that even if the crew doesn't race again you still show their virtual start position. That way you can do the results incrementally each day and not change previous results based on the next day of racing. If that's the way to do it, then for the tg_format you need a variant of the 'e' code meaning "move this crew exactly this many places, but they didn't actually race"; something like 'v', for "they virtually moved places".

mcshane-fire commented 11 months ago

Having tried to do this, I run into a problem - I end up with Lincoln ahead of Christ Church II on the last day, whereas the chart above is the other way around. I can't quite figure out where the difference comes from. So maybe I should abandon this idea, and just think how to reproduce the kind of chart above...

mcshane-fire commented 11 months ago

So using 'x' as a code for withdrawing completely, and 'v' for the virtual move but not racing that day, here's the result I get:

eights1847_men

eights1847_men.txt
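As a rough illustration of how the two codes might be assigned, the sketch below labels each day for one crew from its per-day positions and flags. It assumes, following the conventions in Anu's text results mentioned earlier in this thread, that a position of 999 marks a withdrawal and a flag of -1 marks a day not raced. The function name, the "raced" placeholder and the sample numbers are all hypothetical; none of them come from tg_format or from a real set of results.

```python
WITHDRAWN = 999   # position value assumed to mean the crew has dropped out
NOT_RACED = -1    # flag value assumed to mean the crew did not race that day


def label_days(positions, flags):
    """Label each day 'x' (withdrew), 'v' (virtual move) or 'raced'.

    positions: start position followed by the position after each day.
    flags: one flag per day.
    """
    labels = []
    for end_position, flag in zip(positions[1:], flags):
        if end_position == WITHDRAWN:
            labels.append("x")
            break  # nothing further to record once the crew has withdrawn
        labels.append("v" if flag == NOT_RACED else "raced")
    return labels


# Made-up crew: misses two days, races once, then withdraws.
print(label_days([5, 6, 7, 6, 999], [-1, -1, 0, 0]))
# ['v', 'v', 'raced', 'x']
```

Because each label depends only on that day's position and flag, the labels for earlier days never need revisiting when a later day is added, which is the incremental property argued for above.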

mcshane-fire commented 11 months ago

I think the earliest ad_results file in this repo for Eights is 1892, and Torpids is 1900. Is there a reason for not having the earlier ones? In https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/e1847/e1847m.txt we don't have the flags to indicate which days should be shown as not raced (i.e. Brasenose II is shown as '0 1 0 0 2 1 2 -99'). Is this why you didn't include them?

If the 'not raced' flags are only in the per-crew history, then do we need to use these as the source and generate the tg_results data from that? For example, https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/bras/bras_m2e.txt has the full information: 1847 8 15 15 14 14 14 12 11 9 999 -1 -1 -1 -1 -1 0 0 0
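To make the relationship between the two files concrete, here is a minimal sketch that parses the Brasenose II line quoted above and derives the per-day changes from it. The field layout (year, number of days, a start position plus one position per day, then one flag per day) is inferred from that single line, and the use of -99 for a withdrawal is taken from the e1847m.txt string quoted above; both are assumptions about Anu's format rather than a documented specification. The class and function names are hypothetical.

```python
from dataclasses import dataclass

WITHDRAWN = 999  # position value assumed to mean the crew has dropped out


@dataclass
class CrewYear:
    year: int
    days: int
    positions: list  # start position followed by one position per day
    flags: list      # one flag per day; -1 is assumed to mean "did not race"


def parse_per_crew_line(line):
    """Split one per-crew line into year, day count, positions and flags."""
    fields = [int(f) for f in line.split()]
    year, days = fields[0], fields[1]
    expected = 2 + (days + 1) + days
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    positions = fields[2:3 + days]
    flags = fields[3 + days:]
    return CrewYear(year, days, positions, flags)


def daily_changes(record):
    """Places gained per day (positive = moved up), -99 on withdrawal."""
    changes = []
    for before, after in zip(record.positions, record.positions[1:]):
        changes.append(-99 if after == WITHDRAWN else before - after)
    return changes


record = parse_per_crew_line(
    "1847 8 15 15 14 14 14 12 11 9 999 -1 -1 -1 -1 -1 0 0 0")
print(daily_changes(record))  # [0, 1, 0, 0, 2, 1, 2, -99]
print(record.flags)           # [-1, -1, -1, -1, -1, 0, 0, 0]
```

The derived changes match the '0 1 0 0 2 1 2 -99' string from the per-year file, while the flags are the extra information that only the per-crew file carries.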

mcshane-fire commented 11 months ago

If that's the case, we need to get all the per-crew stats files, and read them all. Do you have them all downloaded and available somewhere, or do we need to crawl Anu's site and download everything?

johnwalley commented 11 months ago

Lots of food for thought!

As for why the results in this repo stop at 1892 for Eights and 1900 for Torpids: 1891 Eights is the first event with crews dropping out, and I haven't written any parsing/exporting code to handle that behaviour. For Torpids I think I just stopped at a round number. I've just pushed up Torpids results going back to 1880, and will continue until I encounter the first crew dropping out.

johnwalley commented 11 months ago

I've pulled down the per crew stats files just in case we want to make use of them.

dudhia_per_crew_stats_files.zip

mcshane-fire commented 11 months ago

Okay, thanks. I'll have a look at whether I can automatically create results files for all years from this data. It might be quicker just to manually write out each year, since most of the tricky years (<1890) only have a single division per year, but the challenge of doing it automatically is an interesting one!

mcshane-fire commented 11 months ago

I've written the tool to read these files and turn them into sets of results, including coping where crews drop out or skip races. I currently have three issues:

1) There are some sets of results where multiple crews are listed with the same position on some days. Currently https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/ is inaccessible, so I can't check why this has happened. The sets that have issues are Torpids men's races in 1969 (Merton 2 & Merton 3), 1978 (Hertford 3 & Magdalen 2, St John's 3 & Worcester 2) and 1981 (Balliol 3 & LMH).

2) The next issue is that quite a few sets of bumps don't have all the crews listed. Some are just missing a few crews, so perhaps there are some files that weren't available or that you didn't include in the zip file; others have most of the results missing (Torpids 1995 for instance, both men's and women's results), which I'm guessing is a different problem.

3) The final problem is that these per-crew files don't include the division size information for each set of races. I think adding that is going to be a manual process, which should be fairly easy once Anu's site is back in action.
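The first issue in the list above is the kind of thing that can be checked mechanically. The following is a sketch only, not the code in the tool: it assumes each crew's positions have already been read into a list (start position, then one per day) and that 999 means withdrawn. The function name and the sample crews are made up.

```python
from collections import defaultdict

WITHDRAWN = 999  # position value assumed to mean the crew has dropped out


def find_shared_positions(crews):
    """Report positions held by more than one crew on the same day.

    crews maps crew name -> list of positions (start, then after each day).
    Returns {day_index: {position: [crew names]}}.
    """
    clashes = {}
    days = max((len(p) for p in crews.values()), default=0)
    for day in range(days):
        holders = defaultdict(list)
        for name, positions in crews.items():
            if day < len(positions) and positions[day] != WITHDRAWN:
                holders[positions[day]].append(name)
        shared = {pos: sorted(names)
                  for pos, names in holders.items() if len(names) > 1}
        if shared:
            clashes[day] = shared
    return clashes


# Hypothetical data: two boats accidentally given identical histories.
sample = {"A 2": [30, 31], "A 3": [30, 31], "B 1": [29, 29]}
print(find_shared_positions(sample))
# {0: {30: ['A 2', 'A 3']}, 1: {31: ['A 2', 'A 3']}}
```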

mcshane-fire commented 11 months ago

Problem 1:

1969: Merton's 2nd and 3rd boats have the same data. The ad_results for 1969 don't list Merton 2, just the 1st and 3rd boats. So I think in the ad_results file Merton 3 needs renaming as Merton 2, and in the per-crew results the entry for Merton 3 needs the 1969 line removing (or changing to 999). I have changes that fix this.

1981: Taking the ad_results data as correct, the problem is with the per-crew results file for LMH. I have a change that fixes this.

1978: I think the ad_results for this set are wrong: for many crews the results for the last three days are added together and the total is put as the second day's change, leaving the last two days wrong. The top of the first division also doesn't match between the two sources of data, but the other way around (the ad_results file has Keble going up one each day, whereas the per-crew file has them going up two on the first and third days). So it's not clear how to fix this yet.
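The 1978 pattern can be expressed as a simple check on one crew's daily changes from the two sources. This is a hedged sketch: it assumes four days of racing (as the description of "the second day" and "the last three days" implies), and the function name and example numbers are invented for illustration.

```python
def looks_like_summed_day_two(reference, suspect):
    """True if suspect's day-2 change equals reference's days 2-4 added up.

    Both arguments are lists of four daily changes for the same crew.
    Cases where the two sources already agree on day 2 are not flagged.
    """
    if len(reference) != 4 or len(suspect) != 4:
        raise ValueError("expected four daily changes per crew")
    return suspect[1] == sum(reference[1:]) and suspect[1] != reference[1]


# Invented example: the reference has +1, +2, -1 on days 2-4, and the
# suspect source records their total (+2) as the day-2 change.
print(looks_like_summed_day_two([1, 1, 2, -1], [1, 2, 0, 0]))  # True
print(looks_like_summed_day_two([1, 1, 0, 0], [1, 1, 0, 0]))   # False
```

Counting how many crews in the set trip this check would give a quick sense of how widespread the problem is, though it says nothing about which source is right for the crews at the top of the first division.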

mcshane-fire commented 11 months ago

Problem 2:

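Examples of missing files from recent results: (I can't see these files on https://web.archive.org/web/*/eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/*)

- Torpids, St John's 3rd women
- Torpids, St Anne's 3rd women
- Torpids, Jesus 3rd women
- Torpids, Balliol 3rd women
- Torpids, Wadham 4th men
- Torpids, St Catherine's 3rd men
- Torpids, Somerville 2nd men

Examples of missing files from old results: (again I can't see any evidence of per-crew files)

- Magdalen Hall (e.g. 1838)
- St Mary Hall (e.g. 1871)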

Assuming that we only really care about regenerating results from <1900, then I think we only need to worry about these older colleges, rather than the lower boats from the first list, which can probably be done manually unless you have more data stashed away somewhere!

johnwalley commented 11 months ago

I'll have more time to devote to this next week but here are my thoughts (I'm reading your posts in order).

> There are some sets of results where multiple crews are listed with the same position on some days. Currently https://eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/ is inaccessible, so I can't check why this has happened. The sets that have issues are Torpids men's races in 1969 (Merton 2 & Merton 3), 1978 (Hertford 3 & Magdalen 2, St John's 3 & Worcester 2) and 1981 (Balliol 3 & LMH).

Hmm, yes, I see the same.

> The next issue is that quite a few sets of bumps don't have all the crews listed. Some are just missing a few crews, so perhaps there are some files that weren't available or that you didn't include in the zip file; others have most of the results missing (Torpids 1995 for instance, both men's and women's results), which I'm guessing is a different problem.

Probably a few things going on here. I definitely didn't get every file in my download. I went through every college listed in the 'contents' section and attempted to get up to ten boats. Looking at the men's 1995 Torpids I missed Manchester and possibly Templeton (if they are considered different to Green Templeton).

The 1995 results also look like not much racing happened full stop.

> The final problem is that these per-crew files don't include the division size information for each set of races. I think adding that is going to be a manual process, which should be fairly easy once Anu's site is back in action.

I agree it'll end up being a manual process. Which reminds me to schedule some time to input some older Cambridge results.

> Problem 1:
>
> 1969: Merton's 2nd and 3rd boats have the same data. The ad_results for 1969 don't list Merton 2, just the 1st and 3rd boats. So I think in the ad_results file Merton 3 needs renaming as Merton 2, and in the per-crew results the entry for Merton 3 needs the 1969 line removing (or changing to 999). I have changes that fix this.

👍

> 1981: Taking the ad_results data as correct, the problem is with the per-crew results file for LMH. I have a change that fixes this.

👍

> 1978: I think the ad_results for this set are wrong: for many crews the results for the last three days are added together and the total is put as the second day's change, leaving the last two days wrong. The top of the first division also doesn't match between the two sources of data, but the other way around (the ad_results file has Keble going up one each day, whereas the per-crew file has them going up two on the first and third days). So it's not clear how to fix this yet.

Well spotted that the second day looks like the sum of the last three days! Yes, the ad_results file looks broken. It would be nice to have a third source of results. I might be mis-remembering this, but did The Times print results at some point?

> Problem 2:
>
> Examples of missing files from recent results: (I can't see these files on https://web.archive.org/web/*/eodg.atm.ox.ac.uk/user/dudhia/rowing/bumps/*)
>
> - Torpids, St John's 3rd women
> - Torpids, St Anne's 3rd women
> - Torpids, Jesus 3rd women
> - Torpids, Balliol 3rd women
> - Torpids, Wadham 4th men
> - Torpids, St Catherine's 3rd men
> - Torpids, Somerville 2nd men
>
> Examples of missing files from old results: (again I can't see any evidence of per-crew files)
>
> - Magdalen Hall (e.g. 1838)
> - St Mary Hall (e.g. 1871)
>
> Assuming that we only really care about regenerating results from <1900, then I think we only need to worry about these older colleges, rather than the lower boats from the first list, which can probably be done manually unless you have more data stashed away somewhere!

I don't have anything extra to hand but I'm up for some manual data entry!

mcshane-fire commented 11 months ago

Okay, I've got to a point where things are basically working and manual checking & data entry are the next steps. I've created https://github.com/mcshane-fire/bumps and pushed my current code to there. I need to add some documentation but the rough gist is:

Write a bumps chart: ./harness.py -w output.svg <name of tg_format file>

Convert an 'ad_format' results chart to a 'tg_format' results chart: ./convert -ad <input ad_format filename> <output tg_format_filename>

Read all the per-crew data files, and try to generate all the tg_format files: ./convert -pc <input directory containing all the files> <output directory to write all the files into>

The bumps.py file has support for the new results codes (v, x, w, d). The final 'convert' command is what I've been working on recently: it guesses which per-crew information is missing and writes this into a 'missing.txt' output file. This is in the format needed to add to the escapes.py file, which is read when that convert command is run, so we can iteratively add entries to make the output more correct. We'll get to a point where we've got a set of tg_format files with mostly correct output; we then need to go through the charts one by one and correct any mistakes.

mcshane-fire commented 11 months ago

I've added another option to the last command: if you give it a third directory, it doesn't output results files for any that it finds in that third directory. So I can now run ./src/convert -pc ox_per_crew ox_output results (where ox_output is a temporary new directory) and it will only write a results file if it believes it's got all the crew information (either from the ox_per_crew files, or from the escapes.py file) and that file doesn't exist in the results directory. It currently generates 114 files. From experience I think they all need manually checking, since Anu's charts indicate that crews didn't race on days where the per-crew results files don't have the -1 flag. For most results we also need to add the division size information.
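The "skip anything already in the third directory" behaviour amounts to a filename comparison. The sketch below shows one way it could work; it is not the code in the convert script, and the function name and the second example filename are hypothetical.

```python
from pathlib import Path


def candidates_to_write(generated_names, results_dir):
    """Return the generated filenames that are not already in results_dir."""
    existing = {p.name for p in Path(results_dir).iterdir() if p.is_file()}
    return sorted(name for name in generated_names if name not in existing)


# Example usage (assumes a 'results' directory exists):
# candidates_to_write(["eights1847_men.txt", "torpids1880_men.txt"], "results")
```

Matching on filename alone means a verified file committed to the results directory is never overwritten, even if the generated version would differ.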

My suggestion is that we both generate this set of 114 candidate files and, once we have verified that a results file is correct, commit it into the results directory so that next time it won't be generated. We can also investigate the missing.txt file, and add crew information into escapes.py so that it will generate more candidate files next time around.