ropensci / stats19

R package for working with open road traffic casualty data from Great Britain
https://docs.ropensci.org/stats19
GNU General Public License v3.0

Some police forces used alternative grid references for eastings and northings ~1979-1981 and 1986 #101

Open cmcaine opened 5 years ago

cmcaine commented 5 years ago

E.g. Hounslow isn't really in the sea to the west of Glasgow.

screen

The London Boroughs and some other geographical areas did this. If we can find out what system they were using we could fix this.

1986, looks like a CRS issue too, but I don't really know what's going on.

We can also observe a lot of other errors with the early geocoding in this sequence of images:

plot001

Source code:

# smaller_s19 is stats19 1979-2004, serious and fatal only, drop NAs on coordinates
library(dplyr)
library(sf)
library(tmap)
BNG = 27700 # assumed: EPSG code for the British National Grid (BNG was not defined in the snippet)
year_maps = smaller_s19 %>%
  mutate(year = lubridate::year(date)) %>%
  st_as_sf(coords = c("location_easting_osgr", "location_northing_osgr"), crs = BNG) %>%
  qtm() + tm_facets(along = "year")

# And copy the image files out of /tmp before this finishes ;)
tmap_animation(year_maps, "stats19-osgr-locations-over-time.mpg")

Robinlovelace commented 5 years ago

Well found @cmcaine. At the request of @mem48, I recall we added a warning saying that locations may not be accurate before 2005. It would be amazing if we could rectify the issues in the code. I think that analysing the crashes with clearly errant points could lead to a solution. One question: do the errors also affect the longitude and latitude columns? I rarely use data before 2005 but clearly it's very important, so I (and I imagine other users of the data) am very grateful to you for raising this issue.

cmcaine commented 5 years ago

There's no longitude or latitude data at all until 1999, by which time the eastings and northings look accurate anyway.

Some number of these will be transcription errors, but others (London) certainly look like systematic use of an alternative (or truncated) grid reference system, so I think there is a chance of fixing those.

Another challenge is that the data for all accidents has fewer obviously missing areas:

image

It seems unlikely that these places really had no serious or fatal collisions in a year. Perhaps police forces used a different system for recording more serious crashes in those areas?

mem48 commented 5 years ago

We had this problem with the cyipt project, some of the older data uses less precise grid references and lots of data ended up in the sea.

The British National Grid has not changed since 1936, and I'm not aware of any regional grids in the UK. So I suspect there is some ad hoc truncation, where police have left out the initial few digits because they would always be the same in their area of interest.

mem48 commented 5 years ago

@cmcaine I've had a deeper dive and this does not seem to be a simple scaling problem. I can't figure out how the coordinates are supposed to map to the BNG. I suggest making an enquiry with the DfT; they may have some historical context that we are missing.

cmcaine commented 5 years ago

Thanks for looking at it, Malcolm.

Sent to DfT:

Hello,

As described and illustrated in this GitHub issue [1], I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I also note that some areas, including much of Scotland, Wales and the North West appear to have reported "Slight" accidents/collisions to STATS19 but not "Serious" or "Fatal" collisions. This is also illustrated in the linked issue.

I would appreciate your help in identifying: 1) if this omission is documented; 2) if data on serious and fatal collisions in these locations is available elsewhere.

Cheers, Colin Caine PhD Student, School of Geography, University of Leeds

Robinlovelace commented 5 years ago

Just picking up the thread on this after getting back from holiday yesterday. I suspect there are some systematic errors that can be fixed, and likely some random errors that cannot. I think asking the DfT is a good plan (have you heard anything @cmcaine? can follow up if not) and, if they don't know either, would suggest a collaborative project aimed at doing an even deeper dive than @mem48 did to identify dodgy coordinates (please share analysis code you used for this if you have it).

Quantifying (e.g. range, standard deviation) and plotting differences between expected (based on recent data) and recorded Easting and Northing distributions for each force/year combination in which dodgy coordinates are found should help at least identify the region/years in which there is a pattern to the error.
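A minimal sketch of that comparison in base R, in case it helps anyone get started. The toy data frame, the force labels "A" and "B" and the helper name `coord_summary` are all hypothetical; only the column names follow the `get_stats19()` output. It assumes a recent year can be treated as a trusted reference for each force.

```r
# Toy data (hypothetical values): force "A" has 1979 coordinates about a
# tenth of their 2004 values, force "B" looks consistent across years.
crashes = data.frame(
  police_force = rep(c("A", "A", "B", "B"), each = 3),
  year = rep(c(1979, 2004, 1979, 2004), each = 3),
  location_easting_osgr = c(53000, 52500, 53500, 530000, 525000, 535000,
                            430000, 429000, 431000, 430500, 429500, 430000),
  location_northing_osgr = c(18000, 17500, 18500, 180000, 175000, 185000,
                             433000, 434000, 432000, 433500, 432500, 433000)
)

# Hypothetical helper: per force/year, report how far the median easting and
# northing sit from the same force's medians in a trusted reference year.
coord_summary = function(d, reference_year) {
  ref = d[d$year == reference_year, ]
  ref_e = tapply(ref$location_easting_osgr, ref$police_force, median)
  ref_n = tapply(ref$location_northing_osgr, ref$police_force, median)
  groups = split(d, list(d$police_force, d$year), drop = TRUE)
  out = lapply(groups, function(g) {
    force = as.character(g$police_force[1])
    data.frame(
      police_force = force,
      year = g$year[1],
      n = nrow(g),
      easting_median_shift = median(g$location_easting_osgr) - unname(ref_e[force]),
      northing_median_shift = median(g$location_northing_osgr) - unname(ref_n[force]),
      easting_sd = sd(g$location_easting_osgr),
      northing_sd = sd(g$location_northing_osgr)
    )
  })
  res = do.call(rbind, out)
  rownames(res) = NULL
  res
}

s = coord_summary(crashes, reference_year = 2004)
print(s)
```

Force/year rows with a large median shift or an unusually small or large standard deviation would be the ones to plot and inspect first. Medians are used rather than means so that a handful of wild points does not dominate the comparison.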

cmcaine commented 5 years ago

I have not received any response yet. Please feel free to follow up, I don't have any contacts at the DfT. Contacting the Metropolitan Police or the Mayor's Office for Policing and Crime[1] might be sensible, too.

It's pretty clear from the maps that it's mostly crime in London that is systematically miscoded in the first two years.

[1]: https://www.london.gov.uk/what-we-do/mayors-office-policing-and-crime-mopac/governance-and-decision-making/mopac-decisions--71

mem48 commented 5 years ago

I couldn't find any pattern in the London data. It wasn't even roughly in the shape of London, so I think we are going to need expert help.

cmcaine commented 5 years ago

I think it should be discoverable, though. 1981 was only 39 years ago. I'm sure the met police or mopac could reach out to some retired officers for us if they felt like it.

I've sent a similar email to the Ordnance Survey (OS) as well and attached a sample CSV of the eastings and northings.

I attach here a zip of a CSV of all of the eastings and northings for 1979-1981 in case anyone wants to get the data without using R (mostly for the convenience of our external friends). Each observation includes the easting and northing, local authority district name, road class and road number.

stats19_osgr_1979_1981.zip

The exact text of the email sent to the Ordnance Survey is:

To: universityenquiries@os.uk Subject: Help interpreting unusual grid references in DfT STATS19 data

Hello,

As described and illustrated in this GitHub issue [1], I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I attach a zipped CSV sample of 50k (out of ~750k) observations from the stats19 dataset 1979-1981 for your convenience. A zipped csv containing all the rows is linked from the github issue. Each observation includes the easting and northing, local authority district name, road class and road number.

Cheers, Colin Caine PhD Student, School of Geography, University of Leeds

cmcaine commented 5 years ago

Perhaps the coordinates just assume that they're on the OS map for their particular area of London.

If there were enough different maps for different areas of London, then the plotted points would not form a scaled, recognisable outline of London.

mem48 commented 5 years ago

That is possible. There may also have been a conversion error that scrambled the data. For example, BNG coordinates can be stored like this: TQ1234. If these had been converted to numbers incorrectly, they may have become garbled.
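To illustrate how such a conversion can go wrong, here is a minimal sketch of the standard letter-pair conversion for Ordnance Survey grid references in base R. The function name `gridref_to_en` is hypothetical and nothing here is known to be what the police forces or DfT actually did; it only shows what a correct conversion of a reference like TQ1234 looks like next to a naive one that drops the letters.

```r
# Hypothetical helper: convert an OS grid reference such as "TQ1234" to
# full numeric eastings and northings in metres (south-west corner of the
# square the reference identifies).
gridref_to_en = function(ref) {
  ref = toupper(gsub("\\s", "", ref))
  # Two grid letters (the letter I is not used), then an even number of digits
  if (!grepl("^[A-HJ-Z]{2}([0-9]{2}){0,5}$", ref)) {
    stop("not a valid grid reference: ", ref)
  }
  l1 = match(substr(ref, 1, 1), LETTERS) - 1
  l2 = match(substr(ref, 2, 2), LETTERS) - 1
  # Shuffle down the letters after I, which is skipped in the grid alphabet
  if (l1 > 7) l1 = l1 - 1
  if (l2 > 7) l2 = l2 - 1
  # Index of the 100 km square, counted from the false origin (square SV)
  e100km = ((l1 - 2) %% 5) * 5 + (l2 %% 5)
  n100km = (19 - (l1 %/% 5) * 5) - (l2 %/% 5)
  digits = substr(ref, 3, nchar(ref))
  half = nchar(digits) / 2
  if (half == 0) {
    e = 0
    n = 0
  } else {
    e = as.numeric(substr(digits, 1, half))
    n = as.numeric(substr(digits, half + 1, 2 * half))
  }
  # Pad the within-square digits out to metres
  scale = 10^(5 - half)
  c(easting = e100km * 1e5 + e * scale, northing = n100km * 1e5 + n * scale)
}

correct = gridref_to_en("TQ1234")
# A naive conversion that just strips the letters loses the 100 km square
naive = as.numeric(gsub("[^0-9]", "", "TQ1234"))
print(correct)
print(naive)
```

The naive value carries no information about which 100 km square the point is in, or where the easting digits end and the northing digits begin, which is one way records could end up scattered or in the sea.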

mem48 commented 5 years ago

I plotted all the crashes on the A4 in 1979:

A4

There are some random points, but clearly a road across the map. The interesting bit is the bottom right, which suggests more than one coordinate system is in use.

layik commented 4 years ago

Just catching up with this, I think one technical issue is also format_sf fails on output of get_stats19 for those years. I can take this to a new ticket if I am right:

Reprex to show all London accidents return empty using `format_sf`:

``` r
dd = "~/code/saferactive/ignored/"
acc7904 = stats19::get_stats19(1979, data_dir = dd)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Files identified: Stats19-Data1979-2004.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data1979-2004.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ~/code/saferactive/ignored//Stats19-Data1979-2004/Vehicles7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/code/saferactive/ignored//Stats19-Data1979-2004/Casualty7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Reading in:
#> /home/layik/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
# acc7904 = stats19::format_sf(acc7904, lonlat = TRUE)
l = acc7904[acc7904$local_authority_district == "London", ]
nrow(l)
#> [1] 638746
l = stats19::format_sf(l, lonlat = TRUE)
#> 638746 rows removed with no coordinates
#> Warning in min(cc[[1]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in min(cc[[2]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in max(cc[[1]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in max(cc[[2]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
nrow(l) == 0 # TRUE
#> [1] TRUE
```

Created on 2020-07-01 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

In terms of a systematic error or help with this ticket, I will also send an email to our tech contacts in DfT and cc @Robinlovelace and invite them to contribute here if possible.

layik commented 4 years ago

I sent an email to our tech contacts in DfT too @cmcaine and @mem48. Will update this ticket if I dig anything out. Great data analysis/insights.

wengraf commented 3 years ago

Just come to this - just terrible geo-coding, and no error checking at the time, not an alternative CRS. Unless you can match to main road name and work from that, just learn to live with it. I'd be wary of the idea of "correction" too....

Robinlovelace commented 3 years ago

I think different conventions were used in different forces. Confident there are ways to improve on the assumption of 'bog standard' 27700 (e.g. by dividing coords by 10) for some places, but not a priority!
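A rough way to test a rescaling idea like that for one area is sketched below, assuming we know a bounding box the points ought to fall in. The bounding box, the recorded coordinates and the helper name `share_in_bbox` are all hypothetical values made up for illustration, and the candidate factors are just examples.

```r
# Hypothetical helper: share of points that land inside the expected
# bounding box after multiplying both coordinates by `factor`.
share_in_bbox = function(easting, northing, bbox, factor) {
  e = easting * factor
  n = northing * factor
  inside = e >= bbox[["xmin"]] & e <= bbox[["xmax"]] &
    n >= bbox[["ymin"]] & n <= bbox[["ymax"]]
  mean(inside)
}

# Hypothetical expected extent for one local authority, in BNG metres
bbox = c(xmin = 509000, xmax = 522000, ymin = 169000, ymax = 180000)

# Hypothetical recorded coordinates: three look like they lost a digit,
# one does not fit under any of the factors tried
easting = c(51200, 51850, 52010, 300370)
northing = c(17300, 17650, 17790, 146000)

factors = c(0.1, 1, 10)
shares = setNames(
  sapply(factors, function(f) share_in_bbox(easting, northing, bbox, f)),
  factors
)
print(shares)
```

A factor that pulls most of an area's points into its expected extent would be a candidate correction for that force and year, though a high share alone does not prove the rescaled positions are right.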

wengraf commented 3 years ago

A lot would have been found on an old A-Z, then roughly guessed on a paper Landranger, with only the vaguest idea about eastings and northings. Stats19 has duff fields, at points in time, that’s something people have to just come to terms with.

wengraf commented 3 years ago

Some will have been filled with meaningless numbers, like 0,0, just so it passed the check for a filled in field.
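A minimal sketch of flagging those records in base R before any mapping. The helper name `flag_coords` is hypothetical; the limits used are the extent of the British National Grid (eastings 0 to 700,000 m, northings 0 to 1,300,000 m). A "plausible" flag only means the point lies on the grid, not that it is correct.

```r
# Hypothetical helper: classify each coordinate pair as missing, a 0,0
# placeholder, outside the National Grid extent, or plausible.
flag_coords = function(easting, northing) {
  flag = rep("plausible", length(easting))
  known = !is.na(easting) & !is.na(northing)
  off_grid = known &
    (easting < 0 | easting > 7e5 | northing < 0 | northing > 13e5)
  flag[off_grid] = "out_of_grid"
  flag[known & easting == 0 & northing == 0] = "zero"
  flag[!known] = "missing"
  flag
}

# Made-up inputs covering each case
easting = c(NA, 0, 198460, 900000)
northing = c(NA, 0, 894000, 100000)
flags = flag_coords(easting, northing)
print(table(flags))
```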

Robinlovelace commented 3 years ago

Good point Ivo.

layik commented 3 years ago

Can we also close this? As we cannot offer any useful solutions to the issue. Use of road names etc are all outside the main issue. I say we close it.

Robinlovelace commented 3 years ago

I think we can close this. We've raised the issue and even give the user a message telling them to watch out. Good suggestion, thanks @layik.

stats19::get_stats19(year = 1979)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Files identified: Stats19-Data1979-2004.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data1979-2004.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ~/stats19-data/Stats19-Data1979-2004/Vehicles7904.csv~/stats19-data/Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/stats19-data/Stats19-Data1979-2004/Casualty7904.csv~/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Reading in:
#> /home/robin/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
#> # A tibble: 6,224,198 x 33
#>    accident_index location_eastin… location_northi… longitude latitude
#>    <chr>                     <int>            <int>     <dbl>    <dbl>
#>  1 197901A11AD14                NA               NA        NA       NA
#>  2 197901A1BAW34            198460           894000        NA       NA
#>  3 197901A1BFD77            406380           307000        NA       NA
#>  4 197901A1BGC20            281680           440000        NA       NA
#>  5 197901A1BGF95            153960           795000        NA       NA
#>  6 197901A1CBC96            300370           146000        NA       NA
#>  7 197901A1DAK71            143370           951000        NA       NA
#>  8 197901A1DAP95            471960           845000        NA       NA
#>  9 197901A1EAC32            323880           632000        NA       NA
#> 10 197901A1FBK75            136380           245000        NA       NA
#> # … with 6,224,188 more rows, and 28 more variables: police_force <chr>,
#> #   accident_severity <chr>, number_of_vehicles <int>,
#> #   number_of_casualties <int>, date <date>, day_of_week <chr>, time <chr>,
#> #   local_authority_district <chr>, local_authority_highway <chr>,
#> #   first_road_class <chr>, first_road_number <int>, road_type <chr>,
#> #   speed_limit <int>, junction_detail <chr>, junction_control <chr>,
#> #   second_road_class <chr>, second_road_number <int>,
#> #   pedestrian_crossing_human_control <chr>,
#> #   pedestrian_crossing_physical_facilities <chr>, light_conditions <chr>,
#> #   weather_conditions <chr>, road_surface_conditions <chr>,
#> #   special_conditions_at_site <chr>, carriageway_hazards <chr>,
#> #   urban_or_rural_area <chr>,
#> #   did_police_officer_attend_scene_of_accident <int>,
#> #   lsoa_of_accident_location <chr>, datetime <dttm>

Created on 2020-12-03 by the reprex package (v0.3.0)

wengraf commented 10 months ago

I've looked back at my earlier comments, and perhaps age and fatherhood have mellowed me since then... Is there any mileage to be had with reverse geocoding and LA polygon/road/secondary road/junction type etc.? Perhaps it is an interesting undergrad or MSc project?
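As a possible starting point for such a project, here is a minimal sketch of the LA polygon check using sf. The two square "local authorities", their names and the crash records are hypothetical toy data; real boundaries would need to come from a boundary dataset for the relevant years.

```r
library(sf)

# Hypothetical helper: a square polygon in BNG metres
square = function(xmin, ymin, size) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}

# Two made-up neighbouring local authorities (EPSG:27700)
las = st_sf(
  la_name = c("Alpha", "Beta"),
  geometry = st_sfc(square(500000, 170000, 10000),
                    square(510000, 170000, 10000), crs = 27700)
)

# Made-up crash records with a recorded local authority
crashes = data.frame(
  accident_index = c("a1", "a2", "a3"),
  local_authority_district = c("Alpha", "Alpha", "Beta"),
  location_easting_osgr = c(505000, 515000, 198460),
  location_northing_osgr = c(175000, 175000, 894000)
)
pts = st_as_sf(
  crashes,
  coords = c("location_easting_osgr", "location_northing_osgr"),
  crs = 27700
)

# Attach the polygon each point falls in (NA if none), then compare it
# with the recorded local authority
joined = st_join(pts, las, join = st_within)
joined$consistent = !is.na(joined$la_name) &
  joined$la_name == joined$local_authority_district
print(st_drop_geometry(joined))
```

Records where the coordinates and the recorded local authority disagree could then be candidates for re-locating from road number and junction fields, which is the harder part of the job.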