ropensci / stats19

R package for working with open road traffic casualty data from Great Britain
https://docs.ropensci.org/stats19
GNU General Public License v3.0

Data quality question/issue #180

Open timcoote opened 4 years ago

timcoote commented 4 years ago

This is really a point about the data, but it impacts analyses done with it.

I believe that the speed limit column of accident data has errors in it. As an example, Accident Index '2009460170410' has a speed limit of 30. However, that Latitude, Longitude location is in a 40mph zone, which, I believe, dates back to the last millennium.

I realise that this issue is not under the repo's control, but I thought I would document it and bring it to attention so that it can be confirmed or invalidated.

I have reported this to DfT.
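A minimal sketch of how this kind of discrepancy could be recorded and checked, assuming the accident data is available as a CSV. The column names `Accident_Index` and `Speed_limit` are assumptions and should be checked against the actual file, and `verified_limits` is a hypothetical, hand-maintained lookup of limits confirmed independently. The only record used is the one described above.

```python
import csv
import io

# Assumed column names; adjust to match the file actually in use.
ID_COL = "Accident_Index"
LIMIT_COL = "Speed_limit"


def find_speed_limit_mismatches(rows, verified_limits):
    """Return (accident_index, recorded, verified) for rows that disagree.

    rows: iterable of dicts, e.g. from csv.DictReader
    verified_limits: dict mapping accident index to an independently
        checked speed limit in mph (hypothetical, hand-maintained)
    """
    mismatches = []
    for row in rows:
        idx = row[ID_COL]
        if idx not in verified_limits:
            continue  # nothing to compare this record against
        # float() first so values such as "30.0" are also accepted
        recorded = int(float(row[LIMIT_COL]))
        verified = verified_limits[idx]
        if recorded != verified:
            mismatches.append((idx, recorded, verified))
    return mismatches


# The single record discussed above: recorded as 30, believed to be 40.
sample_csv = "Accident_Index,Speed_limit\n2009460170410,30\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
mismatches = find_speed_limit_mismatches(rows, {"2009460170410": 40})
print(mismatches)
```

Keeping the verified values in a separate lookup leaves the raw data untouched while still making the known discrepancies reproducible.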

Robinlovelace commented 4 years ago

Thanks for reporting the issue @timcoote. Agree there are quality issues with the data, and a few issues that have come to light in a public forum thanks to community input around this package (see #101 and #91 which we are close to acting on in #178 for example). It's good that you've reported the issue to the DfT. Our approach is to link to the DfT documentation and be only a provider of the data, staying faithful to the raw data. If you see opportunities to improve the documentation/code in any way please do let us know.

timcoote commented 4 years ago

It strikes me that there would be value in a document on data quality, highlighting issues as they are noticed. This is quite a long time series and, I suspect, this project is going to expose it to many more eyes than have seen it in the past. #101 is a very good example of change over time; this one is, I suspect, an issue with the data never having been checked. (There may be a similar issue with Latitude and Longitude, or with how they map onto Google's maps, as I can see several accidents inside shops ;-) )

An early heads up on where the bear traps are will make life easier for newcomers to the data.

Such a document could also track interactions with DfT in improving their data governance and quality, demonstrating the value of the repo/opening up the data.

Robinlovelace commented 4 years ago

> Such a document could also track interactions with DfT in improving their data governance and quality, demonstrating the value of the repo/opening up the data.

There is no specific suggestion here, but I will keep this issue open in the hope that it encourages further feedback and Pull Requests from others, especially domain specialists who work with this data on a daily basis, such as analysts at Agilysis (hint ;) ).

PRs welcome.

timcoote commented 4 years ago

If you just want a PR for a "Known Data Quality Issues" document, I could kick one of those off for you.

I did get a response from DfT: """ There might be differences between the reported speed limit and the actual speed limit of the road where the accident happened for the following reasons:

I had already ruled out the second and third of these as the source of the error in at least one case.

timcoote commented 1 year ago

I fell into issues with #101 when trying to estimate changes in accident rates in areas, so I had a quick look at other years, and raised the issue of quality checking / updates with DfT, proposing that they support some sort of effort to post-process the data to iron out known issues/document what's left behind. (I'm noting this here as that issue is closed and this one is, I believe, left to track data quality issues).

I wanted to document that Python's basemap package has too crude a resolution to be useful. Even at full resolution, approximately 70% of locations are falsely identified as not is_land :-(

On the upside, the years from 1999 onwards seem to have < 0.02% of locations in water, assuming that an elevation of 0.0 reported from open-elevation is a reasonable proxy for being in the water. It's not perfect, but it looks reasonable. A better check may be to look for locations outside their police area, but I quickly gave up on that. If I get time, I may try to estimate the error rate up to 1998.
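A rough sketch of the elevation proxy described above, assuming the lookups have already been fetched and parsed into a list of dicts with an `elevation` key (my understanding of the shape of an open-elevation response; treat it as an assumption). The function name and the sample values are hypothetical and are not STATS19 data.

```python
def share_at_zero_elevation(results):
    """Return the fraction of points whose reported elevation is exactly 0.0.

    results: list of dicts, each with an "elevation" value in metres.
    An elevation of 0.0 is used as a crude proxy for "in the water",
    so low-lying land at sea level will be flagged as well.
    """
    if not results:
        return 0.0
    flagged = sum(1 for r in results if float(r["elevation"]) == 0.0)
    return flagged / len(results)


# Made-up points for illustration only.
sample = [
    {"latitude": 51.5, "longitude": -0.1, "elevation": 12.0},
    {"latitude": 53.4, "longitude": -2.2, "elevation": 38.0},
    {"latitude": 50.7, "longitude": -1.3, "elevation": 0.0},
    {"latitude": 55.9, "longitude": -3.2, "elevation": 47.0},
]
share = share_at_zero_elevation(sample)
print(f"{share:.2%} of sample points reported at 0.0 m")
```

Running this per year would give a comparable suspect-location rate for each year without relying on a coastline dataset.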

Robinlovelace commented 1 year ago

Great stuff @timcoote thanks for the updates, keep plugging away at it. Any updates / thoughts on implications for this package and how to make it better: v. welcome.

timcoote commented 1 year ago

@Robinlovelace one thing that did pop out when I looked at the 2022 data is the inclusion of vehicle information. So I was going to try a quick cross check on the number of licensed vehicles just to see if there are any obviously over-represented manufacturers/models/types. (this is probably the wrong thread for this comment, but it was the first that came to hand)
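One way the cross-check could be sketched, assuming a list of vehicle makes taken from the accident records and a dict of licensed-vehicle counts per make from a separate source. The function name, the makes "A" and "B" and all counts are made up for illustration.

```python
from collections import Counter


def representation_ratios(accident_makes, licensed_counts):
    """Ratio of each make's share of accident vehicles to its share of licensed vehicles.

    A ratio well above 1 suggests over-representation. It does not control
    for mileage, vehicle age or where the vehicles are driven.
    """
    accident_counts = Counter(accident_makes)
    total_accident = sum(accident_counts.values())
    total_licensed = sum(licensed_counts.values())
    ratios = {}
    for make, n in accident_counts.items():
        licensed = licensed_counts.get(make, 0)
        if licensed == 0:
            continue  # no denominator available for this make
        ratios[make] = (n / total_accident) / (licensed / total_licensed)
    return ratios


# Hypothetical makes and counts, for illustration only.
accident_makes = ["A", "A", "A", "B"]
licensed_counts = {"A": 500, "B": 500}
ratios = representation_ratios(accident_makes, licensed_counts)
print(ratios)
```

Makes with no licensed-vehicle count are skipped rather than guessed at, so a missing ratio points to a gap in the denominator data.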

Robinlovelace commented 1 year ago

Sounds good to me, good luck with it and do keep us posted!