saferactive / trafficalmr

R package to support road safety and traffic calming measures
https://saferactive.github.io/trafficalmr/
GNU General Public License v3.0

tc_join_stats19 should be more flexible #24

Closed Robinlovelace closed 3 years ago

Robinlovelace commented 4 years ago

It should be able to return data at the casualty level. Example below:

remotes::install_github("saferactive/trafficalmr")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'trafficalmr' from a github remote, the SHA1 (bce0785a) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(trafficalmr)
nrow(crashes_wf)
#> [1] 3449
nrow(casualties_wf)
#> [1] 4245
crash_summary = tc_join_stats19(crashes_wf, casualties_wf, vehicles_wf)
#> Joining, by = "accident_index"
nrow(crash_summary)
#> [1] 3449

Created on 2020-07-17 by the reprex package (v0.3.0)
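
A minimal sketch of what casualty-level output could look like, using toy tables rather than the packaged crashes_wf and casualties_wf data. The helper name join_casualty_level() is hypothetical and not part of trafficalmr. It only illustrates that joining from the casualty table keeps one row per casualty rather than one row per crash.

```r
# Toy stand-ins for the crash and casualty tables (assumed columns)
crashes = data.frame(
  accident_index = c("a1", "a2", "a3"),
  accident_severity = c("Slight", "Serious", "Slight")
)
casualties = data.frame(
  accident_index = c("a1", "a1", "a2", "a3"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist", "Car occupant")
)

# Hypothetical helper: one row per casualty, with crash attributes attached
join_casualty_level = function(crashes, casualties) {
  merge(casualties, crashes, by = "accident_index", all.x = TRUE)
}

casualty_level = join_casualty_level(crashes, casualties)
nrow(casualty_level) == nrow(casualties)
```

Starting the join from the casualty table (all.x = TRUE) is the design choice that preserves the casualty row count.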

layik commented 4 years ago

This might be the next best ticket for me.

Robinlovelace commented 4 years ago

Fantastic you're up for looking at it. I think it would be great if the function, or family of tc_join*() functions, can output data that is:

A question is whether to make them arguments in one main function or separate functions. From a usability perspective I would err towards a separate function for each, e.g.

layik commented 3 years ago

Just trying to understand this better. I see that the point of the work in tc-join as it stands is to generate a df where we can see which vehicles were involved in which crash. I suppose, @Robinlovelace would then want to see UpSet plots for casualty types and ? What I mean is there would be no tc_join_stats19_ac because that is the basis for the other two and indeed, accident_index and year were always our keys. Correct?

Robinlovelace commented 3 years ago

I suppose, @Robinlovelace would then want to see UpSet plots for casualty types and ?

Yes, it would be good to see them for number of casualties, number of vehicles and number of crash records, and there could be different combinations (e.g. Y axis being number of casualties and X axis being vehicle type). The outputs of tc_join() functions could be useful for a range of different things, not just UpSet plots. One approach would be to use the dm package, but that would introduce more overhead, so I suggest we don't use it for now. It's good to be aware of alternative approaches that could be useful later: https://github.com/krlmlr/dm

there would be no tc_join_stats19_ac because that is the basis for the other two and indeed, accident_index and year were always our keys. Correct?

I think tc_join_stats19_ac() could be useful but may need aggregating functions, e.g. to count the number of cyclists, pedestrians etc in the casualties table who are hurt per crash. @joeytalbot has done that in previous scripts I think, please share a link to code that does that if you get a chance Joey.
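
As a rough illustration of that kind of aggregation (a sketch on toy data in base R, not the code from the earlier scripts mentioned above), casualty types can be cross-tabulated per crash and joined back onto the crash table. The column names are assumptions modelled on the STATS19 tables.

```r
# Toy casualty and crash tables (assumed columns)
casualties = data.frame(
  accident_index = c("a1", "a1", "a2", "a3", "a3"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist", "Cyclist", "Cyclist")
)
crashes = data.frame(accident_index = c("a1", "a2", "a3"))

# Cross-tabulate: one row per crash, one column per casualty type
counts = as.data.frame.matrix(
  table(casualties$accident_index, casualties$casualty_type)
)
counts$accident_index = rownames(counts)
rownames(counts) = NULL

# Attach the per-crash counts to the crash-level table
crash_counts = merge(crashes, counts, by = "accident_index", all.x = TRUE)
crash_counts
```

The result stays at crash level (one row per accident_index) while still carrying casualty information as count columns.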

Hope that makes sense...

layik commented 3 years ago

@Robinlovelace I won't be making a pull request out of this yet; I'd like to know what would be at least one useful function from your comment above so I can implement/improve/contribute further. As it stands, I'm not quite able to translate your comment into code.

Robinlovelace commented 3 years ago

No worries, you could take a look at adding some comment to this instead. Starting with the building blocks of the ac, ca and ve tables could be a starter for deciding how best to write code to automate parts of the joining process. Alternatively, it's possible that this is one of those things that is best just described and not 'over functionalised', as @rogerbeecham was alluding to with respect to the upset plot code.

Here's a section in need of content (will try to make the edit button work now but the source should be easy to find): https://saferactive.github.io/rrsrr/joining-road-crash-tables.html

layik commented 3 years ago

OK, so just doing some work here and @Robinlovelace whilst I was watching has done a good section in the rrsrr on this. Just found out that not all indices in casualties and vehicles are in the accidents table.

Is this something @mem48 is an expert in? Anyone else?

I guess the question I must ask: what do we do with those records in the case of joining them?

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
ac = get_stats19(year = 2019, type = "ac", output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2019.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> Reading in:
#> /tmp/Rtmpzevr4i/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> date and time columns present, creating formatted datetime column
#> 28 rows removed with no coordinates
ca = get_stats19(year = 2019, type = "ca")
#> Files identified: DfTRoadSafety_Casualties_2019.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Casualties_2019/Road Safety Data - Casualties 2019.csv
ve = get_stats19(year = 2019, type = "ve")
#> Files identified: DfTRoadSafety_Vehicles_2019.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Vehicles_2019/Road Safety Data- Vehicles 2019.csv

all(ca$accident_index %in% ac$accident_index)
#> [1] FALSE
all(ve$accident_index %in% ac$accident_index)
#> [1] FALSE

which(!ca$accident_index %in% ac$accident_index)
#>  [1]  32870  35672  37513  37514  42935  43816  44878  49428  49429  49610
#> [11]  49611  49612  49634  49981  50269  50270  50329  50330  50612  50694
#> [21]  50921  50929  50930  51000  51001  51039  51040  51041  51137  53661
#> [31]  53662  60791  76150  76151 117585 126082 139245 140079 143021 143022
#> [41] 143023 145523

which(!ve$accident_index %in% ac$accident_index)
#>  [1]  49394  49395  53112  55581  55582  63126  64441  64442  66036  66037
#> [11]  72305  72522  72523  72544  72545  72996  72997  72998  73396  73397
#> [21]  73485  73486  73876  73877  73989  73990  74272  74283  74284  74378
#> [31]  74379  74433  74434  74560  74561  78032  78033  87939  87940 109329
#> [41] 109330 167863 179919 179920 197938 197939 197940 199070 199071 203025
#> [51] 203026 203027 206315 206316

Created on 2020-10-07 by the reprex package (v0.3.0)
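
One possible way to handle those records, sketched on toy data: set the unmatched rows aside for inspection and join only the matched ones. This is an assumption about a reasonable workflow, not an agreed approach. Note also that the download log above reports 28 crash rows removed for lacking coordinates when output_format = "sf" is used. That might account for some of the unmatched indices, but it is a guess that would need checking.

```r
# Toy crash and casualty tables; "zz" has no matching crash record
ac = data.frame(accident_index = c("a1", "a2"))
ca = data.frame(
  accident_index = c("a1", "a1", "a2", "zz"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist", "Pedestrian")
)

# Split casualties into matched and unmatched ("orphan") records
matched = ca$accident_index %in% ac$accident_index
ca_orphans = ca[!matched, ]  # keep for reporting rather than silently dropping
ca_matched = ca[matched, ]

# Inner join on the matched records only
joined = merge(ca_matched, ac, by = "accident_index")
nrow(joined)
nrow(ca_orphans)
```

Keeping ca_orphans as a separate object makes it easy to report how many records were excluded from the join.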

layik commented 3 years ago

It is interesting actually:

nrow(ca) == sum(ac$number_of_casualties) + length(which(!ca$accident_index %in% ac$accident_index))
#> TRUE

Robinlovelace commented 3 years ago

Very interesting @layik. I think it's worth asking the road safety stats team about, suspect it's an error in the data but not sure.

layik commented 3 years ago

Right, so @Robinlovelace is raising this and I am glad he is, because this is not just 2019. Here is what I have found but cannot give you a reprex just yet:

        2015 2016 2017 2018 2019
caNotINac   37   11   25   76   42
veNotINac   53   12   36   96   54
acNotINca    0    0    0    0    0
acNotINve    0    0    0    0    0
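
A table of this shape could be computed along the following lines. This is only a sketch on toy data, not the code that produced the numbers above; the helper n_unmatched() and the nested list of tables are hypothetical stand-ins for per-year get_stats19() output.

```r
# Hypothetical helper: count rows in `x` whose key has no match in `y`
n_unmatched = function(x, y) sum(!x$accident_index %in% y$accident_index)

# Toy tables keyed by year (assumed structure)
tables = list(
  "2018" = list(
    ac = data.frame(accident_index = c("a1", "a2")),
    ca = data.frame(accident_index = c("a1", "a2", "x9")),
    ve = data.frame(accident_index = c("a1", "a2", "x9", "x9"))
  ),
  "2019" = list(
    ac = data.frame(accident_index = c("b1")),
    ca = data.frame(accident_index = c("b1", "b1")),
    ve = data.frame(accident_index = c("b1"))
  )
)

# One column per year, one row per direction of the check
summary_table = sapply(tables, function(tb) c(
  caNotINac = n_unmatched(tb$ca, tb$ac),
  veNotINac = n_unmatched(tb$ve, tb$ac),
  acNotINca = n_unmatched(tb$ac, tb$ca),
  acNotINve = n_unmatched(tb$ac, tb$ve)
))
summary_table
```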

layik commented 3 years ago

Reopen if need be