Closed Robinlovelace closed 3 years ago
This might be the next best ticket for me.
Fantastic you're up for looking at it. I think it would be great if the function, or family of tc_join*() functions, can output data that is:
A question is whether to make them arguments in one main function or separate functions. From a usability perspective I would err towards a separate function for each, e.g.
tc_join_stats19_ac()
tc_join_stats19_ca()
tc_join_stats19_ve()
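One way to keep separate user-facing functions without duplicating logic is a shared helper with thin wrappers. The sketch below is only an illustration: `tc_join_tables()` and the wrapper bodies are hypothetical, not existing package code, and the toy tables (with made-up values) stand in for real stats19 data.

```r
# Hypothetical helper: join a child table (casualties or vehicles) to accidents
tc_join_tables = function(ac, child, by = "accident_index") {
  merge(child, ac, by = by) # inner join: one row per matched child record
}

# Hypothetical wrappers, named after the suggestions above
tc_join_stats19_ca = function(ac, ca) tc_join_tables(ac, ca)
tc_join_stats19_ve = function(ac, ve) tc_join_tables(ac, ve)

# Toy data (made-up values, not real stats19 records)
ac = data.frame(accident_index = c("a1", "a2"), speed_limit = c(30, 60))
ca = data.frame(
  accident_index = c("a1", "a1", "a2"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist")
)
ve = data.frame(
  accident_index = c("a1", "a2", "a2"),
  vehicle_type = c("Car", "Car", "Pedal cycle")
)

ca_joined = tc_join_stats19_ca(ac, ca) # casualty-level rows with crash columns
ve_joined = tc_join_stats19_ve(ac, ve) # vehicle-level rows with crash columns
```

Each wrapper returns data at the level of its child table, so crash-level columns such as `speed_limit` are repeated for every casualty or vehicle in the same crash.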
Just trying to understand this better. I see that the point of the work in tc-join as it stands is to generate a df where we can see which vehicles were involved in which crash. I suppose, @Robinlovelace would then want to see UpSet plots for casualty types and ? What I mean is there would be no tc_join_stats19_ac because that is the basis for the other two and indeed, accident_index and year were always our keys. Correct?
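To make "which vehicles were involved in which crash" concrete, here is a minimal base R sketch that builds a crash-by-vehicle-type membership matrix, the kind of input UpSet-style plots need. The data are toy values and the object names (`involved`, `combos`) are made up for illustration; only `accident_index` is taken from the discussion above.

```r
# Toy vehicles table (made-up values), keyed by accident_index
ve = data.frame(
  accident_index = c("a1", "a1", "a2", "a3", "a3"),
  vehicle_type = c("Car", "Pedal cycle", "Car", "Car", "Car")
)

# One row per crash, one column per vehicle type:
# TRUE if at least one vehicle of that type was involved
involved = table(ve$accident_index, ve$vehicle_type) > 0

# Label each crash by its combination of vehicle types
combos = apply(involved, 1, function(x) {
  paste(colnames(involved)[x], collapse = " + ")
})

# Number of crashes per combination
table(combos)
```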
I suppose, @Robinlovelace would then want to see UpSet plots for casualty types and ?
Yes it would be good to see them for number of casualties, number of vehicles and number of crash records, and there could be different combinations (e.g. Y axis being number of casualties and X axis being vehicle type) perhaps. The outputs of tc_join() functions could be useful for a range of different things, not just upset plots. One approach would be to use the dm package but that would introduce more overheads so suggest we don't use it for now, but good to be aware of alternative approaches that could be useful later: https://github.com/krlmlr/dm
there would be no tc_join_stats19_ac because that is the basis for the other two and indeed, accident_index and year were always our keys. Correct?
I think tc_join_stats19_ac() could be useful but may need aggregating functions, e.g. to count the number of cyclists, pedestrians etc in the casualties table who are hurt per crash. @joeytalbot has done that in previous scripts I think, please share a link to code that does that if you get a chance Joey.
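As a rough idea of what such an aggregating step could look like, the base R sketch below counts cyclists and pedestrians per crash and joins the counts back onto the accidents table. It is not the code from the earlier scripts mentioned above: the data are toy values and the names `n_cyclist`, `n_pedestrian` and `ca_summary` are made up for illustration.

```r
# Toy tables (made-up values)
ac = data.frame(accident_index = c("a1", "a2", "a3"))
ca = data.frame(
  accident_index = c("a1", "a1", "a2", "a3"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist", "Car occupant")
)

# Count casualties of each type per crash
n_cyclist = tapply(ca$casualty_type == "Cyclist", ca$accident_index, sum)
n_pedestrian = tapply(ca$casualty_type == "Pedestrian", ca$accident_index, sum)

# Both results are ordered by the same grouping variable, so they line up
ca_summary = data.frame(
  accident_index = names(n_cyclist),
  n_cyclist = as.vector(n_cyclist),
  n_pedestrian = as.vector(n_pedestrian)
)

# One row per crash, with casualty counts as extra columns
ac_joined = merge(ac, ca_summary, by = "accident_index", all.x = TRUE)
```

The result stays at the crash level (one row per `accident_index`), which is what distinguishes it from the casualty-level and vehicle-level joins.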
Hope that makes sense...
@Robinlovelace I won't be making a pull request out of this yet. I'd like to know what would be at least one useful function from your comment above so I can implement/improve/contribute further. As it stands, I'm not quite able to translate your comment into code.
No worries, you could take a look at adding some comment to this instead. Starting with the building blocks of the ac, ca and ve tables could be a starter for deciding how best to write code to automate parts of the joining process. Alternatively, it's possible that this is one of those things that is best just describing and not 'over functionalising' as @rogerbeecham was alluding to with respect to the upset plot code.
Here's a section in need of content (will try to make the edit button work now but the source should be easy to find): https://saferactive.github.io/rrsrr/joining-road-crash-tables.html
OK, so just doing some work here and @Robinlovelace whilst I was watching has done a good section in the rrsrr on this. Just found out that not all indices in casualties and vehicles are in the accidents table.
Is this something @mem48 is an expert in? Anyone else?
I guess the question I must ask: what do we do with those records in the case of joining them?
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
ac = get_stats19(year = 2019, type = "ac", output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2019.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> Reading in:
#> /tmp/Rtmpzevr4i/DfTRoadSafety_Accidents_2019/Road Safety Data - Accidents 2019.csv
#> date and time columns present, creating formatted datetime column
#> 28 rows removed with no coordinates
ca = get_stats19(year = 2019, type = "ca")
#> Files identified: DfTRoadSafety_Casualties_2019.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Casualties_2019/Road Safety Data - Casualties 2019.csv
ve = get_stats19(year = 2019, type = "ve")
#> Files identified: DfTRoadSafety_Vehicles_2019.zip
#> http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2019.zip
#> Attempt downloading from:
#> Data saved at /tmp/Rtmpzevr4i/DfTRoadSafety_Vehicles_2019/Road Safety Data- Vehicles 2019.csv
all(ca$accident_index %in% ac$accident_index)
#> [1] FALSE
all(ve$accident_index %in% ac$accident_index)
#> [1] FALSE
which(!ca$accident_index %in% ac$accident_index)
#> [1] 32870 35672 37513 37514 42935 43816 44878 49428 49429 49610
#> [11] 49611 49612 49634 49981 50269 50270 50329 50330 50612 50694
#> [21] 50921 50929 50930 51000 51001 51039 51040 51041 51137 53661
#> [31] 53662 60791 76150 76151 117585 126082 139245 140079 143021 143022
#> [41] 143023 145523
which(!ve$accident_index %in% ac$accident_index)
#> [1] 49394 49395 53112 55581 55582 63126 64441 64442 66036 66037
#> [11] 72305 72522 72523 72544 72545 72996 72997 72998 73396 73397
#> [21] 73485 73486 73876 73877 73989 73990 74272 74283 74284 74378
#> [31] 74379 74433 74434 74560 74561 78032 78033 87939 87940 109329
#> [41] 109330 167863 179919 179920 197938 197939 197940 199070 199071 203025
#> [51] 203026 203027 206315 206316
Created on 2020-10-07 by the reprex package (v0.3.0)
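On the question of what to do with those records when joining, the base R sketch below lays out three options on toy data (made-up values, with `"x9"` standing in for an index missing from the accidents table). It only illustrates the trade-offs and is not a recommendation on which one is right for stats19.

```r
# Toy tables (made-up values); "x9" has no matching crash record
ac = data.frame(
  accident_index = c("a1", "a2"),
  number_of_casualties = c(2, 1)
)
ca = data.frame(
  accident_index = c("a1", "a1", "a2", "x9"),
  casualty_type = c("Cyclist", "Pedestrian", "Cyclist", "Pedestrian")
)

orphan = !ca$accident_index %in% ac$accident_index

# Option 1: inner join, unmatched casualties are silently dropped
ca_inner = merge(ca, ac, by = "accident_index")

# Option 2: left join, unmatched casualties are kept with NA crash columns
ca_left = merge(ca, ac, by = "accident_index", all.x = TRUE)

# Option 3: set the unmatched records aside for reporting
ca_orphans = ca[orphan, ]
```

Whichever option is used, reporting `sum(orphan)` alongside the join makes it clear how many records were affected.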
It is interesting actually:
nrow(ca) == sum(ac$number_of_casualties) + length(which(!ca$accident_index %in% ac$accident_index))
#> TRUE
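The total-level check above can also be done crash by crash, which would show whether `number_of_casualties` agrees with the casualties table for every matched record. A minimal sketch on toy data (made-up values; `counted` and `mismatch` are names invented here):

```r
# Toy tables (made-up values); "x9" is a casualty with no crash record
ac = data.frame(
  accident_index = c("a1", "a2", "a3"),
  number_of_casualties = c(2, 1, 1)
)
ca = data.frame(accident_index = c("a1", "a1", "a2", "a3", "x9"))

# Casualty rows per accident_index
n_in_ca = table(ca$accident_index)

# Look up each crash's count; NA if the crash has no casualty rows
counted = as.vector(n_in_ca)[match(ac$accident_index, names(n_in_ca))]

# Crashes where the recorded and counted numbers disagree
mismatch = ac$accident_index[is.na(counted) | counted != ac$number_of_casualties]

# Same total-level identity as above, on the toy data
n_orphan = sum(!ca$accident_index %in% ac$accident_index)
nrow(ca) == sum(ac$number_of_casualties) + n_orphan
```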
Very interesting @layik. I think it's worth asking the road safety stats team about, suspect it's an error in the data but not sure.
Right, so @Robinlovelace is raising this and I am glad he is, because this is not just 2019. Here is what I have found but cannot give you a reprex just yet:
           2015 2016 2017 2018 2019
caNotINac    37   11   25   76   42
veNotINac    53   12   36   96   54
acNotINca     0    0    0    0    0
acNotINve     0    0    0    0    0
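This is not the code that produced the table above. It is a minimal sketch of how such counts could be computed, assuming the row labels mean "records in the first table whose accident_index is not in the second". The helper `count_unmatched()` and the toy data are hypothetical; with real data each year's tables would come from `get_stats19()` as in the reprex above.

```r
# Hypothetical helper: count records without a match, in each direction
count_unmatched = function(ac, ca, ve) {
  c(
    caNotINac = sum(!ca$accident_index %in% ac$accident_index),
    veNotINac = sum(!ve$accident_index %in% ac$accident_index),
    acNotINca = sum(!ac$accident_index %in% ca$accident_index),
    acNotINve = sum(!ac$accident_index %in% ve$accident_index)
  )
}

# Toy data for two "years" (made-up values, not real stats19 records)
toy = list(
  "2018" = list(
    ac = data.frame(accident_index = c("a1", "a2")),
    ca = data.frame(accident_index = c("a1", "a2", "x1")),
    ve = data.frame(accident_index = c("a1", "a2"))
  ),
  "2019" = list(
    ac = data.frame(accident_index = "b1"),
    ca = data.frame(accident_index = "b1"),
    ve = data.frame(accident_index = c("b1", "y1", "y1"))
  )
)

# One column per year, one row per check
res = sapply(toy, function(y) count_unmatched(y$ac, y$ca, y$ve))
res
```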
Reopen if need be
It should be able to return data at the casualty level. Example below:
Created on 2020-07-17 by the reprex package (v0.3.0)