Robinlovelace opened this issue 5 years ago:
Should also work with gsub().
Also, it seems the 2014 data is duplicated:
Thanks @robinlovelace, I'll check it out
May be able to submit a patch.
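As a hedged sketch of how the duplication could be quantified per year (not the package's actual check; trips_df here is toy data, with the column names assumed from the snippets later in this thread):

library(dplyr)

# toy stand-in for the real trips table; columns assumed from the snippets below
trips_df = tibble::tibble(
  start_time = as.POSIXct(c("2014-01-01 08:00:00", "2014-01-01 08:00:00",
                            "2015-01-01 09:00:00"), tz = "UTC"),
  stop_time = start_time + 600,
  start_station_id = c(1, 1, 2),
  end_station_id = c(5, 5, 7)
)

# share of rows in each year that duplicate another row's trip attributes
trips_df %>%
  mutate(year = format(start_time, "%Y")) %>%
  group_by(year) %>%
  summarise(n = n(),
            n_unique = n_distinct(start_time, stop_time,
                                  start_station_id, end_station_id)) %>%
  mutate(prop_duplicated = 1 - n_unique / n)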
[Figure: updated cleaned data, with missing IDs in grey]
Ground truth data from TfL daily counts, cc @rogerbeecham.
library(ggplot2)
# TfL's published daily cycle hire counts
u = "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xls"
download.file(u, "tfl-daily-cycle-hires.xls")
daily_hires_schema = readxl::read_excel("tfl-daily-cycle-hires.xls") # first sheet: metadata
daily_hires = readxl::read_excel("tfl-daily-cycle-hires.xls", sheet = 2) # second sheet: daily counts
#> New names:
#> * `Number of Bicycle Hires` -> `Number of Bicycle Hires...2`
#> * `` -> ...3
#> * Month -> Month...4
#> * `Number of Bicycle Hires` -> `Number of Bicycle Hires...5`
#> * `` -> ...6
#> * … and 3 more problems
range(daily_hires$Day)
#> [1] "2010-07-30 UTC" "2019-07-31 UTC"
names(daily_hires)[2] = "Number of hires"
# 30-day and 365-day rolling means to smooth the daily counts
daily_hires$Monthly = RcppRoll::roll_mean(x = daily_hires$`Number of hires`, n = 30, fill = NA)
daily_hires$Yearly = RcppRoll::roll_mean(x = daily_hires$`Number of hires`, n = 365, fill = NA)
ggplot(daily_hires, aes(Day, `Number of hires`)) +
geom_point(alpha = 0.1) +
geom_line(aes(Day, Monthly), lwd = 1) +
geom_line(aes(Day, Yearly), colour = "blue", lwd = 1) +
xlab("Year") +
ylim(c(0, 50000)) +
xlim(as.POSIXlt(c("2010-01-01", "2019-10-01")))
#> Warning: Removed 2 rows containing missing values (geom_point).
#> Warning: Removed 29 rows containing missing values (geom_path).
#> Warning: Removed 364 rows containing missing values (geom_path).
Created on 2019-08-29 by the reprex package (v0.3.0)
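Not part of the original reprex, but a natural follow-up sketch: aggregating the daily counts to annual totals gives a ground-truth figure the database can be checked against (assuming the daily_hires object from above):

library(dplyr)

# annual totals from the daily counts, for comparison with the database
daily_hires %>%
  mutate(Year = format(Day, "%Y")) %>%
  group_by(Year) %>%
  summarise(total_hires = sum(`Number of hires`, na.rm = TRUE))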
Thanks @robinlovelace, I'll clean all these up next week. For London it's usually just minor formatting and file-naming stuff, so it shouldn't be too hard.
Another illustration of this: it should be possible to find the duplicate files that go into the DB. The distinct() function in dplyr can find them; frustratingly, the trip IDs differ:
> nrow(trips_df_duplicated) / nrow(trips_df)
[1] 0.824824

> trips_df_duplicated = distinct(trips_df %>% select(start_time, stop_time, start_station_id, end_station_id))
> nrow(trips_df_duplicated) / nrow(trips_df)
[1] 0.8273564
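One hedged way to handle the differing IDs (a sketch, not the package's actual fix): group on the trip attributes and keep the first row of each group, so every retained trip keeps one valid ID. This assumes the trips_df columns shown in the snippet above:

library(dplyr)

# keep the first row (and hence one valid ID) per duplicated trip
trips_dedup = trips_df %>%
  group_by(start_time, stop_time, start_station_id, end_station_id) %>%
  slice(1) %>%
  ungroup()
nrow(trips_dedup) / nrow(trips_df) # should reproduce the ~0.827 ratio above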
Thanks @Robinlovelace, and sorry this has taken so long to get around to. I'm trying to get the package back on CRAN, but having a bit of trouble there. Once it's back up, I'll get straight onto this.
Aha, didn't even realise it was down! Hope this info is of use and not a hassle. We'll share any code we write that may be of use to you: not huge changes, but fiddly ones that TfL's ad-hoc data file naming system seems to have caused!
As detected by @rogerbeecham. A two-line fix I came up with is this:
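The actual two-line patch is not preserved in this transcript. Purely as a hypothetical illustration of the kind of gsub()-based file-name normalisation being discussed (file names invented for the example):

# hypothetical file names, invented for illustration; the real patch is not shown here
f = c("01aJourneyDataExtract10Jan16-23Jan16.csv",
      "02JourneyDataExtract24Jan16-06Feb16.csv")
gsub("^[0-9]+[a-z]?", "", f) # strip the inconsistent numeric/letter prefixes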