ropensci / bikedata

:bike: Extract data from public hire bicycle systems
https://docs.ropensci.org/bikedata
82 stars 17 forks source link

bikedata misses numerous csv files from London #92

Open Robinlovelace opened 5 years ago

Robinlovelace commented 5 years ago

As detected by @rogerbeecham. A 2 line fix I came up with is this:

csvnames2 = str_replace_all(csvnames, "\\.+(?!csv?)|_|\\s", replacement = "")
file.rename(csvnames, csvnames2)
Robinlovelace commented 5 years ago

Should also work with gsub().

Robinlovelace commented 5 years ago

Also, seems 2014 data is duplicated:

image

mpadge commented 5 years ago

Thanks @robinlovelace, I'll check it out

Robinlovelace commented 5 years ago

May be able to submit a patch.

Robinlovelace commented 5 years ago

Updated cleaned data, with missing ids in grey:

image

Robinlovelace commented 5 years ago

Ground truth data from TfL daily counts, cc @rogerbeecham.

library(ggplot2)
    u = "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xls"
    download.file(u, "tfl-daily-cycle-hires.xls")
    daily_hires_schema = readxl::read_excel("tfl-daily-cycle-hires.xls")
    daily_hires = readxl::read_excel("tfl-daily-cycle-hires.xls", sheet = 2)
#> New names:
#> * `Number of Bicycle Hires` -> `Number of Bicycle Hires...2`
#> * `` -> ...3
#> * Month -> Month...4
#> * `Number of Bicycle Hires` -> `Number of Bicycle Hires...5`
#> * `` -> ...6
#> * … and 3 more problems
    range(daily_hires$Day)
#> [1] "2010-07-30 UTC" "2019-07-31 UTC"
    names(daily_hires)[2] = "Number of hires"
    daily_hires$Monthly = RcppRoll::roll_mean(x = daily_hires$`Number of hires`, n = 30, fill = NA)
    daily_hires$Yearly = RcppRoll::roll_mean(x = daily_hires$`Number of hires`, n = 365, fill = NA)
    ggplot(daily_hires, aes(Day, `Number of hires`)) +
        geom_point(alpha = 0.1) +
        geom_line(aes(Day, Monthly), lwd = 1) +
        geom_line(aes(Day, Yearly), colour = "blue", lwd = 1) +
        xlab("Year") +
        ylim(c(0, 50000)) +
        xlim(as.POSIXlt(c("2010-01-01", "2019-10-01")))
#> Warning: Removed 2 rows containing missing values (geom_point).
#> Warning: Removed 29 rows containing missing values (geom_path).
#> Warning: Removed 364 rows containing missing values (geom_path).

Created on 2019-08-29 by the reprex package (v0.3.0)

mpadge commented 5 years ago

Thanks @robinlovelace, I'll clean all these up next week. For London, it's usually just minor formatting and file naming stuff, so shouldn't be too hard

Robinlovelace commented 4 years ago

Another illustration of this - should be able to find duplicate files that go into the DB with this:

image

Robinlovelace commented 4 years ago

The distinct() function in dplyr can find them, frustratingly the trip ids differ:

> nrow(trips_df_duplicated) / nrow(trips_df) # 
[1] 0.824824
> 
> 
> trips_df_duplicated = distinct(trips_df %>% select(start_time, stop_time, start_station_id, end_station_id))
> nrow(trips_df_duplicated) / nrow(trips_df) # 
[1] 0.8273564
mpadge commented 4 years ago

Thnx @Robinlovelace, and sorry this has taken so long to get around to. I'm trying to get the pkg back n CRAN, but having a bit of trouble there. Once it's back up, i'll get straight onto this

Robinlovelace commented 4 years ago

Aha, didn't even realise it was down! Hope this info is of use and not a hassle, will share with you any code we write that may be of use to you - not huge changes, but fiddly ones that TfL's ad-hoc data file naming systems seems to have caused!