Closed abrac closed 3 years ago
Hi @abrac. Thank you for opening this issue. What version of the gtfs2gps package are you using? Also, could you please share your sessionInfo() in a comment below?
Hi @rafapereirabr! I'm so sorry for the delay! The email notification went to my junk folder.
I am using the latest release, v1.5.0. I installed it using the install.packages("gtfs2gps") command.
Here is the sessionInfo():
Hi again. I tried downgrading to version v1.4.0. You won't believe it, but now the issue is fixed in the sao example, but the issue pops up in the poa example. So, it's the opposite of what happened in my original post. 🤔
Here are the results of the vignette when I run it with gtfs2gps v1.4.0:
However, for my GTFS dataset, I still get NA values for the departure times when I run gtfs2gps(). So downgrading to v1.4.0 didn't fix my issue.
Just an update: I tried a few more things. I tried downgrading to version 1.3.2. That didn't work. I then saw @pedro-andrade-inpe's latest merge request: #208. I compiled it from source and tried using that version, but it also didn't work. I am not sure what the problem is, but perhaps it is a problem with my data...? I will try to figure it out, but I'm not sure where to start. Just to clarify my problem: no matter which version of gtfs2gps I use, the departure_time and cum_time columns are NA for all rows. Although I didn't mention it before, the cum_time column is the one that I need the most.
@abrac My last merge was mostly related to an issue opened by CRAN, but I also fixed a small issue related to departure_time.
After updating the CRAN version this issue will be my priority.
In my experience, several gtfs2gps::gtfs2gps() outputs contain NA cum_time values because the departure_time is not available for some route segments (valid stop_times).
This means that we can't estimate the difference in time (and consequently the average speed) between valid stop_times.
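A quick way to check whether this applies to a given feed is to count the missing departure_time values in its stop_times table. The sketch below uses a small made-up table in place of a real feed; the column names follow the GTFS stop_times.txt file, and everything else (the values, the `na_summary` and `usable` names) is only an assumption for illustration.

```r
library(data.table)

# Toy stand-in for the stop_times table of a GTFS feed (values are made up).
stop_times <- data.table(
  trip_id        = c("A", "A", "A", "B", "B", "B"),
  stop_sequence  = c(1L, 2L, 3L, 1L, 2L, 3L),
  departure_time = c("08:00:00", NA, "08:10:00", NA, NA, NA)
)

# Count missing (NA or empty) departure times per trip.
na_summary <- stop_times[, .(
  n_stops   = .N,
  n_missing = sum(is.na(departure_time) | departure_time == "")
), by = trip_id]

# A trip needs at least two timed stops before any speed can be computed.
na_summary[, usable := (n_stops - n_missing) >= 2]
print(na_summary)
```

Here trip A still has two timed stops, so a speed can be estimated between them, while trip B has none and would come out with NA speeds and cumulative times.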
One of the ways I get around these cum_time problems is to replace invalid speeds with valid ones.
In this example, I replace the problematic speeds with the mean value of the valid ones.
library(data.table)
library(magrittr)
spo_gps <- gtfs2gps::read_gtfs(system.file("extdata/saopaulo.zip", package = "gtfs2gps")) %>%
  gtfs2gps::filter_by_shape_id(c("52421", "52857")) %>%
  gtfs2gps::filter_single_trip() %>%
  gtfs2gps::gtfs2gps() %>%
  base::suppressMessages()
head(spo_gps, 2)
#> id shape_id trip_id trip_number route_type shape_pt_lon shape_pt_lat
#> 1: 1 52421 121G-10-0 1 3 -46.56379 -23.52212
#> 2: 2 52421 121G-10-0 1 3 -46.56353 -23.52198
#> departure_time stop_id stop_sequence dist cumdist cumtime
#> 1: <NA> <NA> NA 0.00000 [m] 0.00000 [m] NA [s]
#> 2: <NA> <NA> NA 30.46157 [m] 30.46157 [m] NA [s]
#> speed
#> 1: NA [km/h]
#> 2: NA [km/h]
# replace problematic speeds (Inf, NA, NaN) with NA
spo_gps[, speed := as.numeric(speed)]
spo_gps[!is.finite(speed), speed := NA]
spo_gps[speed > 80 | speed < 2, speed := NA] # too slow or too fast
# fill NA speed values with the mean speed
# (spo_gps$speed refers to the whole column, so this is the overall mean even with `by`)
spo_gps[is.na(speed), speed := mean(spo_gps$speed, na.rm = TRUE), by = .(shape_id)]
spo_gps[, speed := units::set_units(speed, "km/h")]
# travel time
spo_gps[, time := (dist / speed)]
# define a new cumtime
spo_gps[, cumtime_new := cumsum(time), by = .(shape_id, trip_id, trip_number)]
spo_gps[, cumtime_new := units::set_units(cumtime_new, "s")]
head(spo_gps)
#> id shape_id trip_id trip_number route_type shape_pt_lon shape_pt_lat
#> 1: 1 52421 121G-10-0 1 3 -46.56379 -23.52212
#> 2: 2 52421 121G-10-0 1 3 -46.56353 -23.52198
#> 3: 3 52421 121G-10-0 1 3 -46.56327 -23.52184
#> 4: 4 52421 121G-10-0 1 3 -46.56302 -23.52171
#> 5: 5 52421 121G-10-0 1 3 -46.56276 -23.52157
#> 6: 6 52421 121G-10-0 1 3 -46.56249 -23.52143
#> departure_time stop_id stop_sequence dist cumdist cumtime
#> 1: <NA> <NA> NA 0.00000 [m] 0.00000 [m] NA [s]
#> 2: <NA> <NA> NA 30.46157 [m] 30.46157 [m] NA [s]
#> 3: <NA> 910000819 1 30.46157 [m] 60.92315 [m] NA [s]
#> 4: <NA> <NA> NA 30.46157 [m] 91.38472 [m] NA [s]
#> 5: <NA> <NA> NA 30.46157 [m] 121.84630 [m] NA [s]
#> 6: <NA> 910000820 2 31.45530 [m] 153.30159 [m] NA [s]
#> speed time cumtime_new
#> 1: 15.510057 [km/h] 0.000000000 [h] 0.000000 [s]
#> 2: 15.510057 [km/h] 0.001963989 [h] 7.070359 [s]
#> 3: 3.695138 [km/h] 0.008243691 [h] 36.747646 [s]
#> 4: 3.695138 [km/h] 0.008243691 [h] 66.424934 [s]
#> 5: 3.695138 [km/h] 0.008243691 [h] 96.102222 [s]
#> 6: 6.630198 [km/h] 0.004744247 [h] 113.181511 [s]
With some more work you can reestimate the departure_time values based on the new cum_times.
I hope this helps a little bit.
Created on 2021-09-03 by the reprex package (v2.0.0)
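As a rough illustration of that last step, the sketch below rebuilds departure times by adding the new cumulative time to a known trip start. It is only a sketch under stated assumptions: the toy table, the `trip_start` value, and the helpers `hms_to_secs()` and `secs_to_hms()` are hypothetical and not part of gtfs2gps. In a real feed the start time would have to come from the first valid departure_time of the trip or from frequencies.txt.

```r
library(data.table)

# Hypothetical helper: "HH:MM:SS" (single string) -> seconds since midnight.
hms_to_secs <- function(hms) {
  parts <- as.integer(strsplit(hms, ":", fixed = TRUE)[[1]])
  parts[1] * 3600L + parts[2] * 60L + parts[3]
}

# Hypothetical helper: seconds since midnight -> "HH:MM:SS".
# Hours are not wrapped at 24, since GTFS allows times past midnight.
secs_to_hms <- function(secs) {
  secs <- as.integer(round(secs))
  sprintf("%02d:%02d:%02d", secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
}

# Toy GPS-like table: one trip, cumtime_new in seconds (made-up values).
gps <- data.table(
  trip_id     = "T1",
  cumtime_new = c(0, 7, 37, 66, 96, 113)
)

# Assumed known start of the trip.
trip_start <- hms_to_secs("08:00:00")

# as.numeric() drops a units class if cumtime_new carries one.
gps[, departure_time_new := secs_to_hms(trip_start + as.numeric(cumtime_new))]
print(gps)
```

The new column is kept separate from departure_time so that the original values, where they exist, are not overwritten.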
Thank you everyone for your help!
@Joaobazzo: Your code snippets helped a ton! I wanted to do something like that, but wasn't sure how, since I am not familiar with R. I have used your code on my dataset. 👍🏽
Thanks for sharing the solution, @Joaobazzo. Given @abrac's question and the fact that GTFS feeds often have data quality problems, I'm wondering whether we could add a parameter to the gtfs2gps() function to address this issue.
This could be, for example, a speed_correction parameter where the options would be:
- none: the function makes no correction to the data. In this case, the output might show NA results for the columns departure_time, cumtime and speed if the original gtfs input data has quality problems in the departure_time column.
- mean: if the original gtfs input data has quality problems in the departure_time column that prevent calculating a vehicle speed between a pair of consecutive stops for a given trip_id, then the function imputes to that segment the average speed observed for that trip_id.
- value: a fixed numeric speed value that will be imputed to all segments with problematic speeds, for all trip_id values.
It would also be great if we could allow the user to set minimum and maximum speeds that would trigger those corrections, but we need to think of a simple way to expose these parameters to users.
From a code development point of view, implementing this would be relatively simple. We can create a new support function that wraps up @Joaobazzo's code. We would only need to apply this support function to the output object inside the gtfs2gps() function, right before returning the output.
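To make the proposal concrete, here is a minimal sketch of what such a support function could look like. The function name `correct_speed()`, its arguments, and the default thresholds (taken from the 2 and 80 km/h limits used in the snippet earlier in this thread) are all hypothetical; this is not code from gtfs2gps. It also assumes `speed` is a plain numeric column in km/h, without a units class.

```r
library(data.table)

# Hypothetical helper, NOT part of gtfs2gps: sketches the proposed options.
correct_speed <- function(gps, speed_correction = "none",
                          min_speed = 2, max_speed = 80) {
  out <- copy(gps)  # avoid modifying the caller's table by reference
  if (identical(speed_correction, "none")) return(out)

  # Flag speeds that are missing, infinite, or outside the accepted range.
  out[, bad := !is.finite(speed) | speed < min_speed | speed > max_speed]

  if (identical(speed_correction, "mean")) {
    # Impute the mean of the valid speeds observed for the same trip_id.
    # A trip with no valid speed at all ends up with NaN.
    out[, speed := ifelse(bad, mean(speed[!bad]), speed), by = trip_id]
  } else if (is.numeric(speed_correction) && length(speed_correction) == 1L) {
    # Impute one fixed value to every problematic segment.
    out[(bad), speed := speed_correction]
  } else {
    stop("speed_correction must be \"none\", \"mean\", or a single number")
  }
  out[, bad := NULL]
  out[]
}

# Made-up speeds for two trips, including NA, Inf and an implausible value.
gps <- data.table(
  trip_id = c("T1", "T1", "T1", "T1", "T2", "T2"),
  speed   = c(20, NA, 30, 150, 10, Inf)
)

fixed_mean  <- correct_speed(gps, "mean")  # per-trip mean of valid speeds
fixed_value <- correct_speed(gps, 25)      # fixed 25 km/h everywhere
```

Exposing `min_speed` and `max_speed` as plain arguments with defaults is one simple way to let users control which speeds trigger a correction.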
Hi all. I just wanted to ask a follow up question. @Joaobazzo previously mentioned:
with some more work you can reestimate the departure_time values based on the new cum_times.
I wanted to ask: will the correction of departure_time values also be done through the changes proposed by @rafapereirabr?
Hi @abrac. Yes, the function will update the departure_time column accordingly.
Hi @abrac, we have added to the package a new function adjust_speed() to address this issue. Once you have the GPS-like output of the gtfs2gps() function, you can use adjust_speed() to 'fix' those problematic trips. Please have a look at the new function and let us know what you think.
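For readers landing here, a minimal usage sketch of that workflow could look like the following. It assumes an installed gtfs2gps version that already includes adjust_speed(), uses the poa sample feed shipped with the package, and calls adjust_speed() with its default arguments only; see ?adjust_speed for the options available in your version.

```r
# Requires a gtfs2gps version that includes adjust_speed().
library(gtfs2gps)

# Read the sample feed bundled with the package.
poa <- read_gtfs(system.file("extdata/poa.zip", package = "gtfs2gps"))

# Keep a single trip per shape_id so the example runs quickly.
poa <- filter_single_trip(poa)

# Convert to the GPS-like format.
poa_gps <- gtfs2gps(poa)

# Fix problematic speeds using the function's default settings.
poa_gps_fixed <- adjust_speed(poa_gps)
head(poa_gps_fixed)
```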
Wow, I see a lot of commits were made over the last few days! Thanks so much everyone! I will try it out today and let you know 😊.
I'm so sorry for the delay! I've tested it with two datasets that I am currently working with, and it worked perfectly! 🙌🏽
I will continue using this new version of gtfs2gps, as I process the remaining 6 or 7 datasets that I am planning to work with. Will let you know if I come across any problems.
I realized that the two datasets which I tested with worked fine in the previous version of gtfs2gps too. So in other words, they did not have the quality issues which caused me to open this thread.
So, now I've tested with the original dataset I was using when I opened this thread. Although the adjust_speed function is re-calculating the cumtime and speed columns correctly, it is not updating the departure_time column.
I'm leaving this issue closed, as the departure_time is not too important for me anymore. However, just thought I would report the issue just in case.
Hi there, I tried using the gtfs2gps() function on my dataset, but I got NA values for all the departure times. When I tried running the vignette, I got the same issue with the sao dataset, but not with the poa dataset. Here is the output of the vignette when I run it on my machine: