pedalea / pctSantiago

Place to test the Propensity to Cycle Tool methods in Santiago

OD data seems messy #1

Open Robinlovelace opened 5 years ago

Robinlovelace commented 5 years ago

Thinking that aggregating the most common destinations could be a way around this.
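A minimal sketch of that idea, assuming the OD data can be reduced to one row per origin-destination pair with a trip count. The column names (`origin`, `destination`, `trips`), the toy values, and the cut-off of two destinations are all made up for illustration; they are not taken from the Santiago data.

```r
library(dplyr)

# Hypothetical OD table: one row per origin-destination pair
od = data.frame(
  origin = c("a", "b", "a", "c", "b", "c"),
  destination = c("centre", "centre", "mall", "mall", "park", "school"),
  trips = c(120, 80, 40, 30, 3, 2),
  stringsAsFactors = FALSE
)

# Rank destinations by the total number of trips they attract
top_dest = od %>%
  group_by(destination) %>%
  summarise(trips = sum(trips)) %>%
  arrange(desc(trips)) %>%
  head(2) %>%            # arbitrary cut-off, to be tuned on real data
  pull(destination)

# Keep only the flows that end at the most common destinations
od_top = od %>%
  filter(destination %in% top_dest)
```

Filtering this way drops the long tail of rarely used destinations, which may make the remaining desire lines easier to read.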

Robinlovelace commented 5 years ago

Could also explore this by using other input datasets, e.g.:

NachoTiznado commented 5 years ago

Hi Robin,

The first and second links point to the same data (both from 2012). The second link has the tables in .csv format. I used those to join the tables and create the .csv that I sent you by email. So it may be possible to import the data from the website and then join the tables in R, instead of directly using the .csv that I created with Excel and SQL.
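A rough sketch of what joining the tables in R could look like, assuming one table of trips and one table of people linked by a shared ID. The table and column names (`trips`, `persons`, `person_id`, `zone_o`, `zone_d`, `weight`) are hypothetical and would need to be replaced by the ones used in the survey files.

```r
library(dplyr)

# Hypothetical trip table: one row per trip
trips = data.frame(
  person_id = c(1, 1, 2, 3),
  zone_o = c("A", "A", "B", "A"),
  zone_d = c("B", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical person table with an expansion weight per respondent
persons = data.frame(
  person_id = c(1, 2, 3),
  weight = c(10, 20, 5)
)

# Attach person attributes to each trip
trips_joined = left_join(trips, persons, by = "person_id")

# Aggregate to weighted origin-destination flows
od = trips_joined %>%
  count(zone_o, zone_d, wt = weight, name = "trips")
```

Doing the join in a script rather than in Excel and SQL would keep every step from the raw tables to the OD matrix reproducible.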

Robinlovelace commented 5 years ago

Thanks @NachoTiznado, makes sense. Looking forward to exploring this from 1pm today.

robsalasco commented 5 years ago

A paper using cellular data was published recently (https://royalsocietypublishing.org/doi/full/10.1098/rsos.180749), and you can see how the data is organized here: https://datadryad.org/resource/doi:10.5061/dryad.9p4r16m

Robin, can you have a look and tell me whether it could be suitable for use in pct? I'll ask for permission...

Robinlovelace commented 5 years ago

Can do - looking now.

Robinlovelace commented 5 years ago

Result: does not look that useful. See reprex below. Am I missing something?

# Aim: explore data from cellphone study
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

u = "https://datadryad.org/bitstream/handle/10255/dryad.180116/HW_20days_dataset.csv"
cell = readr::read_csv2(u)
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   numa_id = col_double(),
#>   tower_w = col_character(),
#>   tower_h = col_character(),
#>   X_w = col_number(),
#>   Y_w = col_number(),
#>   X_h = col_number(),
#>   Y_h = col_number(),
#>   Distance = col_number(),
#>   angle = col_character()
#> )
summary(cell)
#>     numa_id         tower_w            tower_h         
#>  Min.   :     1   Length:346638      Length:346638     
#>  1st Qu.: 86660   Class :character   Class :character  
#>  Median :173320   Mode  :character   Mode  :character  
#>  Mean   :173320                                        
#>  3rd Qu.:259979                                        
#>  Max.   :346638                                        
#>       X_w                 Y_w                 X_h           
#>  Min.   :3.455e+07   Min.   :6.301e+06   Min.   :3.455e+07  
#>  1st Qu.:3.433e+09   1st Qu.:6.289e+09   1st Qu.:3.419e+09  
#>  Median :3.471e+09   Median :6.297e+09   Median :3.467e+09  
#>  Mean   :3.218e+09   Mean   :5.749e+09   Mean   :3.255e+09  
#>  3rd Qu.:3.513e+09   3rd Qu.:6.300e+09   3rd Qu.:3.516e+09  
#>  Max.   :3.603e+09   Max.   :6.310e+09   Max.   :3.603e+09  
#>       Y_h               Distance            angle          
#>  Min.   :6.301e+06   Min.   :3.864e+04   Length:346638     
#>  1st Qu.:6.286e+09   1st Qu.:1.334e+09   Class :character  
#>  Median :6.294e+09   Median :2.339e+09   Mode  :character  
#>  Mean   :5.725e+09   Mean   :3.564e+09                     
#>  3rd Qu.:6.299e+09   3rd Qu.:5.731e+09                     
#>  Max.   :6.310e+09   Max.   :1.000e+10

# cellagg = cell %>% 
#   group_by(X_w, Y_w, X_h, Y_h) %>% 
#   summarise(n = n())
cellagg = cell %>% 
  group_by(tower_h, tower_w) %>% 
  summarise(X_h = mean(X_h), Y_h = mean(Y_h), X_w = mean(X_w), Y_w = mean(Y_w), n = n())
hist(cellagg$n)
summary(cellagg$n)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.000   1.000   2.269   2.000 228.000
cellagg_sub = cellagg %>% 
  filter(n > 50)
celld = stplanr::od_coords2line(odc = cellagg_sub[3:6], crs = 5361)

plot(celld)

mapview::mapview(celld)

Created on 2019-03-20 by the reprex package (v0.2.1)
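One thing that may be worth checking: the `summary()` output above shows coordinates of around 3e+09 and 6e+09, whereas projected coordinates for Santiago would be expected to be around 3e+05 and 6e+06. That pattern suggests, though this is only a guess from the output, that the file uses `;` as the delimiter but `.` as the decimal mark, so `read_csv2()` treats the `.` as a grouping mark. The sketch below illustrates the difference on a made-up two-line file; the values and column names in it are invented.

```r
library(readr)

# Made-up file with ';' as delimiter and '.' as decimal mark
tmp = tempfile(fileext = ".csv")
writeLines(c("tower;X;Y", "t1;345512.25;6301000.5"), tmp)

# read_csv2() assumes ',' as decimal mark and '.' as grouping mark,
# so the decimal point may be dropped from the parsed values
bad = suppressMessages(read_csv2(tmp))

# read_delim() with an explicit locale keeps '.' as the decimal mark
good = suppressMessages(
  read_delim(tmp, delim = ";", locale = locale(decimal_mark = "."))
)

# The same call could be tried on the real file (untested assumption):
# cell = read_delim(u, delim = ";", locale = locale(decimal_mark = "."))
```

If the guess is right, re-reading the data this way should bring the coordinates into a plausible range for `crs = 5361`.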

NachoTiznado commented 5 years ago

test

plot(1:9)

Created on 2019-03-20 by the reprex package (v0.2.1)

robsalasco commented 5 years ago

The data needs preprocessing because the records are network events (calls, SMS, or data transfers) from/to the cellphone towers (the X and Y columns are tower locations), so the exact origin location needs to be estimated using the other columns (distance and angle) in the database.
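As a very rough sketch of that estimation step, assuming the distance is in the same units as the projected coordinates (metres) and the angle is in degrees measured counter-clockwise from east. Neither convention has been checked against the dataset documentation, and `offset_point()` is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: shift a tower position by a distance and an angle
offset_point = function(x, y, distance, angle_deg) {
  theta = angle_deg * pi / 180   # degrees to radians
  data.frame(
    x = x + distance * cos(theta),
    y = y + distance * sin(theta)
  )
}

# Made-up tower position, offset by 1 km at 90 degrees (due north here)
est = offset_point(x = 345000, y = 6300000, distance = 1000, angle_deg = 90)
```

If the angle turns out to be a compass bearing (clockwise from north), the `sin()` and `cos()` terms would need to be swapped.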

Robinlovelace commented 5 years ago

Aha I see, that makes sense. Do you know of a way to get a more accurate estimate of the locations of the origins and destinations?

robsalasco commented 5 years ago

I have already had a look at a paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4970143/