Open Robinlovelace opened 5 years ago
Could also explore this by using other input datasets, e.g.:
- 2014 data (we may already be using this): http://www.sectra.gob.cl/biblioteca/detalle1.asp?mfn=3253
- 2012 data (18,000 homes): http://datos.gob.cl/dataset/31616
- Combining the two
- Other sources (ideas @robsalasco )?
Hi Robin,
The first and second links point to the same data (both from 2012). The second link provides the tables in .csv format. I used those to join the tables and create the .csv that I sent you by email. So it may be possible to import the data from the website and then join the tables in R, instead of directly using the .csv that I created with Excel and SQL.
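A minimal sketch of what joining the tables in R could look like. The two data frames and the `household_id` key are hypothetical stand-ins, since the real table and column names in the 2012 dataset are not given in this thread.

```r
# Hypothetical stand-ins for two of the survey tables; the real tables and
# their column names will differ.
households = data.frame(
  household_id = c(1, 2, 3),
  zone = c("A", "B", "A")
)
trips = data.frame(
  trip_id = 1:4,
  household_id = c(1, 1, 2, 3),
  mode = c("bike", "walk", "bus", "bike")
)

# Left join: keep every trip and attach the matching household attributes
trips_joined = merge(trips, households, by = "household_id", all.x = TRUE)
print(trips_joined)
```

With dplyr loaded, `left_join(trips, households, by = "household_id")` should give the same rows, which may fit better with the rest of the code here.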
Thanks @NachoTiznado, makes sense. Looking forward to exploring this from 1pm today.
A paper using cellular data was published recently (https://royalsocietypublishing.org/doi/full/10.1098/rsos.180749), and you can see how the data is organized here: https://datadryad.org/resource/doi:10.5061/dryad.9p4r16m
Robin, can you have a look and tell me whether it would be suitable for use in pct? I'll ask for permission...
Can do - looking now.
Result: does not look that useful. See reprex below. Am I missing something?
# Aim: explore data from cellphone study
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
u = "https://datadryad.org/bitstream/handle/10255/dryad.180116/HW_20days_dataset.csv"
cell = readr::read_csv2(u)
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#> numa_id = col_double(),
#> tower_w = col_character(),
#> tower_h = col_character(),
#> X_w = col_number(),
#> Y_w = col_number(),
#> X_h = col_number(),
#> Y_h = col_number(),
#> Distance = col_number(),
#> angle = col_character()
#> )
summary(cell)
#> numa_id tower_w tower_h
#> Min. : 1 Length:346638 Length:346638
#> 1st Qu.: 86660 Class :character Class :character
#> Median :173320 Mode :character Mode :character
#> Mean :173320
#> 3rd Qu.:259979
#> Max. :346638
#> X_w Y_w X_h
#> Min. :3.455e+07 Min. :6.301e+06 Min. :3.455e+07
#> 1st Qu.:3.433e+09 1st Qu.:6.289e+09 1st Qu.:3.419e+09
#> Median :3.471e+09 Median :6.297e+09 Median :3.467e+09
#> Mean :3.218e+09 Mean :5.749e+09 Mean :3.255e+09
#> 3rd Qu.:3.513e+09 3rd Qu.:6.300e+09 3rd Qu.:3.516e+09
#> Max. :3.603e+09 Max. :6.310e+09 Max. :3.603e+09
#> Y_h Distance angle
#> Min. :6.301e+06 Min. :3.864e+04 Length:346638
#> 1st Qu.:6.286e+09 1st Qu.:1.334e+09 Class :character
#> Median :6.294e+09 Median :2.339e+09 Mode :character
#> Mean :5.725e+09 Mean :3.564e+09
#> 3rd Qu.:6.299e+09 3rd Qu.:5.731e+09
#> Max. :6.310e+09 Max. :1.000e+10
# cellagg = cell %>%
# group_by(X_w, Y_w, X_h, Y_h) %>%
# summarise(n = n())
cellagg = cell %>%
group_by(tower_h, tower_w) %>%
summarise(X_h = mean(X_h), Y_h = mean(Y_h), X_w = mean(X_w), Y_w = mean(Y_w), n = n())
hist(cellagg$n)
summary(cellagg$n)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.000 1.000 1.000 2.269 2.000 228.000
cellagg_sub = cellagg %>%
filter(n > 50)
celld = stplanr::od_coords2line(odc = cellagg_sub[3:6], crs = 5361)
plot(celld)
mapview::mapview(celld)
Created on 2019-03-20 by the reprex package (v0.2.1)
The data needs preprocessing because the records are network events (calls, SMS, or data transfers) from/to the cellphone towers (the X and Y columns are tower locations), so the exact origin location needs to be estimated using the other columns (distance and angle) in the dataset.
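One possible reading of that preprocessing step, sketched with made-up numbers: offset each tower position by the distance along the angle. This assumes that the angle is in degrees measured counter-clockwise from east, and that the distance is in the same units as the projected coordinates. Neither assumption is confirmed by the dataset documentation, and `estimate_location()` is a hypothetical helper, not part of any package.

```r
# Offset a tower position by a distance and an angle to estimate a location.
# ASSUMPTIONS (not confirmed by the dataset): angle is in degrees, measured
# counter-clockwise from east; distance is in the units of the coordinates.
estimate_location = function(x, y, distance, angle_deg) {
  angle_rad = angle_deg * pi / 180
  data.frame(
    x_est = x + distance * cos(angle_rad),
    y_est = y + distance * sin(angle_rad)
  )
}

# Made-up tower records for illustration only
towers = data.frame(
  x = c(345000, 346000),
  y = c(6300000, 6301000),
  distance = c(1000, 500),
  angle = c(0, 90)
)
est = estimate_location(towers$x, towers$y, towers$distance, towers$angle)
print(est)
```

If the angle turns out to be a compass bearing (clockwise from north), the `cos()` and `sin()` terms would need to be swapped.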
Aha I see, that makes sense. Do you know of a way to get a more accurate estimate of the locations of the origins and destinations?
I have already had a look at this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4970143/
Thinking that aggregating the most common destinations could be a way around this.
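A rough sketch of that aggregation idea, using toy data rather than the real file: count users per home/work tower pair, then keep the most common work tower for each home tower. The tower IDs below are made up; only the `tower_h` and `tower_w` column names come from the dataset.

```r
# Toy home/work tower pairs; tower IDs are made up for illustration
cell_toy = data.frame(
  tower_h = c("h1", "h1", "h1", "h2", "h2", "h2"),
  tower_w = c("w1", "w1", "w2", "w3", "w3", "w1")
)

# Count the number of records per home-work tower pair
pairs = aggregate(n ~ tower_h + tower_w, data = transform(cell_toy, n = 1), FUN = sum)

# Sort so that the largest flow from each home tower comes first
pairs = pairs[order(pairs$tower_h, -pairs$n), ]

# Keep the first (most common) destination for each home tower
top_dest = pairs[!duplicated(pairs$tower_h), ]
print(top_dest)
```

Ties are broken arbitrarily here, so a real analysis might keep the top few destinations per home tower instead of only one.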