rOpenSpain / spanishoddata

Access national high-quality and open-access datasets on movement patterns derived from mobile telephone datasets
https://ropenspain.github.io/spanishoddata/
Other
22 stars 0 forks source link

V1 full support based on district level data only #52

Closed e-kotov closed 2 months ago

e-kotov commented 2 months ago

ok, all done, full support for V1 data and much more cool stuff.

Important: to simplify a few things, the v1 data is now downloaded as csv.gz instead of txt.gz. So before testing, please nuke your cache folder or rename all txt.gz files to csv.gz files. txt.gz files should not get in the way as the functions look specifically for csv.gz files, but if you do not remove them, they will just take up storage space.

Please install with

remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 build_vignettes = TRUE,
 force = TRUE
)

If the above fails for you because of vignettes, do:

Sys.setenv(PKG_BUILD_VIGNETTES=TRUE)
pak::pkg_install(
    "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
    upgrade = TRUE
)

Here's an overview:

flowchart TB
    A["For quick analysis of few dates\nwork with raw CSV.gz data"] -->|"spod_get(
    type = 'origin-destination',
    zones = 'districts',
    dates = c(start = '2020-02-14', end = '2020-03-14') )"
    | F["'tbl' object with 'id' for origins and destinations"]

    C["Analyse longer periods (several months)\nor even the whole dataset over several years"]
    -->|"spod_convert_for_analysis(
type = 'origin-destination',
    zones = 'districts',
    dates = c(start = '2020-02-14', end = '2021-05-09') )"| D["path to converted data"]
    D -->|"spod_connect_to_converted_data()" | F["'tbl' object with 'id' for origins and destinations"]

    F -->|"dplyr functions: select(), filter(), mutate(), group_by(), summarise(), etc..."| G["dplyr::collect()"]
    G --> H["Result: data.frame / tibble"] --> R[spatial data matched by 'id' with aggegated mobility flows]

    X["spatial data with zones"] --> |"spod_get_zones(
    zones = 'districts',
    ver = 1 )"| Y["polygons with zones in sf object\nwith 'id' that match with origins and destinations"] --> R

See the new vignette for testing all the functions.

@eugenividal I am very interested in your opinion on how well the concept of working with this virtual duckdb table that you get from spod_get() and spod_connect_to_converted_data() is explained in the vignette https://github.com/Robinlovelace/spanishoddata/blob/v1-full-support-based-on-district-level-data-only/vignettes/v1-2020-2021-mitma-data-codebook.qmd . Also, suggestions for better function names, specifically the ones that are user-facing functions covered in the vignette.

Robinlovelace commented 2 months ago

Woohoo, thanks @e-kotov! That is an impressive schematic of function and LOT of commits. Promising. I will give this a proper review by Monday evening latest.

Robinlovelace commented 2 months ago

Quickfire comment seeing the workflow error message:

  '::' or ':::' imports not declared from:
    ‘knitr’ ‘od’ ‘odjitter’ ‘osmactive’ ‘remotes’ ‘zonebuilder’
  'library' or 'require' calls not declared from:
    ‘remotes’ ‘tidyverse’ ‘tmap’

If you add those to 'Suggests', GitHub actions should become happier.

e-kotov commented 2 months ago

If you add those to 'Suggests', GitHub actions should become happier.

not really) https://github.com/Robinlovelace/spanishoddata/actions/runs/10516439464/job/29138644071?pr=52

anyway, it passes locally

e-kotov commented 2 months ago

@Robinlovelace are we on for the scheduled Monday meeeting?

Robinlovelace commented 2 months ago

@Robinlovelace are we on for the scheduled Monday meeeting?

Yes I can make that.

eugenividal commented 2 months ago

Hi @e-kotov. Thanks for sending this. I wanted to give you my feedback today, but it will take longer. So, if it is ok, I'll review this, like Robin, at the latest on Monday eve. I'll focus on giving you my opinion on this good-looking virtual table explained in the vignette "Codebook and Cookbook for v1", and thinking of better functions' names.

e-kotov commented 2 months ago

Ok, I made some minor changes and bug fixes. Important note, there are now three new vignettes:

I also simplified the README and froze all code execution, at least for now. Also to save energy of GitHub actions. All plots and tables are currently hardcoded in README. Also hidden advanced details.

Robinlovelace commented 2 months ago

Picking up on this now, great to see tests passing!

Robinlovelace commented 2 months ago

Checked-out with:

gh pr checkout 52
e-kotov commented 2 months ago

@eugenividal Great and thorough review. I will address you comments soon. Quick question, have you actually installed the package from this branch and tried to use it as described?

e-kotov commented 2 months ago

@eugenividal

If you consider it relevant, I can translate these two documents (they are short), so that we can presented them just as our English translations.

On one hand, that would be great, as machine translation is of course not perfect, on the other, it does not seem like a priority to me at the moment. Also, consider that methodology docs for v2 data are MUCH longer, and we would want them to be translated too if you do that for v1. So... Consider this before you commit to translating)

I see a middle ground in translating the most important bits of methodologies and the ones that you think the automatic translation got very wrong and including them into our vignettes. I think the text in the vignettes is more important, because the vignettes will be pagers on the pkgdown website of the package and will be indexed better by search engines than any PDFs we attach.

Robinlovelace commented 2 months ago

@e-kotov there seems to be duplication between these two vignettes:

ls vignettes/*convert*
vignettes/converting-the-data-for-faster-analysis.qmd  vignettes/download-and-convert-all-data.qmd

Also: not a fan of the filenames, how about download-convert or just convert.qmd: conversion implied download, right?

Robinlovelace commented 2 months ago

Suggestion: let's have a single codebook.qmd doc.

eugenividal commented 2 months ago

@eugenividal Great and thorough review. I will address you comments soon. Quick question, have you actually installed the package from this branch and tried to use it as described?

@e-kotov, I am glad you found my comments useful. I just tried installing the package from this branch and using it. I found some issues:

1) I had issues with the instructions you gave in the first comment of this PR. See screenshot below: Screenshot from 2024-08-26 12-31-19

2) I had some issues when running the code in the vignette. See the sequence of commands and messages below: Screenshot from 2024-08-26 12-28-10

I hope that helps.

e-kotov commented 2 months ago

Suggestion: let's have a single codebook.qmd doc.

For v1 and v2 in one file? I would refrain from that. In the end, the two datasets are very different. I would separate them on purpose to make it clear they are NOT compatible (and the official methodology doc for v2 emphasises, that the two datasets should not be compared directly, e.g. one cannot directly compare trip counts from 2020-2021 (v1) and 2022+ (v2). Also, in v2 data there will be many more data sets that are not available in v1.

e-kotov commented 2 months ago

I just tried installing the package from this branch and using it. I found some issues

@eugenividal

Thanks for that. Before installing the package try installing quarto:

install.packages("quarto")

Overall, this is not a big problem, as one can always install without vignettes. When the package is on CRAN, this will not be an issue. This is only relevant for installs from GitHub.

Since you were not able to install the package because of the errors in the first screenshot, you are not using the latest version and therefore the code from the vignettes will not work.

e-kotov commented 2 months ago

@eugenividal you can install without vignettes for now:

remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 force = TRUE
)

Without vignettes the spod_codebook will not work, but other functions should.

eugenividal commented 2 months ago

Thanks for that. Before installing the package try installing quarto:

@e-kotov After installing quarto and the spanishoddata package, I get the same error messages if I run each of the chunks separately. However, I can render v1-2020-2021-mitma-data-codebook.qmd

e-kotov commented 2 months ago

@eugenividal

I can render v1-2020-2021-mitma-data-codebook.qmd

Rendering works because no code is actually evaluated, the option that turns this off is in the header of the qmd file.

It seems like you still don't have the version from this particular branch.

If you run spod_get (yes without the round bracets):

spod_get

You should be getting in console:

function (type = c("od", "origin-destination", "os",
    "overnight_stays", "tpp", "trips_per_person"),
    zones = c("districts", "dist", "distr",
        "distritos", "municipalities", "muni",
        "municip", "municipios"), dates = NULL, data_dir = spod_get_data_dir(),
    quiet = FALSE, max_mem_gb = 3, max_n_cpu = parallelly::availableCores() -
        1, max_download_size_gb = 1, duckdb_target = ":memory:",
    temp_path = spod_get_temp_dir())
{

and the rest of the function code...

If your first lines of the output are different, you are not on the relevant package version.

In that case try:

remove.packages("spanishoddata")
remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 build_vignettes = FALSE,
 force = TRUE
)
eugenividal commented 2 months ago

@e-kotov

On one hand, that would be great, as machine translation is of course not perfect, on the other, it does not seem like a priority to me at the moment. Also, consider that methodology docs for v2 data are MUCH longer, and we would want them to be translated too if you do that for v1. So... Consider this before you commit to translating)

Agree👌 Not a priority, we can consider this later on.

eugenividal commented 2 months ago

@e-kotov

You should be getting in console:

This is what I get if I run spod_get Screenshot from 2024-08-26 16-05-22

In that case try:

Still the same error messages 😔

e-kotov commented 2 months ago

@eugenividal I don't know why, but for some reason you are getting the "old" version of the package from the main branch, instead of getting the one at the v1-full-support-based-on-district-level-data-only branch.

eugenividal commented 2 months ago

@e-kotov

@eugenividal I don't know why, but for some reason you are getting the "old" version of the package from the main branch, instead of getting the one at the v1-full-support-based-on-district-level-data-only branch.

Thanks for spotting the reason, I will try to find a way to fix it.

e-kotov commented 2 months ago

Thanks for spotting the reason, I will try to find a way to fix it.

Ideally,

remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 build_vignettes = FALSE,
 force = TRUE
)

should be installing the v1-full-support-based-on-district-level-data-only branch, hence the @v1-full-support-based-on-district-level-data-only... But why it fails for you, I have no clue :(

Robinlovelace commented 2 months ago

Suggestion: let's have a single codebook.qmd doc.

For v1 and v2 in one file? I would refrain from that. In the end, the two datasets are very different. I would separate them on purpose to make it clear they are NOT compatible (and the official methodology doc for v2 emphasises, that the two datasets should not be compared directly, e.g. one cannot directly compare trip counts from 2020-2021 (v1) and 2022+ (v2). Also, in v2 data there will be many more data sets that are not available in v1.

:+1:

Robinlovelace commented 2 months ago

Thanks for spotting the reason, I will try to find a way to fix it.

Ideally,

remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 build_vignettes = FALSE,
 force = TRUE
)

should be installing the v1-full-support-based-on-district-level-data-only branch, hence the @v1-full-support-based-on-district-level-data-only... But why it fails for you, I have no clue :(

You're missing Quarto, should be:

remotes::install_github(
 "Robinlovelace/spanishoddata@v1-full-support-based-on-district-level-data-only",
 build_vignettes = FALSE,
 force = TRUE,
  depedencies = TRUE
)

That will install the 'deps'.

Robinlovelace commented 2 months ago

@e-kotov good to go now assuming you're OK with #56. Non-controversial, and tested on V2 data for parquet and duckdb as you'll see. If you do want to change the pkg name let's do that as a separate PR, this one is already big!

e-kotov commented 2 months ago

@Robinlovelace reviewing and fixing a few typos as I go. Will be done soon. Regarding name change and separate PR - ok. Still thinking about it.

eugenividal commented 2 months ago

You're missing Quarto, should be:

Working!!!

e-kotov commented 2 months ago

@eugenividal just merged latest function name changes (not spod_convert and spod_connect, even if these are still controversial, at least they are short ;) ) So you may want to re-install the package once again using Robin's code with deps=true.

e-kotov commented 2 months ago

@Robinlovelace Ok, to me this PR seems mostly done. Thanks for improving and combining the vignettes and testing with v2 data. In a few hours I will implement @eugenividal comments on the wording of the codebook and merge this.

Robinlovelace commented 2 months ago

Great, let me know if you want me to postpone putting it out there but sounds like we're on-track for launch.

Robinlovelace commented 2 months ago

All on track to merge this by 15:00 UK time (16:00 in Germany I believe) @e-kotov ?

e-kotov commented 2 months ago

@Robinlovelace yes, cleaned up 99%. Ready. We don't have a logo)

e-kotov commented 2 months ago

Done. Let's see how the pkgdown website renders and looks live and perhaps will make a few tweaks.

e-kotov commented 2 months ago

Closes #3, #27, #33, #34, #35, #57. Also solves:

e-kotov commented 2 months ago

OK, now we are done. We have a working package,

a logo

and an opengraph card for social media

e-kotov commented 2 months ago

@Robinlovelace can you upload this https://github.com/Robinlovelace/spanishoddata/blob/main/man/figures/card.png?raw=true to Social Preview section of https://github.com/Robinlovelace/spanishoddata/settings ? I cannot access settings.

Robinlovelace commented 2 months ago

Sure!

eugenividal commented 2 months ago

@e-kotov The logo is super cool. I loved it! 🤩

Just a quick question: is that plot real? I meant does the arrow width represent the volume of trips? I'd have expected most trips in the triangle Madrid - Barcelona - Valencia

e-kotov commented 2 months ago

@eugenividal it is real. I would have spent the whole day to make it by hand) I will post the code a bit later, I want to find a good place to put it. Probably should not be in the package itself in the R folder, but in some other folder in the repo. These flows are aggregated and therefore what you see could be more of a visual artifact. There are not between cities, but between provinces.

eugenividal commented 2 months ago

@e-kotov thanks for your quick reply. That's interesting!

e-kotov commented 2 months ago

@eugenividal ah, also I did some flows aggregation as there's some bug in flowmapper that I did not have time to deal with

Screenshot 2024-08-28 at 10 22 29

So the lines in the logo actually combine flows for just 10 nodes and some provinces are aggregated into single point... The image above is with 19 nodes (max allowed by the package)

eugenividal commented 2 months ago

@e-kotov There are 17 autonomous communities and two autonomous cities (Ceuta and Melilla) in Spain. Perfect for the max allowed in the package!

Robinlovelace commented 2 months ago

Great set of comments @eugenividal. PR welcome, look forward to seeing @e-kotov's comments, your suggestions seem good to me, thank you!

e-kotov commented 2 months ago

@eugenividal Thanks for a thorough review and feedback. Working on updated version in a new branch here: https://github.com/Robinlovelace/spanishoddata/blob/minor-fixes-ek/vignettes/convert.qmd . Work-in-progress, so no new review necessary. I will do a PR when I'm ready.

eugenividal commented 2 months ago

@e-kotov No problem. Great! I'll be working on other stuff until Tuesday. Yes, please, send a new PR when it is ready and let me know from Tuesday if you would like me to look at any other aspects.