V1 full support - cancelled

e-kotov commented 2 months ago

@Robinlovelace Please don't review it just yet. I'm only opening the PR to see the R CMD check on Ubuntu. I will add more commits to this one before it truly is, as the name suggests, "V1 full support".

However, here's what's new:

* There is now a total download size check before any data is downloaded. By default, downloads of up to 1 GB proceed silently. There is also an internal util to query the file sizes. I suppose it is worth updating those cached files before every package release, but things should work fine, as I implemented basic file size imputation for new data.
* Further major simplification of internal SQL handling.
* Removed the old spod_get() and replaced it with a new one that can fetch any data and connect to it as a lazy table, see docs. It wraps around the functions tailored to individual data sets.
* Complete V1 data was supposed to be supported here (and it kind of is, see spod_duckdb_trips_per_person()), but for now I turned off the trips per person option for spod_get(). There is a problem that I need to look deeper into: basically, the IDs in the tables do not match those in the zones file for municipalities, and it is complicated. I started fixing it in spod_clean_zones_v1() by manually reassigning mismatches, but that is not enough. I will file it in https://github.com/e-kotov/mitma-data-issues a bit later; perhaps we will have to raise it with the data provider. More details to come.
* v1 data is now downloaded as csv.gz instead of txt.gz (though the delimiter is '|'...). I think this is better for consistency.
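
The download size check from the list above could work roughly as follows. This is a minimal sketch in Python for illustration only; the package itself is written in R, and the function names, the use of binary gigabytes, and imputing unknown sizes with the mean of the known ones are all assumptions, not the package's actual implementation.

```python
from statistics import mean

GB = 1024 ** 3  # assumption: binary gigabytes; the thread only says "1 GB"


def total_download_size(sizes_bytes):
    """Sum file sizes, imputing unknown ones (None) with the mean of the known ones."""
    known = [s for s in sizes_bytes if s is not None]
    if not known:
        raise ValueError("no known file sizes to impute from")
    fallback = mean(known)  # hypothetical imputation rule
    return sum(s if s is not None else fallback for s in sizes_bytes)


def needs_confirmation(sizes_bytes, max_silent_bytes=1 * GB):
    """Return True when the total exceeds the silent-download limit."""
    return total_download_size(sizes_bytes) > max_silent_bytes
```

A caller would run `needs_confirmation()` on the sizes of all requested files and only ask the user before downloading when it returns True.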

For now, this can be tested if you want, but don't merge or delete it just yet.

Robinlovelace commented 2 months ago

All sounds good to me. Maybe delete the man/figures files which conflict with the files on the main branch.

e-kotov commented 2 months ago

maybe delete the man/figures files which conflict with the files on the main branch.

I optimized the PNGs to 256 colours to reduce their size, as they were taking up almost 1 MB in the package, now just 300 KB.

e-kotov commented 2 months ago

Due to multiple issues with v1 municipal level data, I will keep this branch as storage for the large amount of draft code that introduces hacks to work around these issues. I got tired of introducing these hacks, and I probably won't solve all the issues.

Specific issues (a bit lost at some point...):

A better approach is to use the district level data, even though there is more to download, and just to re-aggregate it to MITMA municipalities (defined by the boundaries they supply) using reference tables relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. This way we are actually:

* saving space on disk if the user wants both district and municipal level data anyway
* adding new information to the municipal level flows (as we are re-aggregating from districts that have more columns with useful categories)
* removing clutter from our codebase

Disadvantages:

* there is more data to download (about 6 times more), as district-level data has more columns which the user may not need
* it requires a bit more compute on the user's side, as the data has to be re-aggregated on the fly from districts to municipalities, so workflows with municipalities will be slower.

However, the alternative, in my opinion, is not to provide the municipal level data at all and to inform the user to aggregate it themselves using the reference data relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. I would rather do it for the user, but print a warning.
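
The re-aggregation idea can be sketched as below, in Python for illustration only (the package is written in R). The function name, the zone IDs, and the shape of the lookup are hypothetical; in practice the district-to-municipality mapping would be built from the reference tables named above, whose column names are not given in this thread.

```python
import warnings
from collections import defaultdict


def aggregate_to_municipalities(district_flows, district_to_municipality):
    """Sum district-level OD flows into municipality-level flows.

    district_flows: iterable of (origin_district, destination_district, trips)
    district_to_municipality: dict mapping district ID -> municipality ID
    """
    # Tell the user that municipal flows are derived, not downloaded as-is.
    warnings.warn(
        "Municipal flows are re-aggregated from district-level data.",
        UserWarning,
        stacklevel=2,
    )
    totals = defaultdict(float)
    for origin, destination, trips in district_flows:
        key = (district_to_municipality[origin], district_to_municipality[destination])
        totals[key] += trips
    return dict(totals)
```

Because every district belongs to exactly one municipality in the lookup, each district-level row contributes to exactly one municipality pair, so no trips are dropped or double counted.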

e-kotov commented 2 months ago

I will duplicate this branch to clean up the code and introduce the alternative solution I mentioned that does not require file-level hacking.

Robinlovelace commented 2 months ago

I optimized the PNGs to 256 colours to reduce their size, as they were taking up almost 1 MB in the package, now just 300 KB.

How did you do that? Great tip!

300 KB still sounds like a lot btw.

Robinlovelace commented 2 months ago

A better approach is to use the district level data, even though there is more to download, and just to re-aggregate it to MITMA municipalities

Sounds sensible, I do this kind of thing also with UK data.

Robinlovelace commented 2 months ago

Disadvantages:
* there is more data to download (about 6 times more), as district-level data has more columns which the user may not need
* it requires a bit more compute on the user's side, as the data has to be re-aggregated on the fly from districts to municipalities, so workflows with municipalities will be slower.

However, the alternative, in my opinion, is not to provide the municipal level data at all and to inform the user to aggregate it themselves using the reference data relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. I would rather do it for the user, but print a warning.

These disadvantages are no big deal. In general my approach would be 'less is more' and "don't try to do too much for the users": reducing the package's functionality will make it easier to maintain long-term. But no harm in providing extra functionality for municipal-level data. I suspect most users will use 'district' level data, although some of those districts are bigger than the individual ones, right?

e-kotov commented 2 months ago

How did you do that? Great tip!

@Robinlovelace I use this:

pngquant --ext .png --speed 1 --force 256 '$SOURCE_SELECTION_PATHS'

It is best to wrap that into a script that works in a whole folder or on a list of files.
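
One way to wrap the command for a whole folder is sketched below, in Python for illustration. The helper names and the `dry_run` switch are hypothetical; the pngquant options are the ones from the command above, and pngquant is assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def pngquant_command(png_path):
    """Build the pngquant call from the command above for a single file."""
    return ["pngquant", "--ext", ".png", "--speed", "1", "--force", "256", str(png_path)]


def optimise_folder(folder, dry_run=False):
    """Quantise every .png directly inside `folder` to 256 colours, in place."""
    commands = [pngquant_command(p) for p in sorted(Path(folder).glob("*.png"))]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # needs pngquant on PATH
    return commands
```

Calling `optimise_folder("man/figures", dry_run=True)` only returns the commands that would be run, which is a cheap way to check the file list before overwriting anything.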

although some of those districts are bigger than the individual ones, right?

All districts are smaller than or equal to municipalities. So we are not losing any information.

Robinlovelace commented 2 months ago

Yes, I mean: are some of the 'districts' aggregations of a smaller official zoning system for districts, or are the districts in the package an official geographic zoning system in Spain?

@eugenividal may also know.

e-kotov commented 2 months ago

@Robinlovelace even districts are also sometimes aggregations of official census districts. There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.

Robinlovelace commented 2 months ago

@Robinlovelace even districts are also sometimes aggregations of official census districts. There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.

Makes sense. Do you know of any uses of the 'distritos' zones in the package outside of the package? I had the sense that it was a custom zoning system created for the OD data use case but not sure where from.

eugenividal commented 2 months ago

I agree. Districts are smaller than municipalities, but they do not exactly coincide with census districts, or at least not all of them. Is it maybe related to the cell tower locations? Not sure how they get the mobile phone positioning data.

Robinlovelace commented 2 months ago

agree. Districts are smaller than municipalities, but they do not exactly coincide with census districts

Just to be clear, do you agree that the 'distritos' zones are not used anywhere except in this package?

There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.

Thumbs up to that, what do you mean by "add to the clean zones" though?

Great that there's a lookup to smaller zones, that should be mentioned in the docs at least.

eugenividal commented 2 months ago

I am not 100% sure. I can look at what exactly the 'distritos' are tomorrow.

e-kotov commented 2 months ago

Thumbs up to that, what do you mean by "add to the clean zones" though?

In the cleaned up zones polygons I will add the ids of the actual census districts and municipalities that they correspond to.

I am not 100% sure. I can look at what exactly the 'distritos' are tomorrow.

They are definitely based on census districts (not tracts), but yes, some of them are larger and consist of several census districts because of data privacy. If smaller units were used, they would not be able to reveal the origin destination data at hourly intervals, because an individual would be identifiable.

Robinlovelace commented 2 months ago

They are definitely based on census districts (not tracts), but yes, some of them are larger and consist of several census districts because of data privacy. If smaller units were used, they would not be able to reveal the origin destination data at hourly intervals, because an individual would be identifiable.

👍 and as I recalled. It's worth stating that, e.g. "There are x,xxx zones in the 'distritos' zoning system provided by the package, compared with x,xxx municipalities in Spain [link] and x,xxx census districts of which the 'distritos' zoning system is composed", worth putting in as an issue? Could be a quick fix if you fancy making more tweaks to the docs @eugenividal as someone with local knowledge.

eugenividal commented 2 months ago

I had a look at the MITMA districts. The estudios basicos zoning is explained here:

“The base zoning of the study is made up of census districts and aggregations of these (in order to ensure compliance with current regulations on data protection) for the case of the territory of Spain and NUTS-3 zones for France and Portugal and 1 zone for the rest of foreign countries. This zoning presents a total of 3,743 zones for the national territory and 117 zones for France and Portugal and 1 zone for abroad that covers the rest of the world.

Based on this base zoning, two others are generated, one at the municipal level (including aggregations of municipalities for data protection) and another at the level of large urban areas (GAUs).”

There is this table that shows the relation between all the MITMA zones (districts, municipalities and GAUs) and the INE (Spanish National Statistics Institute) zones.

eugenividal commented 2 months ago

I will open an issue and will briefly explain this in the zones section of the README

e-kotov commented 2 months ago

@eugenividal I would suggest not including too many details in the README and creating a separate detailed cookbook/codebook page instead. I already started such a page for V1 data (2020-2021) in a new branch, you can see the work in progress here: https://github.com/Robinlovelace/spanishoddata/blob/v1-full-support-based-on-district-level-data-only/vignettes/v1-2020-2021-mitma-data-codebook.qmd

Ideally, I would have such a page for v1 and v2, and then only extract the key info to the README.md/qmd

eugenividal commented 2 months ago

Perfect! Thanks @e-kotov. I will include the zoning clarification on this page for the time being.

Robinlovelace commented 2 months ago

👍 to all this, many thanks for stepping up on this Eugeni!

rOpenSpain / spanishoddata

V1 full support - cancelled #42