Closed e-kotov closed 2 months ago
All sounds good to me. Maybe delete the man/figures files which conflict with the files on the main branch.
> maybe delete the man/figures files which conflict with the files on the main branch.
I optimized the PNGs to 256 colours to reduce their size: they were taking up almost 1 MB in the package and are now just 300 KB.
Due to multiple issues with v1 municipal level data, I will keep this branch as storage for the tons of draft code to introduce hacks to solve these issues. I got tired of introducing these hacks and I probably won't solve all the issues.
Specific issues (a bit lost at some point...):
A better approach is to use the district level data, even though there is more to download, and just to re-aggregate it to MITMA municipalities (defined by the boundaries they supply) using reference tables relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. This way we are actually:
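Disadvantages:
* there is more data to download (about 6 times more), as district-level data has more columns which the user may not need
* a bit more compute is required on the user side, as on-the-fly re-aggregation of data from districts to municipalities has to be done, so workflows with municipalities will be slower.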
However, the alternative, in my opinion, is not to provide the municipal level data at all and to inform the user to aggregate it themselves using the reference data relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. I would rather do it for the user, but print a warning.
I will duplicate this branch to clean up the code and introduce the alternative solution I mentioned, which does not require file-level hacking.
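The re-aggregation idea above can be illustrated with a minimal, language-neutral sketch (written in Python here, although the package itself is in R). It assumes the district-level flows have already been read into (origin, destination, trips) tuples and that a district-to-municipality lookup has been built from the reference tables. The function name, the ids and the trip counts are all hypothetical, and the real column names in relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv are not shown in this thread.

```python
import warnings
from collections import defaultdict


def aggregate_to_municipalities(district_flows, district_to_municipality):
    """Sum district-level OD flows into municipality-level flows.

    district_flows: iterable of (origin_district, destination_district, n_trips)
    district_to_municipality: dict mapping district id -> municipality id
    (both structures are hypothetical stand-ins for the real tables)
    """
    # Tell the user that the municipal data is derived, not downloaded as-is.
    warnings.warn(
        "Municipal flows are re-aggregated from district-level data.",
        UserWarning,
    )
    totals = defaultdict(float)
    for origin, destination, n_trips in district_flows:
        key = (district_to_municipality[origin], district_to_municipality[destination])
        totals[key] += n_trips
    return dict(totals)


# Made-up ids and trip counts, for illustration only.
lookup = {"d1": "m1", "d2": "m1", "d3": "m2"}
flows = [("d1", "d3", 10.0), ("d2", "d3", 5.0), ("d1", "d2", 2.0)]
municipal = aggregate_to_municipalities(flows, lookup)
```

Flows between two districts of the same municipality end up as an internal flow of that municipality, which is one reason to warn the user that the result is derived.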
> I optimized the PNGs to 256 colours to reduce their size: they were taking up almost 1 MB in the package and are now just 300 KB.
How did you do that? Great tip!
300 kb still sounds like a lot btw.
> A better approach is to use the district level data, even though there is more to download, and just to re-aggregate it to MITMA municipalities
Sounds sensible, I do this kind of thing also with UK data.
> Disadvantages:
> * there is more data to download (about 6 times more), as district-level data has more columns which the user may not need
> * a bit more compute is required on the user side, as on-the-fly re-aggregation of data from districts to municipalities has to be done, so workflows with municipalities will be slower.
> However, the alternative, in my opinion, is not to provide the municipal level data at all and to inform the user to aggregate it themselves using the reference data relaciones_distrito_mitma.csv and relaciones_municipio_mitma.csv. I would rather do it for the user, but print a warning.
These disadvantages are no big deal. In general my approach would be "less is more" and "don't try to do too much for the users": reducing the package's functionality will make it easier to maintain long-term. But there is no harm in providing extra functionality for municipal-level data. I suspect most users will use 'district' level data, although some of those districts are bigger than the individual ones, right?
> How did you do that? Great tip!
@Robinlovelace I use this:
pngquant --ext .png --speed 1 --force 256 '$SOURCE_SELECTION_PATHS'
It is best to wrap that into a script that works in a whole folder or on a list of files.
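A minimal sketch of such a wrapper, assuming pngquant is installed and on the PATH. It reuses the flags from the command above; the function names are made up for illustration, and only the top level of the folder is scanned.

```python
import shutil
import subprocess
from pathlib import Path


def build_pngquant_command(png_path, colours=256):
    """Build the pngquant call from the comment above for a single file."""
    return [
        "pngquant",
        "--ext", ".png",   # write the output under the same file name ...
        "--speed", "1",    # slowest setting, best quality
        "--force",         # ... and allow overwriting the original file
        str(colours),
        str(png_path),
    ]


def optimise_folder(folder):
    """Run pngquant on every .png directly inside `folder`; return the files processed."""
    if shutil.which("pngquant") is None:
        raise RuntimeError("pngquant was not found on the PATH")
    processed = []
    for png_path in sorted(Path(folder).glob("*.png")):
        subprocess.run(build_pngquant_command(png_path), check=True)
        processed.append(png_path)
    return processed
```

For example, `optimise_folder("man/figures")` would rewrite the PNGs in that folder in place, so it is worth committing or backing up the originals first.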
> although some of those districts are bigger than the individual ones, right?
All districts are smaller than or equal to municipalities. So we are not losing any information.
Yes, what I mean is: are some of the 'districts' aggregations of a smaller official zoning system for districts, or are the districts in the package an official geographic zoning system in Spain?
@eugenividal may also know.
@Robinlovelace even districts are also sometimes aggregations of official census districts. There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.
> @Robinlovelace even districts are also sometimes aggregations of official census districts. There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.
Makes sense. Do you know of any uses of the package's 'distritos' zones outside of the package? I had the sense that it was a custom zoning system created for the OD data use case, but I'm not sure where it came from.
I agree. Districts are smaller than municipalities, but they do not exactly coincide with census districts, or at least not all of them. Is it maybe related to the cell tower locations? I'm not sure how they get the mobile phone positioning data.
> agree. Districts are smaller than municipalities, but they do not exactly coincide with census districts
Just to be clear, do you agree that the 'distritos' zones are not used anywhere except in this package?
> There is a relations table here: https://opendata-movilidad.mitma.es/relaciones_distrito_mitma.csv I will use it to add to the clean zones.
Thumbs up to that, what do you mean by "add to the clean zones" though?
Great that there's a lookup to smaller zones, that should be mentioned in the docs at least.
I am not 100% sure. I can look at what exactly the 'distritos' are tomorrow.
> Thumbs up to that, what do you mean by "add to the clean zones" though?
In the cleaned-up zone polygons I will add the ids of the actual census districts and municipalities that they correspond to.
> I am not 100% sure. I can look at what exactly the 'distritos' are tomorrow.
They are definitely based on census districts (not tracts), but yes, some of them are larger and consist of several census districts because of data privacy. If smaller units were used, they would not be able to reveal the origin destination data at hourly intervals, because an individual would be identifiable.
> They are definitely based on census districts (not tracts), but yes, some of them are larger and consist of several census districts because of data privacy. If smaller units were used, they would not be able to reveal the origin destination data at hourly intervals, because an individual would be identifiable.
👍 and as I recalled. It's worth stating that, e.g. "There are x,xxx zones in the 'distritos' zoning system provided by the package, compared with x,xxx municipalities in Spain [link] and x,xxx census districts of which the 'distritos' zoning system is composed", worth putting in as an issue? Could be a quick fix if you fancy making more tweaks to the docs @eugenividal as someone with local knowledge.
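The counts for such a sentence could be derived from the relations table. A minimal sketch, assuming the table has been read into (census district id, MITMA district id) pairs; the function name and the ids below are made up, and the real column names in relaciones_distrito_mitma.csv are not shown in this thread.

```python
from collections import defaultdict


def summarise_zoning(relations):
    """Count zones from (census_district_id, mitma_district_id) pairs."""
    members = defaultdict(set)   # MITMA district -> census districts it contains
    census_ids = set()
    for census_id, mitma_id in relations:
        members[mitma_id].add(census_id)
        census_ids.add(census_id)
    return {
        "mitma_districts": len(members),
        "census_districts": len(census_ids),
        # MITMA districts that merge more than one census district
        "aggregated_mitma_districts": sum(1 for ids in members.values() if len(ids) > 1),
    }


# Made-up ids, for illustration only.
example = [("c1", "A"), ("c2", "A"), ("c3", "B")]
summary = summarise_zoning(example)
```

The same grouping also gives, for each 'distritos' zone, the list of census districts it is composed of, which is what would be attached to the cleaned zones.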
I had a look at the MITMA districts. The estudios basicos zoning is explained here:
“The base zoning of the study is made up of census districts and aggregations of these (in order to ensure compliance with current regulations on data protection) for the case of the territory of Spain and NUTS-3 zones for France and Portugal and 1 zone for the rest of foreign countries. This zoning presents a total of 3,743 zones for the national territory and 117 zones for France and Portugal and 1 zone for abroad that covers the rest of the world.
Based on this base zoning, two others are generated, one at the municipal level (including aggregations of municipalities for data protection) and another at the level of large urban areas (GAUs).”
There is this table that shows the relation between all the MITMA zones (districts, municipalities and GAUs) and the INE (Spanish National Statistics Institute) zones.
I will open an issue and briefly explain this in the zones section of the README.
@eugenividal I would suggest not including too many details in the README and creating a separate, detailed cookbook/codebook page instead. I have already started such a page for V1 data (2020-2021) in a new branch; you can see the work in progress here: https://github.com/Robinlovelace/spanishoddata/blob/v1-full-support-based-on-district-level-data-only/vignettes/v1-2020-2021-mitma-data-codebook.qmd
Ideally, I would have such a page for both v1 and v2, and then extract only the key info into the README.md/qmd.
Perfect! Thanks @e-kotov. I will include the zoning clarification on this page for the time being.
👍 to all this, many thanks for stepping-up on this Eugeni!
@Robinlovelace Please don't review it just yet. I'm only opening the PR to see the R CMD check on Ubuntu. I will add more commits to this one before it is truly, as the name suggests, "V1 full support".
However, here's what's new:
spod_get(), replaced with the new one that can fetch any data and connect to it as a lazy table, see docs. It wraps around the functions tailored to individual data sets (... spod_duckdb_trips_per_person()), but currently I turned off the trips per person option for spod_get(). We have a problem that I need to look deeper into, but basically there is a mismatch of IDs between the tables and the zones file for municipalities... and it is complicated. I started fixing it in spod_clean_zones_v1() by manually reassigning mismatches, but it is more complicated than that. I will file it in https://github.com/e-kotov/mitma-data-issues a bit later; perhaps we will have to raise it with the data provider. More details to come.
For now, this can be tested if you want, but don't merge and delete just yet.
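To surface the kind of ID mismatch mentioned above, a simple set comparison is often enough as a first check. This is a generic sketch, not the logic in spod_clean_zones_v1(); the function name and the ids are hypothetical.

```python
def find_id_mismatches(table_ids, zone_ids):
    """Report ids that appear in the data tables but not in the zones file, and vice versa."""
    table_ids, zone_ids = set(table_ids), set(zone_ids)
    return {
        "missing_from_zones": sorted(table_ids - zone_ids),
        "missing_from_tables": sorted(zone_ids - table_ids),
    }


# Made-up municipality ids, for illustration only.
report = find_id_mismatches(["m1", "m2", "m3"], ["m2", "m3", "m4"])
```

Two empty lists would mean the ids in the tables and in the zones file agree; anything else is a candidate for the issue tracker.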