rOpenSpain / spanishoddata

Access national high-quality and open-access datasets on movement patterns derived from mobile telephone datasets
https://ropenspain.github.io/spanishoddata/

Open data Spain: rOpenSpain #14

Closed llrs closed 2 months ago

llrs commented 3 months ago

Nice package!

Maybe it would be interesting for this package to work with other tools that analyze or plot Spanish open data. There is a GitHub organization collecting these: @ropenspain. The mapSpain package might be useful, and there are other packages that might complement yours.

If you want, you can relocate the repository there (you'll keep full powers).

Update since moving it

Robinlovelace commented 3 months ago

That sounds really good @llrs. I think @e-kotov mentioned rOpenSpain and the possibility of moving this repo there, when ready. Would that be possible/wise?

llrs commented 3 months ago

We are just some volunteers writing programs to analyze open data of Spain. There is no review; we only ask that you use the organization's pkgdown template and fill in a one-line summary for the webpage. The process is simple and very informal: whenever you are ready, let us know (ping me, for example) and we will add you to the organization so you can transfer the repo. Then we assign the permissions back to the original authors.

I think it is nice to have some exposure in a cohort. We use r-universe to provide a repo with all the binaries, but you can (and are encouraged to) publish it on CRAN. In terms of reach/impact, I have redirected some data journalists to that website to use the packages to plot or analyze other available data.

But that depends on your plans for the package. We have a slack, which is mostly silent except to greet new packages/people and when someone wants to ask for guidance or coordinate between people. In case of doubt, I can send an invitation for the slack and we can discuss there any doubt you might have.

Robinlovelace commented 3 months ago

Many thanks for the info @llrs, we'll give it some thought. rOpenSci is another good option; we can cross-reference at the very least. More when the package is more evolved.

Robinlovelace commented 2 months ago

Heads-up @llrs: the package is now at a decent level of readiness, any comments from you/colleagues welcome. I'm still not sure where best to 'put' the package, lots of options...

llrs commented 2 months ago

I've shared the package in our slack to get some feedback. These days many people might be on holiday and there might not be feedback until later. It looks great, I might use it to check some cycling paths near Barcelona.

e-kotov commented 2 months ago

@llrs ok, actually, now is the time to test, as we have just rolled out the biggest update with breaking changes. All the details are on the package website in articles: https://robinlovelace.github.io/spanishoddata/articles/v1-2020-2021-mitma-data-codebook.html and https://robinlovelace.github.io/spanishoddata/articles/convert.html with more advanced use with other packages here https://robinlovelace.github.io/spanishoddata/articles/uses.html

llrs commented 2 months ago

Thanks for the notification @e-kotov. It is great to see this data being used (although I profoundly dislike the data origin and the method used to obtain it).

I have checked the first article and some functions and I have some comments:

In spod_get, the type argument has a bold message at the end of its documentation. I think it needs a link to the article. Also in spod_get, for the zones argument I see you can use "large_urban_areas" (or "lau" ...); shouldn't that be "lua"?

spod_get_data_dir() doesn't use tools::R_user_dir. That folder is more persistent and in line with R/CRAN usage (using tempdir() is also fine, but it will only persist until the user restarts the computer).

I see that in max_n_cpu you use parallelly, nice! I initially misread the documentation and was about to recommend changing the default as per these examples. Perhaps it could be clarified.

There is a typo right below Spatial data with zoning boundaries: " at two geographic levels: Distrtics and Municipalities" Distrtics should be Districts.

I think it would be helpful if the articles also showed how to arrive at something like Figure 2, or linked to how to do that in the use case article.

On a more general note, you might be interested in duckplyr.

e-kotov commented 2 months ago

@llrs thanks so much for taking such a close look! This is very much appreciated!

I profoundly dislike the data origin and method used to obtain it

I see zero problems with that...

I see spod_get in type argument there is a bold message at the end of the documentation. I think it needs a link to the article.

Good catch, thanks! There are a few gaps in the docs, as we've been through a super rapid development cycle throughout the week. Will fix this soon.

UPDATE: fixed in the latest commit in main.

In spod_get, for the zones argument I see you can use "large_urban_areas" (or "lau" ...); shouldn't that be "lua"?

You are right! Will fix.

UPDATE: fixed in the latest commit in main.

spod_get_data_dir() doesn't use tools::R_user_dir

Are you talking about the default path? By default, CRAN seems to prohibit writing anywhere but into temp (though I may be wrong about that). Or are you suggesting we download the files into some folder defined by tools::R_user_dir instead of setting the directory with an env var or package option? If so, I would be very hesitant to write so much data into user_dir by default. I strongly think that with this size of required download, the user should be very explicit about where they want to dump the data.

I see that in max_n_cpu you use parallelly, nice! I initially misread the documentation and was about to recommend changing the default as per these examples. Perhaps it could be clarified.

I used parallelly::availableCores() because in my experience it is more accurate. E.g. when you are working on an HPC cluster node, parallel::detectCores() detects all physical cores of the node, while parallelly::availableCores() correctly detects only the cores allocated to the job.
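
To make the difference concrete, here is a minimal R sketch. It is not the package's actual code: the variable names are made up for illustration, and it assumes the parallelly package is installed (falling back to base R otherwise). It derives an "available cores minus 1" default that never drops below 1.

```r
# Illustrative only: compare the two core counts and derive a default.
n_detected <- parallel::detectCores()  # cores on the machine; can be NA

if (requireNamespace("parallelly", quietly = TRUE)) {
  # Honours job scheduler environment variables (e.g. Slurm) and R options
  # such as mc.cores, so it reflects what the current job may actually use.
  n_available <- unname(parallelly::availableCores())
} else {
  # Fallback for this sketch only, when parallelly is not installed
  n_available <- if (is.na(n_detected)) 1L else n_detected
}

# "Available cores minus 1", but always at least one worker
max_n_cpu <- max(1L, as.integer(n_available) - 1L)

message("detected: ", n_detected,
        ", available: ", n_available,
        ", default max_n_cpu: ", max_n_cpu)
```

On a laptop the two numbers will usually match; on a shared cluster node the second is typically the smaller one.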

I think it would be helpful if the articles also showed how to arrive at something like Figure 2, or linked to how to do that in the use case article.

Eventually it will. The package we used for vis is already in the review by CRAN. The Use cases article is reproducible, even though one does need a lot of extra packages some of which are still in dev versions.

On a more general note, you might be interested in duckplyr.

Have you used it? I've read @josiahparry's post about it a few months ago. To quote: "The R package duckplyr is a drop in replacement for dplyr. duckplyr operates only on data.frame objects and, as of today, only works with in memory data. This means it is limited to the size of your machine’s RAM."

So as far as I got it, it makes no sense to use it on DuckDB that I am creating, because duckplyr is kind of like data.table, as it creates a DuckDB in-memory, but we don't really need that, as we are already creating a DuckDB ourselves. So one does not really seem to need duckplyr if they have converted the data to DuckDB... I may be wrong. Let me know if I'm misinterpreting and misunderstanding duckplyr.

llrs commented 2 months ago

I see zero problems with that...

I don't want to make a big deal of this, but the data was obtained without the consent of the people, using a state of alarm (estado de alarma) that was later ruled unconstitutional. I work in a place where user consent is very important, especially for identifiable data, even if it is released in a form that doesn't allow identification of people.

I see spod_get in type argument there is a bold message at the end of the documentation. I think it needs a link to the article.

Good catch, thanks! There are a few gaps in the docs, as we've been through a super rapid development cycle throughout the week. Will fix this soon.

UPDATE: fixed in the latest commit in main.

In spod_get, for the zones argument I see you can use "large_urban_areas" (or "lau" ...); shouldn't that be "lua"?

You are right! Will fix.

UPDATE: fixed in the latest commit in main.

Great to see this fixed.

spod_get_data_dir() doesn't use tools::R_user_dir

Are you talking about the default path? By default, CRAN seems to prohibit writing anywhere but into temp (though I may be wrong about that). Or are you suggesting we download the files into some folder defined by tools::R_user_dir instead of setting the directory with an env var or package option? If so, I would be very hesitant to write so much data into user_dir by default. I strongly think that with this size of required download, the user should be very explicit about where they want to dump the data.

See CRAN's policy:

For R version 4.0 or later (hence a version dependency is required or only conditional use is possible), packages may store user-specific data, configuration and cache files in their respective user directories obtained from tools::R_user_dir(), provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material).

I agree that the user should decide where to store the data, but using the temp folder isn't a great solution in my opinion. If one starts working on something, then restarts the computer (e.g. a mandated restart due to a Windows update) and goes back to it, they would need to download the data again...
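
For reference, a minimal sketch of what the policy-compliant location looks like next to the temporary one. This is illustrative only: it assumes R >= 4.0, uses "spanishoddata" purely as the package name for the path, and only computes paths without creating or writing anything.

```r
# Illustrative only: compare a persistent per-user cache path with the
# per-session temporary directory. Nothing is created on disk here.
cache_dir <- tools::R_user_dir("spanishoddata", which = "cache")
session_dir <- tempdir()

print(cache_dir)    # stable across R sessions and reboots, specific to the user
print(session_dir)  # removed by R when the session ends
```

tools::R_user_dir() also accepts which = "data" and which = "config" for the other two kinds of user directory mentioned in the policy.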

I see that in max_n_cpu you use parallelly, nice! I initially misread the documentation and was about to recommend changing the default as per these examples. Perhaps it could be clarified.

I used parallelly::availableCores() because in my experience it is more accurate. E.g. when you are working on an HPC cluster node, parallel::detectCores() detects all physical cores of the node, while parallelly::availableCores() correctly detects only the cores allocated to the job.

I wasn't complaining about your choice. I was complimenting you for using it, but I think "Defaults to the number of available cores minus 1." could explicitly mention that it takes the queues of common job schedulers into account. Sorry, I didn't express myself more clearly.

I think it would be helpful if the articles also showed how to arrive at something like Figure 2, or linked to how to do that in the use case article.

Eventually it will. The package we used for vis is already in the review by CRAN. The Use cases article is reproducible, even though one does need a lot of extra packages some of which are still in dev versions.

Great! Looking forward to it!

On a more general note, you might be interested in duckplyr.

Have you used it? I've read @JosiahParry's post about it a few months ago. To quote: "The R package duckplyr is a drop in replacement for dplyr. duckplyr operates only on data.frame objects and, as of today, only works with in memory data. This means it is limited to the size of your machine’s RAM."

So as far as I got it, it makes no sense to use it on DuckDB that I am creating, because duckplyr is kind of like data.table, as it creates a DuckDB in-memory, but we don't really need that, as we are already creating a DuckDB ourselves. So one does not really seem to need duckplyr if they have converted the data to DuckDB... I may be wrong. Let me know if I'm misinterpreting and misunderstanding duckplyr.

No, you got it right and I got it wrong. I thought that duckplyr would allow operating on larger-than-RAM data by creating the databases. Apologies for the noise.

Robinlovelace commented 2 months ago

I profoundly dislike the data origin and method used to obtain it

I see zero problems with that...

I don't want to make a big deal of this, but the data was obtained without the consent of the people using a state of alarm (estado de alarma) that was later ruled as unconstitutional.

Just to pick up on this as someone who has seen this conversation before in the UK and offer a perspective. These datasets may already be collected without consent at scale by multinational corporations worldwide. This may have been happening for many years, without people's consent and usually without their knowledge. Some CDR data owners may already sell their datasets commercially. I sense your dislike is for the existing status of CDR data uses, transfers and sales, something I and many would share.

What is different about the datasets we're talking about here, and other anonymised and aggregated datasets made available by public bodies operating for the public interest, is that they are being made available not just to commercial interests or the "highest bidder" but to everyone. Explicitly for research and the public interest.

Taking a step back and considering the wider landscape of CDR data usage and for-profit sales, this initiative is a step forward in ethical terms, aligned with the paper "Toward an Ethically Founded Framework for the Use of Mobile Phone Call Detail Records in Health Research". By framing these datasets as a public asset for research and more evidence-based policies, for example to support policies that will speed-up decarbonisation of transport systems (my field of research), we can increase the value that the public gains from them and possibly mitigate potential harms already being done by the status quo.

I think there is a wider literature and debate around data ethics, I'm not an expert in it, but great to raise the ethical dimension which is a strength of this dataset compared with current practices in many countries, as far as I'm aware. I'm not aware of the "unconstitutional" aspect of your comment. Keen to learn more about this and other factors that should be considered in relation to ethical dimensions of data access and use.

e-kotov commented 2 months ago

@llrs

"Defaults to the number of available cores minus 1." could explicitly mention that it takes the queues of common job schedulers into account. Sorry, I didn't express myself more clearly.

Got it! And I have now had time to read the article. Basically they make the same argument that I arrived at through experience. So I agree. I will extend the description along the lines you suggested.

Apologies for the noise.

No worries! It's a learning process for all of us.

I don't want to make a big deal of this, but the data was obtained without the consent of the people, using a state of alarm (estado de alarma) that was later ruled unconstitutional. I work in a place where user consent is very important, especially for identifiable data, even if it is released in a form that doesn't allow identification of people.

First, same as Robin, I also think that these datasets are ok to publish, as this data is absolutely 100% circulating anyway between various parties behind the scenes, without the general public even knowing. At least this way the data is shared with the public.

There are ethical ways to collect this data, all of which involve processing and anonymising at the source. This is something Nommon and Positium are doing in the https://cros.ec.europa.eu/multi-mno-project project to make mobile data statistics part of regular official statistics, and a similar approach is implemented by https://github.com/Flowminder with open-source tools for many mobile phone operators in countries outside the EU/EEA.

Enchufa2 commented 2 months ago

Hi all, maintainer of @rOpenSpain here. :wave: This looks amazing, thanks for putting such a useful package together! @rOpenSpain's mission is to act as a hub for packages that in some form enable the use of Spanish data, for visibility and accessibility. So if you wish to host it there, we would be more than happy to have you! Just let me know and we'll make it happen. But please don't feel obliged in any way.

Some comments/feedback:

I don't want to make a big deal of this, but the data was obtained without the consent of the people using a state of alarm (estado de alarma) that was later ruled as unconstitutional.

@llrs I think you are confusing some things. On the one hand, with respect to the confinement, 1) the court did not dispute the need to take the measures that were taken, but the legal instruments to do so. In other words, the estado de alarma was not enough to limit the mobility in the way they did, but they should have issued an estado de excepción instead. But anyway, 2) this data collection has nothing to do with the COVID-19 confinements and the estado de alarma, so I'm not sure why you think it is. In fact, the initial pilot was from 2017 if I'm not mistaken.

On the other hand, 3) these data have a public interest, and no personal information is published, and 4) GDPR doesn't require consent for data processing if it has a public interest. It may seem that GDPR is all about consent, but consent is just one of the legal bases that enable data processing.

Are you talking about the default path? By default, CRAN seems to prohibit writing anywhere but into temp (though I may be wrong about that, see https://github.com/e-kotov/rJavaEnv/issues/45). Or are you suggesting we download the files into some folder defined by tools::R_user_dir instead of setting the directory with an env var or package option? If so, I would be very hesitant to write so much data into user_dir by default. I strongly think that with this size of required download, the user should be very explicit about where they want to dump the data.

@e-kotov tools::R_user_dir() can be used to "store user-specific data, configuration and cache files [...], provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material)." But I agree that this is a lot of data and so populating the user dir inadvertently is not great. It's better if the user actively manages the path. E.g. this is what the CatastRo package does: uses the tempdir by default and allows the user to set a different, permanent path.
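
The "tempdir by default, permanent path on request" pattern can be sketched roughly as follows. This is a hypothetical illustration, not the code of CatastRo or spanishoddata: the function name resolve_data_dir() and the environment variable EXAMPLE_DATA_DIR are invented for the example.

```r
# Hypothetical helper: resolve_data_dir() and EXAMPLE_DATA_DIR are made up
# for this sketch and are not part of spanishoddata or CatastRo.
resolve_data_dir <- function(env_var = "EXAMPLE_DATA_DIR") {
  custom <- Sys.getenv(env_var, unset = "")
  if (nzchar(custom)) {
    # The user explicitly opted in to a permanent location: create it if needed
    dir.create(custom, recursive = TRUE, showWarnings = FALSE)
    return(custom)
  }
  # Default: the session temporary directory, so nothing persists by accident
  tempdir()
}

# Without the variable set, data would go to the session temp directory
Sys.unsetenv("EXAMPLE_DATA_DIR")
default_dir <- resolve_data_dir()

# After an explicit opt-in, the user-chosen path is used instead
chosen <- file.path(tempdir(), "example_data")
Sys.setenv(EXAMPLE_DATA_DIR = chosen)
custom_dir <- resolve_data_dir()
Sys.unsetenv("EXAMPLE_DATA_DIR")
```

The point of the design is that a large download only lands in a persistent location after the user has named that location themselves.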

e-kotov commented 2 months ago

@e-kotov tools::R_user_dir() can be used to "store user-specific data, configuration and cache files [...], provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material)." But I agree that this is a lot of data and so populating the user dir inadvertently is not great. It's better if the user actively manages the path. E.g. this is what the CatastRo package does: uses the tempdir by default and allows the user to set a different, permanent path.

good to know!

In fact, the initial pilot was from 2017 if I'm not mistaken.

yep, that was province level data.

llrs commented 2 months ago

Thanks all for your input about the data ethics and legislation in this context (I'll read more about it). I'm glad that Iñaki, who is leading the organization, popped in. On the Slack there was also some interest in the package.

Robinlovelace commented 2 months ago

Second that, many thanks @Enchufa2, great there's interest!

Robinlovelace commented 2 months ago

Hi @llrs and @Enchufa2, after conversation with @e-kotov: we'd like to take you up on the offer of migrating this repo if it still stands and assuming we'll still have admin rights over it as with rOpenSci. Totally makes sense to have this codebase, developed for community benefit based on Spanish data, with most relevance for people living in Spain, to live in a Spanish community benefit org.

Enchufa2 commented 2 months ago

Great! Of course it still stands, and of course you'll be admin of the repo. As soon as GitHub notifies me that you did the repo transfer, I'll access the settings to make you admin (because this is not by default, which always bugs me, but it is what it is...).

Enchufa2 commented 2 months ago

I just sent invites to both of you.

Robinlovelace commented 2 months ago

Thanks for quick reply! Will transfer it right now.

Robinlovelace commented 2 months ago

Move is done! All that's left to do is change the URLs, as per edited original post: https://github.com/rOpenSpain/spanishoddata/issues/14

Anything else we should change? Thanks for quick process 🙏

Robinlovelace commented 2 months ago

Need to update this in the GitHub profile, I don't think I'm admin yet but can change when I am (or Egor): https://ropenspain.github.io/spanishoddata/

e-kotov commented 2 months ago

@Robinlovelace

Need to update this in the GitHub profile, I don't think I'm admin yet but can change when I am (or Egor): https://ropenspain.github.io/spanishoddata/

do you think we'll be able to forward https://robinlovelace.github.io/spanishoddata/ to https://ropenspain.github.io/spanishoddata/ ?

I guess if you host a new repo at the old address with gh-pages and html like this:

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="refresh" content="0; url=https://ropenspain.github.io/spanishoddata/">
    <title>Page has moved</title>
</head>
<body>
    <p>If you are not redirected, <a href="https://ropenspain.github.io/spanishoddata/">click here</a>.</p>
</body>
</html>

It should work

Enchufa2 commented 2 months ago

@Robinlovelace You are admin now. Not sure if @e-kotov was too, but now you have full powers to change that too.

llrs commented 2 months ago

We usually use a specific pkgdown template: https://github.com/rOpenSpain/rostemplate. See the package for how to use it.

The group also uses r-universe to provide binaries, if you want you can add yourself to the list: https://github.com/rOpenSpain/rOpenSpain.r-universe.dev/blob/master/packages.json This also makes it more discoverable.

And I think you are not yet on the slack (if you want to be): @Enchufa2 Are you the one that sends invitations?

Enchufa2 commented 2 months ago

Yeah, please let me know if you want to be added to our Slack workspace too and I'll be happy to invite you. :)

e-kotov commented 2 months ago

@Enchufa2 sure, add me in.

I still do not see the repo settings button or an option to change the url to website on the right.

e-kotov commented 2 months ago

I still do not see the repo settings button or an option to change the url to website on the right.

now i do

Enchufa2 commented 2 months ago

now i do

Yes, now you are admin too. Also I sent you a Slack invite to your maintainer email.

Robinlovelace commented 2 months ago

👍 I'd like to join you all on the Slack (I like practising Spanish, although I don't have the ñ on this keyboard 😆)

Enchufa2 commented 2 months ago

👍 I'd like to join you all on the Slack (I like practising Spanish, although I don't have the ñ on this keyboard 😆)

:D :D :D Invited too! ;-)

Robinlovelace commented 2 months ago

Closed with https://github.com/rOpenSpain/spanishoddata/pull/74#event-14149830143 I think 👍

e-kotov commented 2 months ago

We usually use a specific pkgdown template: https://github.com/rOpenSpain/rostemplate. See the package for how to use it.

@llrs will check that soon, great looking template!

e-kotov commented 2 months ago

@llrs will check that soon, great looking template!

@llrs done https://ropenspain.github.io/spanishoddata/

e-kotov commented 2 months ago

@Robinlovelace

Need to update this in the GitHub profile, I don't think I'm admin yet but can change when I am (or Egor): https://ropenspain.github.io/spanishoddata/

do you think we'll be able to forward https://robinlovelace.github.io/spanishoddata/ to https://ropenspain.github.io/spanishoddata/ ?

I guess if you host a new repo at the old address with gh-pages and html like this:

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="refresh" content="0; url=https://ropenspain.github.io/spanishoddata/">
    <title>Page has moved</title>
</head>
<body>
    <p>If you are not redirected, <a href="https://ropenspain.github.io/spanishoddata/">click here</a>.</p>
</body>
</html>

It should work

@Robinlovelace actually, maybe it is better to not do it, because if you recreate your https://github.com/Robinlovelace/spanishoddata/ repo, it will not redirect to the new location at rOpenSpain ... I wish GitHub made it easier for gh-pages redirects.

Robinlovelace commented 2 months ago

Yeah and happily https://www.google.com/search?client=ubuntu-sn&channel=fs&q=spanishoddata hasn't picked-up on it yet, we should change that!

e-kotov commented 2 months ago

change that!

@Robinlovelace mmm, change what? Sorry, did not get it. I would keep things as they are now. We will circulate the new link to the gh pages website in future social media posts and the old posts will be quickly forgotten anyway, so I guess we can just do nothing. In this case we will at least keep the redirect from your old repo to the new location here in rOpenSpain.

Robinlovelace commented 2 months ago

change that!

@Robinlovelace mmm, change what?

Change the fact that there are no links to our package in the link.

Robinlovelace commented 2 months ago

I.e. by promoting it. I'm talking about changing the quantity and quality of links to the package, not the package itself.

e-kotov commented 2 months ago

@Robinlovelace got it. I updated the links in my original posts on LinkedIn and Mastodon, but not on X/Twitter, because, well, it's X/Twitter :) For promotion next week I would use our new visualisations from the flows vignettes.

Robinlovelace commented 2 months ago

Good point about the tweets, I forgot about those!