r-tmap / tmap-book

A book on elegant and informative maps with the tmap package
https://r-tmap.github.io

Datasets #1

Open Nowosad opened 4 years ago

Nowosad commented 4 years ago

At least three different scales:

Each level should have a complete set of possible spatial object types with interesting attributes.

At least one of the scales should also have some temporal variables to showcase tmap's animation capabilities.

mtennekes commented 4 years ago

Yes. We have to reduce the number of datasets in a smart way, since 3x5=15 is too much in my opinion.

I think we should aim for 3 topics/applications, one for each scale. Each topic is then covered with as few datasets as possible (i.e., such that it covers our needs).

Global

Regional / country: Have to look for suitable data. The only option I currently have is Dutch commuting data. It contains the numbers of commuters between municipalities (400 in total), by mode of transport.

Local: We can analyze a satellite image of air pollution, and use OSM vector data as a reference, for instance plotting main (rail)roads and important buildings such as schools. Satellite images from different moments in time would also be awesome (e.g. pre-, during, and post-COVID).

Although it is not the focus of the book, I think it's nice to have three different hot topics, e.g. health (global), transport (country), and climate (local).

Nowosad commented 4 years ago

@mtennekes 15 datasets sounds like a lot, but I tried to count (from memory) the datasets used in geocompr, and there we used more than 20 datasets in the first eight chapters. However, I also think that adding datasets and modifying them (e.g. adding/removing variables, changing projections, etc.) is an incremental process. We will see what is missing while writing the book, and then we can add it. We just need a starting point for now.

I like the idea of three different topics a lot. It is great!

A few remarks:

  1. Feel free to start downloading the data (especially the data at the global and regional levels).
  2. Do you have any suggestions for the location of local data?
  3. For the local level, we can also add some categorical rasters (land cover/land use).

@zross what do you think?

zross commented 4 years ago

A couple of thoughts:

  1. In my experience, coming up with an "analysis" to do makes things a bit more interesting and real-world. Simply putting bubble points on a global map won't, I think, be as compelling.

  2. I think starting with a topic would be the way I would prefer to do it, but practically speaking, I think we may need to pick at least one dataset by location -- picking a location where pretty much any kind of dataset we can envision would be available. This way, if we decide we need to include a land use layer, a tree layer, a hospital layer -- whatever -- we can be confident that the data would be available. NYC, London, etc.

  3. I wonder if we could come up with an unexpected place/topic. For example, if we did something with Africa, instead of looking at climate or poverty or something like that, we could pick UNESCO heritage sites, beautiful parks, or first archaeological finds. I don't know. For the workshop I did at the RStudio conference, I partly used data on burrito restaurants in San Francisco from {yelpr}: road density near the restaurants, number of restaurants per neighborhood. That kind of thing, and people enjoyed it.

Most of my own work and experience is with the US, and we absolutely need to pick a less-covered area as well, but in terms of what I know:

  1. My own expertise is air quality. I could easily come up with air quality-related data for any place, at any resolution. I'm currently working on a project on the global burden of air quality and have tons of useful global data from the Institute for Health Metrics and Evaluation.

  2. My own expertise is also NYC. As you might guess, NYC has a ton of amazing and interesting data. For my Datacamp course I used a census of trees, which is a nice dataset.

  3. My wife works at a famous bird laboratory, and they have amazing data. I know someone at that lab who could probably help us get interesting data for anywhere in the world.

Nowosad commented 4 years ago

Hi @zross, great points. How about we split the work here?

What do you think about that?

mtennekes commented 4 years ago

Agree with both of you.

I imagine that the bird datasets Zev mentioned will be very interesting, and also something completely different (for most people, at least), while still being relevant. (I mean, the burrito dataset would be fun for sure, but I like topics that have impact.)

I will prepare the Dutch commuting data. Not sure if it will work though, since it needs a lot of processing to turn the raw data into a useful map. For this purpose, I've started a new (small) package to handle this kind of OD (origin-destination) data. Maybe I can use an already processed version of the data.
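The processing step could be sketched roughly as follows. This is not the package mentioned above (which is unnamed here); it uses the existing stplanr::od2line() helper instead, and all data object and column names are hypothetical:

```r
# Sketch (hypothetical data): turn an origin-destination commuting table
# into sf lines that tmap can draw.
# Assumes `od` is a data frame whose first two columns are origin and
# destination municipality codes, with a `commuters` count and a `mode`
# column, and `munis` is an sf polygon layer of municipalities whose
# first column holds the matching codes.
library(stplanr)
library(tmap)

# One straight line per origin-destination pair, carrying the OD attributes
flows = stplanr::od2line(flow = od, zones = munis)

tm_shape(munis) +
  tm_borders() +
tm_shape(flows) +
  tm_lines(lwd = "commuters", scale = 10, col = "mode")
```

Line width proportional to commuter counts, colored by mode, is one plausible way to show such flows; the real dataset may need filtering (e.g. dropping within-municipality trips) first.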

Air quality data is good to have. @zross I don't have a preference for a location for local scale: NYC is fine with me!

Nowosad commented 4 years ago

Hi @mtennekes and @zross,

I have started working on preparing global data using world borders from NaturalEarth and additional attributes from Gapminder. You can see it at https://github.com/r-tmap/tmap-data.

Please take a look at the code at https://github.com/r-tmap/tmap-data/blob/master/R/01-prepare-world.R.

My comments and questions:

  1. I have slightly modified the NaturalEarth data to be more consistent with Gapminder. Let me know what you think about it.
  2. I added several attributes, including (a) World Bank regions, (b) World Bank income groups, (c) total population, (d) CO2 emissions, (e) GDP per capita, (f) life expectancy, (g) Corruption Perception Index, (h) democracy score, (i) HDI, (j) energy use, and (k) literacy rate. What do you think about this list? Should I add or remove something?
  3. We could also create some spatiotemporal variables (one of the above attributes for a few years) to present some tmap capabilities, such as making animations.
  4. What should be the map projection used for the global dataset?
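The animation idea in point 3 could look like this sketch, assuming a long-format sf object `world_ts` with one row per country-year and hypothetical columns `life_exp` and `year` (tmap v3 API):

```r
# Sketch (hypothetical data): one choropleth frame per year, combined
# into a GIF. `world_ts` is an sf object in long format with a numeric
# attribute `life_exp` and a `year` column.
library(tmap)

anim = tm_shape(world_ts) +
  tm_polygons("life_exp") +
  tm_facets(along = "year", free.coords = FALSE)

# Renders the frames and stitches them together
# (needs ImageMagick or the gifski package installed).
tmap_animation(anim, filename = "life_exp.gif", delay = 50)
```

tm_facets(along = ...) is what turns the faceting variable into animation frames instead of small multiples.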

Overall, I also think that we can (and will) modify and improve datasets while writing the book, but it will be nice to have an agreed alpha version.

Best, J.

mtennekes commented 4 years ago

Great work!

  1. Do you mean the assignment of subcountries (Puerto Rico -> USA, etc.)? Good idea. We can fine-tune it later.
  2. I find two of the added variables very interesting: corruption and democracy (see also below). Generally speaking, the other variables don't add much new compared to tmap::World and spData::world. And for energy use and CO2 emissions, I wouldn't use country borders, but a more detailed spatial resolution that also shows metropolitan areas.
  3. Yes, that would be awesome. If I have time, I can also take a look.
  4. Good question. A projection with the equal-area property is almost a must, especially for choropleths. I looked around, and the relatively new "Equal Earth" projection seems the way to go. However, I got some warnings when applying sf::st_transform. I noticed that there is little difference with my old favourite, Eckert IV, which I used for tmap::World:
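The comparison above could be sketched like this, assuming `world_all` is the object produced by the 01-prepare-world.R script:

```r
# Sketch: both candidates are equal-area projections, so either is safe
# for choropleths; the visual difference between them is small.
library(sf)

world_eqearth = st_transform(world_all, crs = "+proj=eqearth") # Equal Earth
world_eck4    = st_transform(world_all, crs = "+proj=eck4")    # Eckert IV

# Reprojection can leave some polygons invalid (a likely source of the
# warnings); st_make_valid() repairs them.
world_eqearth = st_make_valid(world_eqearth)
```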

I played around with this dataset, and created a composite indicator:

world_all2 = world_all %>% 
  sf::st_transform(crs = "+proj=eck4") %>% 
  sf::st_make_valid() %>% 
  mutate(demo_corr = democracy_score * 2.5 + 25 + corruption_perception_index / 2,
         demo_corr_rank = rank(-demo_corr, ties.method = "min"))

tmap_options(projection = 0, basemaps = NULL) # github version of tmap needed

tm_shape(world_all2) + 
  tm_polygons("demo_corr", style = "cont", 
              popup.vars = c("democracy_score", "corruption_perception_index",
                             "demo_corr", "demo_corr_rank"),
              id = "name")

[Screenshot of the resulting map, 2020-06-03]

Nowosad commented 4 years ago

Great. I have updated the code a little bit yesterday. I think it is a good starting point for the world data.