Analysis-ready dataset of population, employment, and travel times in Toronto

rfordatascience / tidytuesday

Official repo for the #tidytuesday project

Creative Commons Zero v1.0 Universal

Analysis-ready dataset of population, employment, and travel times in Toronto #702

Open paezha opened 2 months ago

paezha commented 2 months ago

Dataset name: TTS2016R
Dataset download URL: https://soukhova.github.io/TTS2016R/
Article that demonstrates the dataset: https://doi.org/10.1177/23998083241242844
Cleaning script: The data are analysis-ready.
Data dictionary: All variables are documented in the package.

jonthegeek commented 2 months ago

@paezha The DOI is the article from #701. I see in the package that it's supposed to be https://doi.org/10.1177/23998083221146781, though. Thanks!

lgibson7 commented 3 weeks ago

[x] I can download the dataset from the link provided.
[ ] The dataset will (probably) be less than 50MB when saved as a tidy CSV.
[ ] There is a link to an article that has something to do with the dataset.
[x] I can imagine a data visualization related to this dataset.
[ ] This dataset has not already been used in TidyTuesday.
[ ] ALT text is provided for all (both) images.
[x] There is a data dictionary describing the columns of the dataset.
[x] The TidyTuesday maintainers are unlikely to get sued for using the dataset.

lgibson7 commented 3 weeks ago

Hi @paezha. Thanks for submitting this issue. Would you be willing to submit the data set through a PR? You can find the instructions on how to do so here.

paezha commented 3 weeks ago

Hi @lgibson7 - Happy to submit the dataset. It is already an R package, though, so I am unsure how many, if any, of the steps outlined here are needed. For example, the data files are already clean and saved in native R format.

jonthegeek commented 3 weeks ago

@paezha Regardless of the source, we share the datasets as one or more CSVs. When the data comes from a package, the cleaning script will likely be very short, along the lines of this:

# Clean data from pkgname (https://pkgurl)
toronto_population <- pkgname::toronto_population
toronto_employment <- pkgname::toronto_employment
toronto_travel <- pkgname::toronto_travel

It's very similar to situations where the data is cleanly available as CSVs, such as the recent American Idol dataset.

The cleaning might also be more complicated, for example to take a subset of the data or otherwise make it more CSV-friendly, as I did to share our own data from our ttmeta package.
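
As a rough illustration of that more involved case, here is a minimal sketch in base R. It assumes a small stand-in data frame rather than the real package objects: the `toronto_travel` name, its columns, and the "keep rows with at least one trip" rule are all hypothetical, chosen only to show the subset-then-save pattern and a size check against the 50MB guideline from the checklist above.

```r
# Hypothetical stand-in for a dataset taken from a package. In a real
# cleaning script this would be something like pkgname::toronto_travel
# (placeholder names, not actual objects from any package).
toronto_travel <- data.frame(
  origin = c("1", "1", "2", "2"),
  destination = c("1", "2", "1", "2"),
  trips = c(10, 0, 3, 7),
  travel_time = c(5.5, 12.0, 11.5, 4.0),
  stringsAsFactors = FALSE
)

# Assumed subsetting rule: keep only origin-destination pairs with at
# least one trip, and only the columns that will be shared.
toronto_travel_small <- subset(
  toronto_travel,
  trips > 0,
  select = c(origin, destination, trips, travel_time)
)

# Save as CSV (to a temporary directory here), without row names.
csv_path <- file.path(tempdir(), "toronto_travel.csv")
write.csv(toronto_travel_small, csv_path, row.names = FALSE)

# Read the file back to confirm it round-trips; the ID columns are
# forced to character so they are not converted to integers.
check <- read.csv(
  csv_path,
  colClasses = c(origin = "character", destination = "character")
)

# File size in MB, for comparison with the 50MB guideline.
size_mb <- file.size(csv_path) / 1024^2
```

Reading the CSV back is optional, but it is a cheap way to catch columns that do not survive the trip to plain text, such as IDs that lose their type.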

I hope that helps explain the process!