uncoast-unconf / uu-2019

Materials for the 2019 uncoast unconference

Improving workflow for integrity checks #26

Open charlesminshew opened 5 years ago

charlesminshew commented 5 years ago

I'm not sure how you guys go about this, but one of the pain points for journalists working with data is conducting integrity checks. Whether the dataset has 10 columns or 150, our standard practice is to group by each column and produce a table of either all results or the most common results.

We typically do this in SQL:

```sql
SELECT state, COUNT(*)
FROM bridges
GROUP BY state
ORDER BY COUNT(*) DESC;
```

or in R:

```r
bridges %>%
  group_by(state) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
```

I might just be missing a way to loop through the data properly, but I'd love to develop (or discover an existing) function that groups by every column and produces a text file or report with information about each column grouping.
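
One possible approach, as a rough sketch using dplyr and purrr (the `bridges` data frame and the output file name are just illustrative):

```r
library(dplyr)
library(purrr)

# Frequency table for a single column, most common values first
count_column <- function(data, col) {
  data %>%
    count(.data[[col]], sort = TRUE)
}

# Apply it to every column, returning a named list of frequency tables
column_counts <- function(data) {
  names(data) %>%
    set_names() %>%
    map(~ count_column(data, .x))
}

# Illustrative usage: dump every column's frequency table into one text report
# counts <- column_counts(bridges)
# capture.output(counts, file = "integrity_report.txt")
```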

iqis commented 5 years ago

I think there are some packages on CRAN that provide a framework for data QA, such as dataQualityR or datacheck.

There is also skimr, which is very high quality and robust.
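
For example, a quick sketch of how skimr could slot into the workflow above (again assuming a `bridges` data frame and an illustrative file name):

```r
library(skimr)

# One compact summary per column: type, missing values, basic stats
report <- skim(bridges)

# Capture the printed summary as a plain-text report
capture.output(print(report), file = "skim_report.txt")
```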