I'm not sure how you guys go about this, but one of the pain points for journalists when working with data is to conduct integrity checks on the data. Whether the dataset has 10 columns or 150, our standard practice is to group by each column and produce a table of either all results or the most common results.
We typically do this in SQL:
SELECT state, COUNT(*) FROM bridges GROUP BY state ORDER BY COUNT(*) DESC;
or in R:
bridges %>% group_by(state) %>% summarize(count = n()) %>% arrange(desc(count))
I might just be missing a way to loop through the data properly, but I'd love to develop (or discover a pre-existing) function that groups by every column and produces a text file or report with information about each column grouping.
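One way to sketch this without any special packages: loop over every column, tally value counts, and write the top results for each column to one report. This is a minimal Python illustration, not an established tool; the `sample` CSV text and the function name `column_report` are hypothetical stand-ins for your own data and naming.

```python
import csv
from collections import Counter
from io import StringIO

def column_report(rows, top_n=10):
    """Group by each column: count distinct values and list the most common.

    rows: a list of dicts, e.g. from csv.DictReader.
    Returns the report as one string, ready to write to a text file.
    """
    counters = {}
    for row in rows:
        for col, val in row.items():
            counters.setdefault(col, Counter())[val] += 1

    lines = []
    for col, counter in counters.items():
        lines.append(f"== {col} ({len(counter)} distinct values) ==")
        # most_common sorts descending by count, like ORDER BY COUNT(*) DESC
        for val, n in counter.most_common(top_n):
            lines.append(f"{val}\t{n}")
        lines.append("")  # blank line between column sections
    return "\n".join(lines)

# Hypothetical sample standing in for a bridges.csv file
sample = "state,status\nTX,open\nTX,closed\nOH,open\n"
rows = list(csv.DictReader(StringIO(sample)))
print(column_report(rows))
```

To save the report instead of printing it, write the returned string to a file (e.g. `open("report.txt", "w").write(column_report(rows))`). The same loop-over-columns idea translates to dplyr with something like `purrr::map` over `names(bridges)`, if you'd rather stay in R.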