kthrog / VaxStats

VaxStats: an open vaccination data repository for the PNW + a data curation protocol project for LIS 598 J - Advanced Data Curation. Made by Alexis McClimans, Karalyn Ostler, and Kaitlin Throgmorton.
https://vaxstats.gitbook.io/vax-stats/

Curated datasets #13

Closed kthrog closed 5 years ago

kthrog commented 5 years ago

@nniiicc We're thinking we'll use Tabula to transform the PDF datasets we have, so that we end up with at least a set of CSV files, and then we'll make at least one curated dataset of all vaccination data across the PNW. I can explain this more on the call tomorrow, but we'd like to know what else we should do.

nniiicc commented 5 years ago

👍

kostler commented 5 years ago

@nniiicc I was wondering if there is much of a difference between .csv and .tsv files. From what I have seen, .tsv files can be better for data that contains commas (like most of ours), so I have been trying to convert our PDF files to .tsv. However, several of the tables in the PDFs have required more hands-on work to convert the headings, and I have been opening them in Excel to edit them more easily. Excel does not have an option to save as .tsv, only .csv, so I am ending up with a mix of .tsv and .csv files. Both are arguably good file types, but I could see a mix being annoying for people trying to load more than one of these files into software. Do you have any thoughts on this? Is it an issue, or am I thinking too hard?

nniiicc commented 5 years ago

Good questions...

There isn't really a difference between the two other than the delimiter and how each handles a delimiter that shows up inside a value. To use a CSV with values that contain commas, you would simply put double quotes around those values... So for example: something, something else, "something, else and the other thing", some other thing entirely (For a little more detail see: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file)
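To make the quoting rule concrete, here is a minimal Python sketch using the standard library's csv module. It writes the four example values from the comment above as one CSV row and reads them back. The helper names to_csv_line and from_csv_line are made up for illustration and are not part of this project.

```python
import csv
import io

# The four example values from the comment above; the third contains a comma.
row = [
    "something",
    "something else",
    "something, else and the other thing",
    "some other thing entirely",
]


def to_csv_line(values):
    """Serialize one row to CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()


def from_csv_line(line):
    """Parse one line of CSV text back into a list of values."""
    return next(csv.reader(io.StringIO(line)))


line = to_csv_line(row)
print(line, end="")
# something,something else,"something, else and the other thing",some other thing entirely

# Reading the line back yields the original four values, comma included.
assert from_csv_line(line) == row
```

Only the third value gets quotes, because it is the only one that contains the delimiter. Any CSV reader that follows the usual quoting convention should treat it as a single field.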

Mixing CSVs and TSVs probably isn't ideal for an end user. The good news is that you can easily convert these by re-saving a file as .csv or .tsv from a spreadsheet program. If you only need the extensions to match and you're on a Mac, just select all the files you want to change, right click and select "Show all file name extensions", then you can edit from there. You can also do this on the command line... If you switch to the folder where the files are located you can do something like: for file in *.tsv; do mv "$file" "${file%.tsv}.csv"; done. Keep in mind that renaming changes only the extension; the values inside the file stay tab-separated until the file is re-saved or converted.
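One caveat on the rename approach: mv changes only the file name, so a renamed .tsv still has tab-separated contents. Below is a rough Python sketch of a conversion that rewrites the delimiter as well. The function names tsv_to_csv and convert_folder are hypothetical. It assumes the files are UTF-8 and that the .tsv files do not use double quotes in unusual ways, since the reader treats double quotes as quoting characters.

```python
import csv
from pathlib import Path


def tsv_to_csv(tsv_path):
    """Rewrite one tab-delimited file as a comma-delimited file next to it."""
    tsv_path = Path(tsv_path)
    csv_path = tsv_path.with_suffix(".csv")
    # newline="" lets the csv module manage line endings itself.
    with tsv_path.open(newline="", encoding="utf-8") as src, \
            csv_path.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # quotes any field that contains a comma
        writer.writerows(reader)
    return csv_path


def convert_folder(folder="."):
    """Convert every .tsv file in `folder`; the originals are left in place."""
    return [tsv_to_csv(path) for path in sorted(Path(folder).glob("*.tsv"))]


# Example usage (assumption: run from the folder that holds the .tsv files):
# for written in convert_folder("."):
#     print(f"wrote {written}")
```

Unlike the rename, this leaves the original .tsv files untouched and writes a new .csv alongside each one, so it is worth spot-checking a couple of outputs before deleting anything.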

kostler commented 5 years ago

Oregon and Idaho .csv files with only table data uploaded. Washington needs to be saved into .csv files. I will start on the README file tomorrow.

kostler commented 5 years ago

Datasets done!