stephbuon / digital-history

Instructional repository for "Text Mining as Historical Method"
GNU General Public License v3.0

get houston city council data #21

Closed stephbuon closed 3 years ago

stephbuon commented 3 years ago

Eric:

All the data I scraped from the Houston City Council can be found on M2 here: /scratch/group/oit_research_data/houston_city_council_minutes

This directory also has all the Dallas city council minutes, as well as the data (Reddit, Congress, COVID, etc.) from last semester. Let me know if you have trouble accessing the directory.

The Houston minutes are PDFs (that is what was available on their website), so you might need to figure out a good way to extract the text from them if you want students to be able to use the text with ease. I've used PyPDF2 in the past, but there are other options.

stephbuon commented 3 years ago

@alexanderr: per Eric's message above, we have a bunch of PDFs of Houston City Council minutes. We would like the PDFs' contents output as CSV files (or any other format that is easily read into a pandas DataFrame).

Can you please use PyPDF2 (or a comparable module) and write this code?

Once you have, please 1) send me the location of the data and 2) upload your code to digital-history/utilities (for re-use purposes).
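As a starting point, the request above could be sketched roughly as follows. This is a minimal sketch, not the final utility: the function names (`extract_pages`, `pdf_rows`, `write_csv`) and the one-row-per-page layout with `file`, `page`, and `text` columns are assumptions made for illustration, and it assumes a PyPDF2 release that provides `PdfReader` (2.0 or later). PyPDF2 only reads text embedded in the PDF, so scanned minutes would come back empty and would need OCR instead.

```python
import csv
from pathlib import Path


def extract_pages(pdf_path):
    """Return the text of each page of one PDF as a list of strings."""
    # Imported here so the rest of the module works without PyPDF2,
    # e.g. when a different extractor is passed to pdf_rows().
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pdf_rows(pdf_dir, extract=extract_pages):
    """Yield one dict per page for every *.pdf file directly in pdf_dir."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            pages = extract(pdf_path)
        except Exception as exc:
            # Assumption: a damaged PDF should be reported and skipped,
            # not stop the whole batch.
            print(f"skipping {pdf_path.name}: {exc}")
            continue
        for number, text in enumerate(pages, start=1):
            yield {"file": pdf_path.name, "page": number, "text": text}


def write_csv(rows, out_path):
    """Write the page rows to a CSV file and return how many were written."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "page", "text"])
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

To run it against the M2 directory, one would call something like `write_csv(pdf_rows("/scratch/group/oit_research_data/houston_city_council_minutes"), "houston_minutes.csv")`, where the output filename is a placeholder. The result should then load with `pandas.read_csv("houston_minutes.csv")`. Keeping one row per page preserves page numbers for citation; grouping by `file` and joining `text` gives one document per meeting if that suits the class better.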