Automating the text extraction process

opencleveland / drocer

Scraper and parser of Cleveland City Council's records and the produced text.


Open skorasaurus opened 7 years ago

skorasaurus commented 7 years ago

1] Downloading the PDFs and 2] parsing them to raw text (https://github.com/opencleveland/drocer/blob/master/records.py ), with the results stored in the proper directories.

3] Setting up an AWS account (or something else) to run this every so often.
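
A rough sketch of how the download and parse steps could be chained so that a scheduled job only does new work. This is not what records.py does; the `pdf/` and `txt/` directory layout, the function names, and the use of poppler's `pdftotext` command as the extractor are all assumptions for illustration. The extractor is passed in as a parameter so a PDFBox-based one could be swapped in.

```python
import subprocess
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_pdf(url, dest):
    """Download one PDF to dest (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        dest.write_bytes(resp.read())


def extract_text(pdf_path, txt_path):
    """Stand-in extractor: shells out to poppler's pdftotext (assumed installed)."""
    subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), str(txt_path)], check=True
    )


def run_pipeline(urls, base_dir, fetch=fetch_pdf, extract=extract_text):
    """Download each PDF and parse it to text, skipping files already present.

    Returns the text files produced during this run.
    """
    base = Path(base_dir)
    pdf_dir = base / "pdf"  # hypothetical layout, not the repo's
    txt_dir = base / "txt"
    pdf_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)

    produced = []
    for url in urls:
        name = Path(urlparse(url).path).name
        pdf_path = pdf_dir / name
        txt_path = txt_dir / (Path(name).stem + ".txt")
        if not pdf_path.exists():
            fetch(url, pdf_path)
        if not txt_path.exists():
            extract(pdf_path, txt_path)
            produced.append(txt_path)
    return produced
```

Because finished files are skipped, the same call can be repeated safely by cron or any other scheduler on whatever machine ends up hosting it.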

calarrick commented 7 years ago

Yes. I need to (more fully) document and present the PDFBox extensions that are now doing a better job of producing raw text in corrected order.