rajivsinclair closed this 3 years ago
I had something unexpected come up and am running a bit behind, but I should have this done tomorrow morning.
The latest commit creates the subsetted pdfs, uploads them to Dropbox, and creates a table matching the one on wrgl with the additional dropbox identifier/hash fields. I added "hrg_text", which is just the text of the extracted hearing -- that might be more useful for searching than "ocr_text" which is the full text of the document (as before).
Anyway, @rajivsinclair -- can you add me to the wrgl repo? My username is `tarakc02`. Once I have access, I'll set up some code to push the table on every update. Until then, I've uploaded `documents.csv` to `/ppact/meeting-minutes-extraction/export`.
Sending you an invite now. It should arrive in your email inbox.
done! now when we run `make` from the `minutes` root directory, every data processing step starting with downloading from dropbox and ending with uploading new document pdfs to dropbox and pushing the extracts with the associated dropbox links to wrgl is run automatically, with local output in the `export` directory.
I had to change the primary key because there are multiple hearings for some docids.
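To illustrate why `docid` alone no longer works as a key, here is a minimal sketch that checks key uniqueness over a few made-up rows. The column name `hrg_no` and the row values are assumptions for illustration only; the actual key columns used in the table are not stated in this thread.

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical rows: docid "a1" contains two hearings.
rows = [
    {"docid": "a1", "hrg_no": 1},
    {"docid": "a1", "hrg_no": 2},
    {"docid": "b2", "hrg_no": 1},
]

# docid alone is duplicated, so it cannot serve as the primary key.
print(duplicate_keys(rows, ["docid"]))            # [('a1',)]
# Adding a per-hearing column (assumed name) makes every key unique.
print(duplicate_keys(rows, ["docid", "hrg_no"]))  # []
```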
Running the `export` task requires both dropbox and wrgl credentials to be placed into the `frozen` directory. I'll update the README when I have a moment.
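Until the README is updated, a quick pre-flight check along these lines could flag missing credentials before the task runs. This is only a sketch: the two file names are hypothetical placeholders, since the thread does not say what the credential files are called.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; the real ones are not given in this thread.
REQUIRED = ("dropbox-token.txt", "wrgl-credentials.txt")


def missing_credentials(frozen_dir, required=REQUIRED):
    """Return the required credential files that are absent from frozen_dir."""
    base = Path(frozen_dir)
    return [name for name in required if not (base / name).is_file()]


# Demo against a temporary directory standing in for `frozen`.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dropbox-token.txt").write_text("placeholder")
    print(missing_credentials(tmp))  # ['wrgl-credentials.txt']
```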
- Generate a new PDF file for each `docid` (with only the pages for that document)
- Generate an index table (CSV) of all documents to match the columns in the `documents` table on WRGL, with the path to each `docid` and the Dropbox file identifier and Dropbox file hash
- Push/commit this index table to WRGL via command line client
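The index-table step can be sketched roughly as follows. This is not the code in the repo: the column names `db_id` and `db_hash` are assumptions, the Dropbox identifier is passed in as if it had already been returned by an upload, and a plain SHA-256 digest stands in for Dropbox's own content hash, which is computed differently.

```python
import csv
import hashlib
import io

# Assumed column names; the real `documents` table may differ.
FIELDS = ["docid", "path", "db_id", "db_hash"]


def build_index(docs):
    """Build index rows from (docid, path, pdf_bytes, dropbox_id) tuples."""
    rows = []
    for docid, path, pdf_bytes, dropbox_id in docs:
        rows.append({
            "docid": docid,
            "path": path,
            "db_id": dropbox_id,
            # Stand-in digest, NOT Dropbox's content_hash algorithm.
            "db_hash": hashlib.sha256(pdf_bytes).hexdigest(),
        })
    return rows


def to_csv(rows):
    """Serialize index rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical document; bytes and identifier are placeholders.
index = build_index([("a1", "export/a1.pdf", b"%PDF-placeholder", "id:example")])
print(to_csv(index))
```

The resulting CSV would then be what gets committed to WRGL with the command line client; that step is left out here because it depends on the repo's wrgl configuration.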