ipno-llead / extraction

Extraction repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

Document index export #8

Closed rajivsinclair closed 3 years ago

rajivsinclair commented 3 years ago
tarakc02 commented 3 years ago

I had something unexpected come up and am running a bit behind, but I should have this done tomorrow morning.

tarakc02 commented 3 years ago

The latest commit creates the subsetted PDFs, uploads them to Dropbox, and creates a table matching the one on wrgl, with additional Dropbox identifier/hash fields. I also added "hrg_text", which is just the text of the extracted hearing. That might be more useful for searching than "ocr_text", which is still the full text of the document (as before).

Anyway, @rajivsinclair, can you add me to the wrgl repo? My username is tarakc02. Once I have access, I'll set up some code to push the table on every update. Until then, I've uploaded documents.csv to /ppact/meeting-minutes-extraction/export

rajivsinclair commented 3 years ago

Sending you an invite now. It should arrive in your email inbox.

tarakc02 commented 3 years ago

Done! Now when we run make from the minutes root directory, every data processing step runs automatically, starting with the download from Dropbox and ending with uploading the new document PDFs to Dropbox and pushing the extracts (with their associated Dropbox links) to wrgl. Local output is written to the export directory.

I had to change the primary key because there are multiple hearings for some docids.
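
To illustrate why a docid alone stops working as a key, here is a minimal sketch. The column names "docid" and "hrg_no" and the sample rows are assumptions made up for the example, not the actual schema of the exported table.

```python
from collections import Counter

# Hypothetical rows: "docid" and "hrg_no" are illustrative column names,
# not necessarily the ones used in the real export.
rows = [
    {"docid": "a1", "hrg_no": 1, "hrg_text": "first hearing"},
    {"docid": "a1", "hrg_no": 2, "hrg_text": "second hearing"},
    {"docid": "b2", "hrg_no": 1, "hrg_text": "only hearing"},
]


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# docid alone is not unique once a document contains several hearings.
docid_dups = duplicate_keys(rows, ["docid"])
# A composite key (document plus hearing number) is unique in this sample.
composite_dups = duplicate_keys(rows, ["docid", "hrg_no"])

print(docid_dups)      # [('a1',)]
print(composite_dups)  # []
```

A check like this could run before pushing the table, so a key collision fails loudly instead of silently overwriting rows.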

Running the export task requires both Dropbox and wrgl credentials to be placed in the frozen directory. I'll update the README when I have a moment.
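
Until the README is updated, a pre-flight check along these lines could report missing credentials before the export task starts. This is only a sketch: the file names "dropbox-token.txt" and "wrgl-credentials.txt" are placeholders I made up, and the real names expected in the frozen directory may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical credential file names; the real ones are not stated in this thread.
REQUIRED = ["dropbox-token.txt", "wrgl-credentials.txt"]


def missing_credentials(frozen_dir, required=REQUIRED):
    """Return the required file names that are absent from frozen_dir."""
    frozen = Path(frozen_dir)
    return [name for name in required if not (frozen / name).is_file()]


if __name__ == "__main__":
    # Demonstrate with a throwaway directory that holds only one of the two files.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "dropbox-token.txt").write_text("placeholder")
        print(missing_credentials(tmp))  # ['wrgl-credentials.txt']
```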