implement blogpost feedback from Megan

pasting the email here to make it easier to track our to-dos:

Hi Tarak and Ayyub,

Thanks again for writing and sharing this technical blogpost! Overall, I think it’s great and a very clear description of a very complicated process.

To your specific questions below, I think the approach is very well motivated by IPNO’s work. I can’t speak to whether you’re missing any further exploratory steps or performance evaluation ideas (it all makes sense to me, but it’s also new to me, so I’m not sure I would notice any missing options).

Beyond that, all of my comments/suggestions are about language and structure, which are largely matters of personal preference, so feel free to use/disregard as you prefer:

I think the post itself could use a tiny bit more intro/framing, but it’s possible that could come from a more traditional blogpost linking to the technical post (like we did with Tarak’s technical post about scraping documents here).

The first sentence uses the word “vindicate” twice; is it possible to replace the first use with a different word?

I would take another sentence to lead into the paragraph after 2. Page classification. Maybe something like “Page classification involves building a classification model to categorize files into these different types of documents. One approach is to fine-tune a pretrained convolutional neural network to label thumbnail images of document pages, as described in Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval.”

I also wonder if you could add a sentence here explicitly stating that since thumbnails are smaller files, processing them is faster and uses fewer resources, which is why processing the thumbnails is preferable.

I think the regex counterexample is useful, but I’m not sure that the evaluation statistics need to be presented twice. I would suggest moving the explanatory paragraphs about the different statistics to the corresponding input/output boxes with the summary tables (12 and 13), rather than having the bullet point results with explanatory sections and then the summary tables repeating the statistics.

I bumped a little on your description of cross-referencing LLEAD as a way to filter out false positives. Is it possible some officers would be named in exoneration records who are not currently included in LLEAD? For this particular analysis, if we’re only interested in officers who appear in LLEAD, I think that might need to be spelled out a bit more or reiterated, since I had lost track of that by the time I got to the process_single_document function.

The closing section on evaluation, issues, and improvements implies that these are pointers to next steps, but I think it could use a closing sentence or paragraph to really wrap things up.

This is a rather long post (which is fine, there’s a lot to cover). Are there logical ways to split it into multiple parts? I have some ideas, but I’m curious to hear your reaction to a potentially multi-part post before I suggest them. :-)
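
Note for the LLEAD cross-referencing to-do: a minimal sketch of one way the post could make the filter explicit, by showing that extracted names with no LLEAD match are set aside rather than silently dropped. This is not the code behind process_single_document; the function name `split_by_llead`, the normalization rule, and the sample names are all hypothetical and only illustrate the point in Megan's comment.

```python
def split_by_llead(extracted_names, llead_names):
    """Partition extracted officer names by whether they appear in LLEAD.

    Hypothetical helper for illustration; not taken from the blogpost code.
    """
    def normalize(name):
        # Assumed normalization: lowercase and collapse repeated whitespace.
        return " ".join(name.lower().split())

    known = {normalize(n) for n in llead_names}
    matched, unmatched = [], []
    for name in extracted_names:
        # Keep the original spelling, but compare on the normalized form.
        (matched if normalize(name) in known else unmatched).append(name)
    return matched, unmatched


# Made-up names, for illustration only.
llead_names = ["John Doe", "Jane Roe"]
extracted_names = ["john  doe", "Sam Poe", "Jane Roe"]

matched, unmatched = split_by_llead(extracted_names, llead_names)
print("in LLEAD:", matched)
print("not in LLEAD (set aside for review):", unmatched)
```

Keeping the unmatched list visible would let the post state plainly that the analysis only covers officers who appear in LLEAD, and that other names in exoneration records are out of scope rather than false positives.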

ipno-llead / US-IPNO-exonerations

implement blogpost feedback from Megan #17