pasting the email here to make it easier to track our to-dos:
Hi Tarak and Ayyub,
Thanks again for writing and sharing this technical blogpost! Overall, I
think it’s great and a very clear description of a very complicated
process.
To your specific questions below, I think the approach is very well
motivated by IPNO’s work. I can’t speak to whether you’re missing
any further exploratory steps or performance evaluation ideas (it all
makes sense to me, but it’s also new to me, so I’m not sure I would
notice any missing options).
Beyond that, all of my comments/suggestions are about language and
structure, which are largely matters of personal preference, so feel
free to use/disregard as you prefer:
I think the post itself could use a tiny bit more intro/framing, but
it’s possible that could come from a more traditional blogpost linking
to the technical post (like we did with Tarak’s technical post about
scraping documents here).
The first sentence uses the word “vindicate” twice; is it possible to
replace the first use with a different word?
I would take another sentence to lead into the paragraph after the “2. Page
classification” heading. Maybe something like “Page classification involves
building a classification model to categorize files into these different
types of documents. One approach is to fine-tune a pretrained
convolutional neural network to label thumbnail images of document
pages, as described in Evaluation of Deep Convolutional Nets for
Document Image Classification and Retrieval.”
I also wonder if you could add a sentence here explicitly stating that
since thumbnails are smaller files, the processing is faster/uses fewer
resources, hence processing the thumbnails is preferable.
I think the regex counterexample is useful, but I’m not sure that
the evaluation statistics need to be presented twice. I would suggest
moving the explanatory paragraphs about the different statistics to the
corresponding input/output boxes with the summary tables (12 and 13).
(Rather than having the bullet point results with explanatory sections,
then the summary tables repeating the statistics.)
I bumped a little on your description of cross-referencing LLEAD as a
way to filter out false positives. Is it possible some officers would be
named in exoneration records who are not currently included in LLEAD?
For this particular analysis, if we’re only interested in officers
that appear in LLEAD, I think that might need to be spelled out a bit
more or reiterated, since I had lost track of that by the time I got to
the process_single_document function.
The closing section on evaluation, issues, and improvements implies
that these are pointers to next steps, but I think it could use a
closing sentence or paragraph to really wrap things up.
This is a rather long post (which is fine, there’s a lot to cover).
Are there logical ways to split it into multiple parts? I have some
ideas, but curious to hear your reaction to a potentially multi-part
post before I suggest them. :-)