Open tarakc02 opened 1 year ago
The BERT summarizer has limited deduplication ability because: 1) multiple officers can be treated as one officer, 2) names can be misspelled, 3) the same officer can appear under multiple titles.
^ understood. It seems like our ideal outcome from deduplication would cover everything the summarizer is already doing plus the additional within-document deduplication you've identified, along with the across-document deduplication. so do we consider the summarizer step as a kind of pre-processing of document data for the deduplication step? what would happen if we just took the responses to all of the prompts and fed everything into the deduplication step -- wouldn't that also fix our issue of multiple responses across prompts? not advocating for that, just a thought experiment to better understand what role the summarizer plays for us.
Yes, definitely. The summarizer was integrated early on as an experiment. I'm for thinking about its inclusion/exclusion within the greater deduplication task, as opposed to as a distinct process.
makes sense! it makes me wonder about the usefulness of a summarizer more generally as a step in deduplication. Like we ask an encoder for a canonical representation of each input record which then solves deduplication. I guess it is a little bit similar to the idea of using auto-encoders for record linkage (for example). anyways, thanks for helping think through how things work. will continue to put more general/brainstorm questions in this thread.
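To make the "canonical representation" idea a bit more concrete, here is a minimal sketch in Python. It is not how the summarizer or the project's deduplication step works. It assumes a toy rule-based `canonical_key` function standing in for an encoder, a made-up `TITLES` list, and made-up officer mentions pooled from several prompts. All names here are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical set of rank/title tokens to ignore; not taken from the project.
TITLES = {"officer", "ofc", "sgt", "sergeant", "det", "detective", "lt", "lieutenant"}


def canonical_key(record: str) -> str:
    """Toy stand-in for an encoder: map a raw mention to a canonical form.

    Lowercases, drops punctuation and title tokens, and sorts the remaining
    name tokens so that "Smith, John" and "John Smith" produce the same key.
    """
    tokens = re.findall(r"[a-z]+", record.lower())
    name_tokens = [t for t in tokens if t not in TITLES]
    return " ".join(sorted(name_tokens))


def deduplicate(records):
    """Group raw mentions that share a canonical key."""
    groups = defaultdict(list)
    for record in records:
        groups[canonical_key(record)].append(record)
    return dict(groups)


# Made-up mentions, as if pooled from the responses to several prompts.
mentions = ["Sgt. John Smith", "Officer Smith, John", "Det. Jane Doe", "john smith"]
groups = deduplicate(mentions)

for key, members in groups.items():
    print(key, "<-", members)
```

In this toy version, title variants and name ordering collapse into one group, but an exact-key match does nothing for misspelled names and would wrongly merge two different officers who share a name. A learned encoder or a fuzzy comparison step would have to cover those cases.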
using this issue to track thoughts/notes about the tech-corner blog post that are more than just a simple edit.
First q: what is the relationship between the bert summarizer step and the deduplication step? does the summarizing improve our ability to deduplicate? or do we not need to deduplicate within documents after doing the summarizing?