nsaef / text_exploration

Tool for analyzing large unstructured collections of digital text documents. Master's thesis in Digital Humanities.

Mark duplicates and versions #44

Closed nsaef closed 6 years ago

nsaef commented 6 years ago

Document model: add properties "duplicate_of" and "version_candidate_of" (Foreign Key)

  1. Test various hashing options, compare results, see which one identifies versions most reliably
  2. Check data type of hash values
  3. Update model with fields above + field "hash"
  4. Add button "find duplicates" and "find version candidates"
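
For the exact-duplicate half of the plan, a minimal sketch could look like the following. It is not the code from this repository: the helper names `content_hash` and `find_duplicates`, the whitespace and case normalization, and the choice of SHA-256 are all assumptions made for illustration. It also touches step 2, since a SHA-256 hex digest is always a 64-character string and could be stored in a fixed-length character field.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest (64 characters) of the normalized text.

    Normalization (collapse whitespace, lowercase) is an assumption; without
    it, only byte-identical documents would be treated as duplicates.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[int, str]) -> dict[int, int]:
    """Map each duplicate document id to the id of the first document
    with the same content hash (the hypothetical "duplicate_of" value)."""
    first_seen: dict[str, int] = {}
    duplicate_of: dict[int, int] = {}
    for doc_id in sorted(docs):
        digest = content_hash(docs[doc_id])
        if digest in first_seen:
            duplicate_of[doc_id] = first_seen[digest]
        else:
            first_seen[digest] = doc_id
    return duplicate_of


docs = {1: "Hello world", 2: "hello   world", 3: "Something else"}
print(find_duplicates(docs))
```

An exact hash like this only catches duplicates. Version candidates differ in content, so they need a similarity-preserving scheme such as MinHash.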
nsaef commented 6 years ago

Status: Library implemented successfully. Next step: integrate in Django and test results.

Trying algorithms other than MinHash might be helpful (like LSH Ensemble: https://ekzhu.github.io/datasketch/lshensemble.html)

nsaef commented 6 years ago

MinHash works. No success with LSH Ensemble (yet). Closing for now.
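
For readers unfamiliar with MinHash, here is a rough standard-library sketch of the idea. It is not the library used in this thread and not the code in this repository. The function names, the 3-word shingles, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all illustrative assumptions.

```python
import hashlib

NUM_PERM = 128  # number of hash functions (assumed value)


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
revised = "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk"
score = estimate_jaccard(
    minhash_signature(shingles(original)),
    minhash_signature(shingles(revised)),
)
print(f"estimated similarity: {score:.2f}")
```

Document pairs whose estimate is high but below 1.0 would be the natural "version_candidate_of" matches. Where to set the threshold is a tuning decision this thread does not settle.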