Approximate deduplication at scale: LSH to the rescue
Description
Recent advancements in deep learning have opened a pandora box of applications utilising NLP techniques to solve business problems. But one of the important tasks starts at the pre-processing stage which involves deduplicating similar documents. Though one can use various algorithms like Jaccard distance, cosine distance, Jaro Winkler, Levenshtein distance(depending on the document size and the nature of data) to find the similarity among documents, scaling them to a dataset of around millions is not a very time optimised approach as the number of operations scale in the order of O(N^2). In this talk I will talk about LSH and minhash based deduplication approach which at a small comprise on accuracy can quickly reduce the problem to O(N) complexity, which when put in actual numbers reduced our pre-processing time from around 48 hours to around 15 mins for a problem involving deduplicating millions of company names.
Duration
[ ] 30 min
[x] 45 min
Audience
This talk is intended for people who are interested in ML engineering especially in NLP domain and basic familiarity in Python, probability and algorithms should be sufficient enough.
Outline
detailed description of the problem statement
Application of deduplication in practical cases
Walkthrough of the results and time complexity of the solution based on LSH
Using Minhash based LSH in python
Performance improvement as compared to the naive approach
Comparative analysis of results obtained from probabilistic and deterministic approaches
Minhash based document similarity
Abstraction of Similarity of documents
Jaccard similarity and its approximation my Minhash
Theoretical understanding of how LSH works
Selecting the signature size and bucket size
About Myself
I have around 5 years of experience in machine Learning domain with exposure to multiple industries like Fintech, Insurtech and eCommerce. I personally like to work on developing machine learning products and one of the product we developed at my last company is currently used on millions of financial transactions daily.
LinkedIn: https://www.linkedin.com/in/chirag-yadav-85227340/
[ ] Don't record this talk.
Check this if you don't want your talk to be recorded.
Approximate deduplication at scale: LSH to the rescue
Description
Recent advancements in deep learning have opened a pandora box of applications utilising NLP techniques to solve business problems. But one of the important tasks starts at the pre-processing stage which involves deduplicating similar documents. Though one can use various algorithms like Jaccard distance, cosine distance, Jaro Winkler, Levenshtein distance(depending on the document size and the nature of data) to find the similarity among documents, scaling them to a dataset of around millions is not a very time optimised approach as the number of operations scale in the order of O(N^2). In this talk I will talk about LSH and minhash based deduplication approach which at a small comprise on accuracy can quickly reduce the problem to O(N) complexity, which when put in actual numbers reduced our pre-processing time from around 48 hours to around 15 mins for a problem involving deduplicating millions of company names.
Duration
Audience
Outline
About Myself
I have around 5 years of experience in machine Learning domain with exposure to multiple industries like Fintech, Insurtech and eCommerce. I personally like to work on developing machine learning products and one of the product we developed at my last company is currently used on millions of financial transactions daily. LinkedIn: https://www.linkedin.com/in/chirag-yadav-85227340/