pydatabangalore / talks

Talks at PyData Bangalore meetups

Approximate deduplication at scale: LSH to the rescue #7

Closed chiragyadav closed 5 years ago

chiragyadav commented 5 years ago

Description

Recent advances in deep learning have opened a Pandora's box of applications that use NLP techniques to solve business problems. An important task, however, starts at the pre-processing stage: deduplicating similar documents. One can measure similarity between documents with metrics such as Jaccard distance, cosine distance, Jaro-Winkler, or Levenshtein distance (depending on the document size and the nature of the data), but comparing every pair does not scale to datasets with millions of records, because the number of comparisons grows as O(N^2). In this talk I will cover a deduplication approach based on MinHash and locality-sensitive hashing (LSH), which, at a small compromise on accuracy, reduces the problem to roughly O(N) complexity. In actual numbers, it cut our pre-processing time from around 48 hours to around 15 minutes for a problem involving deduplicating millions of company names.
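To make the idea concrete, here is a minimal sketch of MinHash with LSH banding, using only the Python standard library. It is not the implementation from the talk: the character 3-gram shingling, the signature length `NUM_PERM`, the band count `BANDS`, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed value, not from the talk)
BANDS = 16     # number of LSH bands; rows per band = NUM_PERM // BANDS


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: for each seed, the smallest hash value over the token set."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def candidate_pairs(names):
    """Return index pairs whose signatures agree on every row of at least one band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash(shingles(name))
        for b in range(BANDS):
            # Items sharing a whole band land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidates."""
    return len(a & b) / len(a | b)


names = ["Acme Corp", "acme  corp", "Zebra Holdings Ltd"]
for i, j in sorted(candidate_pairs(names)):
    print(names[i], "|", names[j], "|", jaccard(shingles(names[i]), shingles(names[j])))
```

Each record is hashed once and only records that share a bucket are compared, which is where the saving over all-pairs comparison comes from. More bands with fewer rows per band catch less similar pairs at the cost of more false candidates, so candidates are typically verified with an exact metric afterwards.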

Duration

Audience

This talk is intended for people interested in ML engineering, especially in the NLP domain. Basic familiarity with Python, probability, and algorithms should be sufficient.

Outline

About Myself

I have around 5 years of experience in the machine learning domain, with exposure to multiple industries such as Fintech, Insurtech, and eCommerce. I like working on machine learning products, and one of the products we developed at my last company is currently used on millions of financial transactions daily. LinkedIn: https://www.linkedin.com/in/chirag-yadav-85227340/


vinayak-mehta commented 5 years ago

@chiragyadav Can you please post the link to this talk's slides?