pydatabangalore / talks

Talks at PyData Bangalore meetups

FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks #12

Closed thakur-nandan closed 5 years ago

thakur-nandan commented 5 years ago

Title

FlashText – A Python Library 28x faster than Regular Expressions for NLP tasks

Description

Data science starts with data cleaning. Developers working with text usually clean it up first: sometimes by replacing keywords (“Javascript” with “JavaScript”), and other times by checking whether a keyword (“JavaScript”) is mentioned in a document. Datasets keep growing, from tens of thousands to millions of documents, and cleaning them with RegEx can take days (about 5 days for ~20K keywords and 3 million documents). FlashText, a very fast library, cuts that computation time down to a few minutes (about 15 minutes for the same ~20K keywords and 3 million documents). It can search for and replace keywords in text very quickly, and is implemented using an approach based on the Aho-Corasick algorithm and the trie data structure.
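
To make the trie idea in the description concrete, here is a minimal pure-Python sketch of trie-based keyword search and replace. It is not FlashText's actual code: the class `KeywordTrie`, the helper `_is_word_char`, and the word-boundary rule are assumptions made for illustration only.

```python
def _is_word_char(ch):
    # Hypothetical boundary rule: letters, digits and "_" count as word characters.
    return ch.isalnum() or ch == "_"


class KeywordTrie:
    """Toy trie-based keyword matcher (illustrative only, not FlashText itself)."""

    _END = "_end_"  # sentinel key; never collides with single-character keys

    def __init__(self):
        self.root = {}

    def add_keyword(self, keyword, clean_name=None):
        # Store the keyword character by character, case-insensitively.
        node = self.root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self._END] = clean_name or keyword

    def _match_at(self, text, start):
        # Walk the trie from `start`; remember the longest keyword that ends
        # on a word boundary. Returns (end_index, clean_name) or None.
        node, best, i = self.root, None, start
        while i < len(text) and text[i].lower() in node:
            node = node[text[i].lower()]
            i += 1
            if self._END in node and (i == len(text) or not _is_word_char(text[i])):
                best = (i, node[self._END])
        return best

    def _scan(self, text):
        # Yield (start, end, clean_name) for non-overlapping matches, left to right.
        i = 0
        while i < len(text):
            at_boundary = i == 0 or not _is_word_char(text[i - 1])
            match = self._match_at(text, i) if at_boundary else None
            if match:
                end, name = match
                yield i, end, name
                i = end
            else:
                i += 1

    def extract_keywords(self, text):
        return [name for _, _, name in self._scan(text)]

    def replace_keywords(self, text):
        out, last = [], 0
        for start, end, name in self._scan(text):
            out.append(text[last:start])
            out.append(name)
            last = end
        out.append(text[last:])
        return "".join(out)


trie = KeywordTrie()
trie.add_keyword("Javascript", "JavaScript")
trie.add_keyword("java script", "JavaScript")

sample = "I like javascript and Java Script."
found = trie.extract_keywords(sample)
cleaned = trie.replace_keywords(sample)
```

In this sketch the text is walked through one shared trie, so the work per document grows with the length of the text (and of the keywords), not with the number of keywords. That is the intuition behind the speed-up over trying many regex alternatives.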

Duration

Audience

This talk is aimed at people interested in ML engineering, especially in the NLP domain. Basic familiarity with Python, dictionaries, and regex should be sufficient.

Outline

Slides can be viewed here: https://docs.google.com/presentation/d/1qv0EKUCmjcvbIMDJSfUYvmpG_nlmFznZzQOM14JEyZE/edit?usp=sharing

[0-3mins]: Brief introduction about myself, an introduction to FlashText, and a comparison of FlashText vs. regular expression performance.

[3-10mins]: How is FlashText so blazingly fast?

[10-15mins]: When to Use FlashText?

[15-20mins]: Installing FlashText.

[20-24mins]: UseCase 1: Code – Searching for words in a text document

[24-28mins]: UseCase 2: Code – Replacing words in a text document

[28-30mins]: End Notes and Feedback for Future Talks.
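
For context on the regex side of the comparison in the outline, a baseline for the two use cases (searching and replacing) could look like the sketch below. The talk's actual demo code is not part of this proposal, so the keyword dictionary and the function names `regex_find_keywords` and `regex_replace_keywords` are assumptions for illustration.

```python
import re

# Hypothetical keyword -> clean name mapping (not taken from the talk).
keywords = {"Javascript": "JavaScript", "java script": "JavaScript"}

# One alternation over all keywords, longest first so longer phrases win.
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

# Lowercased lookup so matches in any casing map to the clean name.
lookup = {k.lower(): v for k, v in keywords.items()}


def regex_find_keywords(text):
    # Use case 1: report which keywords are mentioned in the text.
    return [lookup[m.group(0).lower()] for m in pattern.finditer(text)]


def regex_replace_keywords(text):
    # Use case 2: replace every keyword with its clean name.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


sample = "I like javascript and Java Script."
regex_found = regex_find_keywords(sample)
regex_cleaned = regex_replace_keywords(sample)
```

With a handful of keywords this works well. The proposal's point is that with tens of thousands of keywords the regex approach becomes slow, which is where a trie-based matcher is meant to help.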

Additional notes

The repository has 2700+ stars on GitHub, and the accompanying article has 15,000+ claps on Medium. Radim Rehurek (founder of RaRe Technologies (Gensim)) has tweeted about this repository here: https://twitter.com/RadimRehurek/status/904989624589803520

Medium Article: https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/ (15,000+ claps)

GitHub Repo: https://github.com/vi3k6i5/flashtext (2700+ stars)

FlashText Documentation: https://buildmedia.readthedocs.org/media/pdf/flashtext/latest/flashtext.pdf

FlashText Research Paper: https://arxiv.org/pdf/1711.00046.pdf

LinkedIn: https://linkedin.com/in/nthakur20/

Video Preview: https://youtu.be/s8WP79QU1zw

Slides: https://docs.google.com/presentation/d/1qv0EKUCmjcvbIMDJSfUYvmpG_nlmFznZzQOM14JEyZE/edit?usp=sharing

About Me: My name is Nandan Thakur. I am a BITS graduate currently working as a Data Scientist (R&D) at Knolskape, Bangalore. I am a quick, perpetual learner, keen to explore the realm of data analytics and data science. I am deeply excited about the times we live in, the rate at which data is being generated, and how it is being transformed into an asset. I am well versed in domains such as Natural Language Processing, Machine Learning, and Signal Processing, and have a keen interest in learning interdisciplinary concepts involving Machine Learning. I look forward to being involved in more tech meetups and contributing more actively to open source.

NirantK commented 5 years ago

Hiya fellow Belonger @NThakur20 ! Please do say hello if you attend the meetup on 13th July!

thakur-nandan commented 5 years ago

Sure @NirantK will do :+1:

vinayak-mehta commented 5 years ago

@NThakur20 Are you available to present it this Saturday?

thakur-nandan commented 5 years ago

Hey @vinayak-mehta, sure, I'll be free this Saturday to present this talk.

vinayak-mehta commented 5 years ago

Awesome, see you on Saturday! I've updated the agenda on the Meetup event.