Automatically cleaning unicode text

dimenwarper commented 7 years ago

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. invalid Unicode characters). It could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.

rhiever commented 7 years ago

Sounds promising. Please submit a PR with the new functionality along with unit tests to demonstrate how it works.

dimenwarper commented 7 years ago

I've implemented a draft of this but realized it may clash with the existing functionality that converts all text to numerical values. I'm not sure how to proceed. As I see it, there are two options:

1. Fix the text before applying the encoding: This is what I'm doing right now, so strings like >=50 and >=50' get encoded to the same label.
2. Make encoding optional: This is tricky. There will be some text-based columns where you want to preserve the text to featurize later (e.g. with a sklearn.feature_extraction.text.TfidfVectorizer) rather than convert it to a label with an encoder. The hard part is how to specify which columns should be encoded and which should not.

One way to proceed would be to go with option 1 now and then tackle option 2 in a later issue.
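To make the two options concrete, here is a rough sketch, not the actual draft. It assumes a plain dict of column lists rather than a DataFrame, and it uses `unicodedata.normalize` from the standard library as a stand-in for ftfy so it runs without extra dependencies. The names `clean_text`, `clean_and_encode`, and the `keep_text` parameter are hypothetical and only illustrate how columns might be excluded from encoding.

```python
import unicodedata


def clean_text(value):
    # Stand-in cleaner (assumption): NFKC normalization plus whitespace stripping.
    # A real integration could pass ftfy.fix_text as `fix` instead.
    return unicodedata.normalize("NFKC", value).strip()


def clean_and_encode(columns, keep_text=(), fix=clean_text):
    """Clean every text column, then label-encode all except those in keep_text.

    `columns` maps a column name to a list of strings (hypothetical layout).
    """
    result = {}
    for name, values in columns.items():
        # Option 1: fix the text first, so variants collapse to one string.
        cleaned = [fix(v) for v in values]
        if name in keep_text:
            # Option 2: keep the cleaned text for later featurization.
            result[name] = cleaned
        else:
            # Map each distinct cleaned string to an integer label.
            mapping = {text: i for i, text in enumerate(sorted(set(cleaned)))}
            result[name] = [mapping[text] for text in cleaned]
    return result


data = {
    # Fullwidth characters and a trailing non-breaking space hide duplicates.
    "income": [">=50", "\uff1e\uff1d\uff15\uff10", "<50", ">=50\u00a0"],
    # Precomposed vs. combining-accent forms of the same word.
    "notes": ["caf\u00e9 ", "cafe\u0301", "tea", "tea"],
}

encoded = clean_and_encode(data, keep_text={"notes"})
```

With this layout, the three variants of `>=50` in `income` share one label, while `notes` stays as cleaned text that could be handed to a vectorizer later. Passing an explicit set of column names is only one possible way to specify what to skip; a dtype- or cardinality-based heuristic would be another.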

rhiever / datacleaner

Automatically cleaning unicode text #13