ianozsvald / data_science_delivered

Observations from Ian on successfully delivering data science products

Data cleaning tools #3

Closed: fgregg closed this issue 8 years ago

fgregg commented 8 years ago

Really interesting read!

Don't know if you are looking to list specific data cleaning tools, but we've built a few that are useful in our own work.

https://github.com/datamade/dedupe
https://github.com/datamade/usaddress
https://github.com/datamade/probablepeople
https://github.com/datamade/parserator

For making reproducible data workflows, we also use Make (https://github.com/datamade/data-making-guidelines). Would be interested to hear how you structure your data steps.

ianozsvald commented 8 years ago

Hi Forest, cheers for the comments. I know dedupe; the others are cool, and parserator looks very cool (I've had a tiny bit of experience with crfsuite before and need to spend some time playing). I'm just finishing IP delivery for my client this week; from next week I'm much more free and I'll think about integrating these links. In terms of "do I want to add specific data cleaning tools" - oh my golly gosh yes. The number of hours I've burned doing this all by hand is beyond silly. We can maybe save others some of the pain :-) Cheers, Ian.

ianozsvald commented 8 years ago

Hey Forest. Way late I know - I've added your examples into the doc - many thanks! Ian.