Closed by fgregg 8 years ago
Hi Forest, cheers for the comments. I know dedupe; the others are cool, and parserator looks very cool (I've had a tiny bit of experience with crfsuite before and need to spend some time playing). I'm just finishing IP delivery for my client this week. From next week I'm much more free and I'll think about integrating these links. In terms of "do I want to add specific data cleaning tools" - oh my golly gosh yes. The number of hours I've burned doing this all by hand is beyond silly. We can maybe save others some of the pain :-) Cheers, Ian.
Hey Forest. Way late I know - I've added your examples into the doc - many thanks! Ian.
Really interesting read!
Don't know if you are looking to list specific data cleaning tools, but we've built a few that are useful in our own work.
https://github.com/datamade/dedupe
https://github.com/datamade/usaddress
https://github.com/datamade/probablepeople
https://github.com/datamade/parserator
For making reproducible data workflows, we also use Make: https://github.com/datamade/data-making-guidelines
Would be interested to hear how you structure your data steps.