rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

Tools and Workflows for Repeatable Sharable Data Cleaning / ETL / Processing #58

Open rufuspollock opened 11 years ago

rufuspollock commented 11 years ago

rossjones commented 11 years ago

I'm building ScraperWikiX at https://github.com/rossjones/ScraperWikiX/

psychemedia commented 11 years ago

OpenRefine recipes/vignettes, especially for "standardised" data formats? E.g. http://schoolofdata.org/2013/07/26/using-openrefine-to-clean-multiple-documents-in-the-same-way/

webysther commented 9 years ago

https://github.com/OpenRefine/OpenRefine

chrismattmann commented 9 years ago

Apache OODT? http://oodt.apache.org/ Check out DRAT (Distributed Release Audit Tool) as an example of OODT ETL in action: http://github.com/chrismattmann/drat.git

lexman commented 9 years ago

tuttle is also a tool for repeatable workflows that works well with team collaboration and continuous integration (for example, Jenkins updating data every hour).