okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

Proposal: Messytables 2 #142

Open pudo opened 9 years ago

pudo commented 9 years ago

messytables just turned 4 years old, and I'm getting the sense that it could use a major overhaul to make sure it doesn't turn into a messy thing itself. While @pwalsh proposed starting a clean library (https://github.com/okfn/datatable-py/issues/1), I think we should instead do a breaking update. This should incorporate some lessons learned:

pwalsh commented 9 years ago

As we discussed over IRC, I'm in agreement, as long as we are very careful to have a clean API for "Data Table Iteration" that is not mixed with any other magic that messytables may or may not do.

So, copying my notes from here, I'd like to see:

pudo commented 9 years ago

I've started working on a general clean-up branch at https://github.com/okfn/messytables/tree/cleanup-mt2. Since we'll break compatibility on some of the API anyway, it seems like a valid idea to remove some left-overs.

I also want to adopt an approach where we rely on external libraries (like six) rather than building our own.

turicas commented 9 years ago

Hello, guys. I've started working on a library that implements these requirements (though the focus is a bit different): turicas/rows. I'm now working on a complete API rewrite so it'll be very simple and easy to use yet powerful (like automatically identifying field types and converting them). We may share some work between the two libraries. ;-)

jqnatividad commented 9 years ago

+1. CKAN Datapusher uses messytables and, unfortunately, it periodically produces literal messytables in the datastore (pardon the pun).

Having the ability to cast against a JTS/JSON Schema would be nice. Perhaps this is a more pragmatic way to implement https://github.com/ckan/ideas-and-roadmap/issues/150 and address the datapusher issues that CKAN implementations encounter, which are rooted in messytables guessing data types incorrectly.

Maybe on the first pass, the guessed schema can be presented to the CKAN user, leveraging the existing ...as_JTS methods, and the user can optionally override the JTS datatype guesses. The proposed MessyTables 2 JTS-driven casting could then insert the dataset into the CKAN datastore as a proper table with the right datatypes.
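
To make the guess, override, then cast flow concrete, here is a minimal sketch using only the Python standard library. It is not the messytables API: the helper names (`guess_schema`, `apply_overrides`, `cast_rows`) and the `CASTERS` table are hypothetical, and the schema is only a simplified dict loosely shaped like a JSON Table Schema `fields` list.

```python
import csv
import io

# Hypothetical casters keyed by JSON Table Schema style type names.
CASTERS = {"integer": int, "number": float, "string": str}


def guess_type(values):
    """Return the first type name that every non-empty value parses as."""
    for name in ("integer", "number"):
        try:
            for value in values:
                if value != "":
                    CASTERS[name](value)
            return name
        except ValueError:
            continue
    return "string"


def guess_schema(rows, headers):
    """Guess one type per column from the sampled rows."""
    columns = list(zip(*rows))
    return {
        "fields": [
            {"name": header, "type": guess_type(column)}
            for header, column in zip(headers, columns)
        ]
    }


def apply_overrides(schema, overrides):
    """Return a new schema where user-chosen types replace the guesses."""
    return {
        "fields": [
            {"name": f["name"], "type": overrides.get(f["name"], f["type"])}
            for f in schema["fields"]
        ]
    }


def cast_rows(rows, schema):
    """Cast every cell according to the schema; empty cells become None."""
    casters = [CASTERS[f["type"]] for f in schema["fields"]]
    for row in rows:
        yield [cast(v) if v != "" else None for cast, v in zip(casters, row)]


raw = "id,zip,price\n1,02134,9.5\n2,10001,12\n"
reader = csv.reader(io.StringIO(raw))
headers = next(reader)
rows = list(reader)

# The guesser sees only digits in "zip" and calls it an integer,
# which would drop the leading zero on insert.
guessed = guess_schema(rows, headers)

# The user corrects that one guess before the data is cast.
schema = apply_overrides(guessed, {"zip": "string"})
typed = list(cast_rows(rows, schema))
```

The `zip` column shows the kind of wrong guess described above: every value parses as an integer, so a purely automatic pass would store `2134` instead of `02134`. Letting the user override the guessed type before casting keeps the original string.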