turicas opened this issue 9 years ago
Wow, so many questions. Here are my 2¢ regarding laziness:
It may be the haskeller in me talking, but I really like laziness, mostly because (1) you keep in memory only what you actually need, and (2) unused "broken" data does not break the whole operation. Reason 2 can be important for rows
users if they are handling data from multiple, possibly unstructured sources, such as via web scraping.
Laziness can be combined with custom methods for `filter()`, `map()` and the like, allowing for a memory-efficient abstraction with the possibility of optimizations like (assuming absence of side effects): `t.map(foo).map(bar)` -> `t.map(lambda x: bar(foo(x)))`. That is, the map operations are fused and no intermediate structure is needed (this saves memory and may improve performance regarding cache misses and branch prediction).
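The fusion idea above can be sketched with a small hypothetical class (this is not the rows API; `LazySeq` and its attributes are illustrative only):

```python
# Hypothetical sketch (not the rows API): a lazy sequence whose map()
# fuses consecutive functions instead of building intermediate lists.

class LazySeq:
    def __init__(self, source, fn=None):
        self._source = source  # any iterable
        self._fn = fn          # fused transformation, or None

    def map(self, fn):
        if self._fn is None:
            return LazySeq(self._source, fn)
        prev = self._fn
        # Fuse: one pass later applies fn(prev(x)), with no intermediate list.
        return LazySeq(self._source, lambda x: fn(prev(x)))

    def __iter__(self):
        for item in self._source:
            yield item if self._fn is None else self._fn(item)

foo = lambda x: x + 1
bar = lambda x: x * 2
result = list(LazySeq(range(5)).map(foo).map(bar))
print(result)  # [2, 4, 6, 8, 10]
```

Note that each `map` returns a new object and nothing is computed until iteration, which also matches the "lazy and immutable" point below.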
However, it does not make much sense to have a lazy mutable structure, so IMO you can have either laziness or A.4. In any case, if `Table` is lazy and immutable, changes would be made by creating a new `Table` anyway, using `filter`, `map`, `reduce`, `flatmap`, etc.
A.6) If you want to update your CSV in place, why use CSV at all? I think it makes much more sense to export it to some sort of store (SQLite etc.), operate over it and then, if needed, save a new CSV file.
Sure, if the data is too big it can be expensive to do all this importing and exporting, but I think it beats having to deal with:
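The export-to-store workflow suggested above can be sketched with the standard library alone (the table name, columns and in-memory stand-ins for the CSV files are illustrative):

```python
# Sketch of the suggested workflow (stdlib only, names are illustrative):
# import the CSV into SQLite, update there, then export a new CSV.
import csv
import io
import sqlite3

src = io.StringIO("name,age\nalice,30\nbob,25\n")  # stands in for input.csv

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 ((r["name"], int(r["age"])) for r in csv.DictReader(src)))

# The "update in place" happens here, where it is cheap and transactional.
conn.execute("UPDATE people SET age = age + 1 WHERE name = 'alice'")

out = io.StringIO()  # stands in for output.csv
writer = csv.writer(out)
writer.writerow(["name", "age"])
writer.writerows(conn.execute("SELECT name, age FROM people ORDER BY name"))
print(out.getvalue())
```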
@etandel, I'm not sure if laziness should be the default or if we should use it only in special cases (like when you want to import a huge CSV or export it to a database). Anyway, Brett Slatkin's PyCon 2015 talk "How to Be More Effective with Functions" may give us some insights.
Design issues
Some decisions need to be made before we declare the API stable. We can put here all the questions for discussion (we should answer these questions as soon as possible, since they impact the current implementation and would cause rework if delayed).
(A) About `rows.Table`

- A.1) Should `rows.Table` be always lazy? Always not lazy? Support both? What are the implications? If it's lazy, how to deal with deletion and addition of rows?
- A.2) Suppose we have a `rows.Table` with many rows but want to filter some rows. Should we provide a special method for this or use Python's built-in `filter`? Using Python's built-in `filter` would be the more Pythonic way, but we can optimize some operations on certain plugins if we provide a special method (example: filtering on a MySQL-based `Table`).
- A.3) Filtering during the import of a `rows.Table`, like in question A.2: it's a filter to be executed during the importation process, so we import only some rows.
- A.4) Lazily transforming a `Table`: the user can specify a custom function that will receive a `Table.Row` object and return a new one (which would be returned when iterating over the `Table`). This way we can handle the addition of new fields and other custom operations on the fly. How should we expose this API? This implementation may solve the problem in question A.3.
- A.5) The row type is currently a `collections.namedtuple`. What is the best API to change it? Should the default be another one? If we want an object with read-write access and also value access via attributes, AttrDict would be a good option. Should we add metadata to the row instance, like its index in that `Table`? See `sqlite3.Row` and other Python DBAPI implementations.
- A.6) `rows`' current architecture is good for importing and exporting data but is not well suited for working with that data. One of the key facts is that we cannot create a `Table` from a CSV, change some rows' values and save it to the same CSV without doing a batch operation. Should we implement read-write access? It can add a lot of complication to the implementation (not only the `Table` itself but also the plugins), since we'll need to deal with problems like seeking through the rows and saving/flushing partial data (not the entire set), among other problems.
- A.7) Since people use `rows` to import-and-export data, it'd be handy to have a shortcut (and maybe some optimizations) for it. If the entire `Table` is lazy we may not need this shortcut, because we can iterate over one `Table` (in a lazy way) at the same time we're saving into another.
- A.8) Should `Table` implement `__add__` (so, for example, `sum([table1, table2, ..., tableN])` will return another `Table` with all the rows -- but only if all tables' types are the same)? What metadata should remain?

(B) About `rows.fields`
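For context, this is the kind of per-column type detection `rows.fields` is concerned with; the sketch below is hypothetical and simplified, not the actual rows API (`detect_type` and the use of plain Python types are assumptions for illustration):

```python
# Hypothetical sketch of per-column type detection, not the rows API:
# given a column of strings, find the narrowest type they all fit into.

def detect_type(values):
    """Return int, float or str, trying the narrowest type first."""
    def all_fit(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_fit(int):
        return int
    if all_fit(float):
        return float
    return str

column = ["1", "2.5", "3"]
print(detect_type(column))  # <class 'float'>
```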
(C) About Plugins
- C.1) Plugins could inherit from `rows.Table` and implement only the methods needed to access data (everything else should be handled by `rows.Table`). This way we can optimize operations like `__len__`, `__reverse__` and others. These magic methods may be implemented only on `rows.Table` and not overwritten (the plugin class would create a custom method that `rows.Table` will call for each operation) -- we need to specify these methods' API.
- C.2) Which formats should plugins cover? `text`, `json`, `csv`, `sqlite`, `xls`, `html`, `ods`. See graphlab's connectors and tablib's supported extensions.
- C.3) Should plugins have access to `Table.__rows`? What can plugins do (and not do) with it? What is the expected behaviour?
- C.4) `Table.meta`, with metadata about that `Table`. For example: plugin data, if the `Table` was generated by a plugin (example: the `csv` plugin could store the actual CSV filename, encoding and so on).

(D) About CLI
- `--query` (to query using SQL -- same as import-and-filter)?

(E) Other
- A single file may contain more than one table (an issue beyond `rows.Table` itself): an HTML file could contain more than one `<table>`. See how tablib deals with it.
- `detect_types`.
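On the multiple-`<table>` question above, the ambiguity is easy to demonstrate with the standard library (the `TableCounter` class is illustrative, not how rows or tablib solve this; it only suggests an import API would need an index or name parameter to pick one table):

```python
# Minimal stdlib sketch: count <table> elements in an HTML document,
# showing why "import the table from this HTML file" is ambiguous.
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1

document = "<html><body><table></table><table></table></body></html>"
parser = TableCounter()
parser.feed(document)
print(parser.count)  # 2
```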