turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

Design issues #31

Open turicas opened 9 years ago

turicas commented 9 years ago

Design issues

Some decisions need to be made before we declare the API stable. We can collect all the open questions here for discussion (we should answer them as soon as possible, since they affect the current implementation and delaying them would cause rework).

(A) About rows.Table

etandel commented 8 years ago

Wow, so many questions. Here are my 2¢ regarding laziness:

It may be the Haskeller in me talking, but I really like laziness, mostly because 1) you keep in memory only what you actually need and 2) unused "broken" data does not break the whole operation. Reason 2 can be important for rows users if they are handling data from multiple, possibly unstructured sources, such as web scraping.
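To make the two reasons concrete, here is a minimal sketch using plain Python generators. It does not use rows at all; `parse_rows` and the sample data are hypothetical names made up for illustration.

```python
from itertools import islice


def parse_rows(lines):
    # Generator: each line is parsed only when a consumer asks for it,
    # so at most one parsed row is held by this function at a time.
    for line in lines:
        name, age = line.split(",")
        yield name, int(age)


# Hypothetical scraped data; the third line cannot be converted to int.
raw = ["alice,30", "bob,25", "carol,not-a-number"]

# Only the first two rows are requested, so the broken third line
# is never parsed and never raises.
first_two = list(islice(parse_rows(raw), 2))
```

Consuming everything eagerly, for example with `list(parse_rows(raw))`, would raise `ValueError` on the third line.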

Laziness can be combined with custom methods for filter(), map() and the like, allowing for a memory-efficient abstraction with room for optimizations such as (assuming no side effects): t.map(foo).map(bar) -> t.map(lambda x: bar(foo(x))). That is, the map operations are fused into one and no intermediate structure is needed (this saves memory and may improve performance by reducing cache misses and branch mispredictions).

However, it does not make much sense to have a lazy mutable structure, so IMO you can either have laziness or A.4. In any case, if Table is lazy and immutable, changes would be made by creating a new Table anyway using filter, map, reduce, flatmap etc.

etandel commented 8 years ago

A.6) If you want to update your CSV in place, why use CSV at all? I think it makes much more sense to export it to some sort of store (sqlite etc.), operate on it there and then, if needed, save a new CSV file.
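Something along these lines, sketched with only the standard library (`csv`, `sqlite3`) rather than rows. The `people` table, its `name`/`age` columns and the function name are assumptions made up for the example.

```python
import csv
import io
import sqlite3


def update_csv_via_sqlite(csv_text: str) -> str:
    """Load CSV into SQLite, update it there, and write a new CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumed to be: name, age

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    # The INTEGER column affinity converts numeric strings like "25" to 25.
    conn.executemany("INSERT INTO people VALUES (?, ?)", reader)

    # The "in place" update happens in the store, not in the CSV file.
    conn.execute("UPDATE people SET age = age + 1 WHERE name = ?", ("bob",))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(
        conn.execute("SELECT name, age FROM people ORDER BY rowid")
    )
    conn.close()
    return out.getvalue()
```

For real files, the `io.StringIO` objects would be replaced by files opened with `newline=""`, and `":memory:"` by a database path.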

Sure, if the data is too big it can be expensive to do all this importing-exporting, but I think it beats having to deal with:

turicas commented 8 years ago

@etandel, I'm not sure whether laziness should be the default or whether we should use it only in special cases (like when you want to import a huge CSV or export it to a database). Anyway, Brett Slatkin's talk "How to Be More Effective with Functions" (PyCon 2015) may give us some insights.