turicas opened this issue 9 years ago
Wow, so many questions. Here are my 2¢ regarding laziness:
It may be the haskeller in me talking, but I really like laziness, mostly because (1) you keep in memory only what you actually need, and (2) unused "broken" data does not break the whole operation. Reason 2 can be important for rows
users if they are handling data from multiple, possibly unstructured sources, such as via web scraping.
Laziness can be combined with custom methods for `filter()`, `map()` and the like, allowing for a memory-efficient abstraction with the possibility of optimizations like (assuming absence of side effects): `t.map(foo).map(bar)` -> `t.map(lambda x: bar(foo(x)))`. That is, the map operations are fused and no intermediate structure is needed (this saves memory and may improve performance regarding cache misses and branch prediction).
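The fusion idea above can be sketched with a small hypothetical class (this is not the rows API; `LazySeq` and its attributes are illustrative only):

```python
# Hypothetical sketch (not the rows API): a lazy sequence whose map()
# fuses consecutive functions instead of building intermediate lists.

class LazySeq:
    def __init__(self, source, fn=None):
        self._source = source  # any iterable
        self._fn = fn          # fused transformation, or None

    def map(self, fn):
        if self._fn is None:
            return LazySeq(self._source, fn)
        prev = self._fn
        # Fuse: one pass later applies fn(prev(x)), with no intermediate list.
        return LazySeq(self._source, lambda x: fn(prev(x)))

    def __iter__(self):
        for item in self._source:
            yield item if self._fn is None else self._fn(item)

foo = lambda x: x + 1
bar = lambda x: x * 2
result = list(LazySeq(range(5)).map(foo).map(bar))
print(result)  # [2, 4, 6, 8, 10]
```

Note that each `map` returns a new object and nothing is computed until iteration, which also matches the "lazy and immutable" point below.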
However, it does not make much sense to have a lazy mutable structure, so IMO you can have either laziness or A.4. In any case, if `Table` is lazy and immutable, changes would be made by creating a new `Table` anyway, using `filter`, `map`, `reduce`, `flatmap`, etc.
A.6) If you want to update your CSV in place, why use CSV at all? I think it makes much more sense to export it to some sort of store (SQLite etc.), operate over it and then, if needed, save a new CSV file.
Sure, if the data is too big it can be expensive to do all this importing and exporting, but I think it beats having to deal with:
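The export-to-store workflow suggested above can be sketched with the standard library alone (the table name, columns and in-memory stand-ins for the CSV files are illustrative):

```python
# Sketch of the suggested workflow (stdlib only, names are illustrative):
# import the CSV into SQLite, update there, then export a new CSV.
import csv
import io
import sqlite3

src = io.StringIO("name,age\nalice,30\nbob,25\n")  # stands in for input.csv

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 ((r["name"], int(r["age"])) for r in csv.DictReader(src)))

# The "update in place" happens here, where it is cheap and transactional.
conn.execute("UPDATE people SET age = age + 1 WHERE name = 'alice'")

out = io.StringIO()  # stands in for output.csv
writer = csv.writer(out)
writer.writerow(["name", "age"])
writer.writerows(conn.execute("SELECT name, age FROM people ORDER BY name"))
print(out.getvalue())
```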
@etandel, I'm not sure if laziness should be the default or if we should use it only in special cases (like when you want to import a huge CSV or export it to a database). Anyway, Brett Slatkin's PyCon 2015 talk "How to Be More Effective with Functions" may give us some insights.
Design issues
Some decisions need to be made before we declare the API stable. We can put here all the questions for discussion (we should answer these questions as soon as possible, since they impact the current implementation and would cause rework if delayed).
(A) About `rows.Table`

- A.1) Should `rows.Table` be always lazy? Always not lazy? Support both? What are the implications? If it's lazy, how to deal with deletion and addition of rows?
- A.2) Suppose we have a `rows.Table` with many rows but want to filter some rows. Should we provide a special method for this or use Python's built-in `filter`? Using Python's built-in `filter` would be the more Pythonic way, but we can optimize some operations on certain plugins if we provide a special method (example: filtering on a MySQL-based `Table`).
- A.3) Filtering during the import of a `rows.Table`, like in question A.2: it's a filter to be executed during the importation process, so we import only some rows.
- A.4) Lazily transforming a `Table`: the user can specify a custom function that will receive a `Table.Row` object and return a new one (which would be returned when iterating over the `Table`). This way we can handle the addition of new fields and other custom operations on the fly. How should we expose this API? This implementation may solve the problem in question A.3.
- A.5) The row type is currently a `collections.namedtuple`. What is the best API to change it? Should the default be another one? If we want an object with read-write access and also value access via attributes, AttrDict would be a good option. Should we add metadata to the row instance, like its index in that `Table`? See `sqlite3.Row` and other Python DBAPI implementations.
- A.6) `rows`' current architecture is good for importing and exporting data but is not well suited for working with that data. One of the key facts is that we cannot create a `Table` from a CSV, change some rows' values and save it to the same CSV without doing a batch operation. Should we implement read-write access? It can add a lot of complication to the implementation (not only the `Table` itself but also the plugins), since we'll need to deal with problems like seeking through the rows and saving/flushing partial data (not the entire set), among other problems.
- A.7) Since people use `rows` to import-and-export data, it'd be handy to have a shortcut (and maybe some optimizations) for it. If the entire `Table` is lazy we may not need this shortcut, because we can iterate over one `Table` (in a lazy way) at the same time we're saving into another.
- A.8) Should `Table` implement `__add__` (so, for example, `sum([table1, table2, ..., tableN])` will return another `Table` with all the rows -- but only if all tables' types are the same)? What metadata should remain?

(B) About `rows.fields`
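For context, this is the kind of per-column type detection `rows.fields` is concerned with; the sketch below is hypothetical and simplified, not the actual rows API (`detect_type` and the use of plain Python types are assumptions for illustration):

```python
# Hypothetical sketch of per-column type detection, not the rows API:
# given a column of strings, find the narrowest type they all fit into.

def detect_type(values):
    """Return int, float or str, trying the narrowest type first."""
    def all_fit(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_fit(int):
        return int
    if all_fit(float):
        return float
    return str

column = ["1", "2.5", "3"]
print(detect_type(column))  # <class 'float'>
```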
(C) About Plugins
- C.1) Plugins could inherit from `rows.Table` and implement only the methods needed to access data (everything else should be handled by `rows.Table`). This way we can optimize operations like `__len__`, `__reverse__` and others. These magic methods may be implemented only on `rows.Table` and not overwritten (the plugin class would create a custom method that `rows.Table` will call for each operation) -- we need to specify these methods' API.
- C.2) Which formats should plugins cover? `text`, `json`, `csv`, `sqlite`, `xls`, `html`, `ods`. See graphlab's connectors and tablib's supported extensions.
- C.3) Should plugins have access to `Table.__rows`? What can plugins do (and not do) with it? What is the expected behaviour?
- C.4) `Table.meta`, with metadata about that `Table`. For example: plugin data, if the `Table` was generated by a plugin (example: the `csv` plugin could store the actual CSV filename, encoding and so on).

(D) About CLI
- `--query` (to query using SQL -- same as import-and-filter)?

(E) Other
- A single file may contain more than one table (an issue beyond `rows.Table` itself): an HTML file could contain more than one `<table>`. See how tablib deals with it.
- `detect_types`.
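On the multiple-`<table>` question above, the ambiguity is easy to demonstrate with the standard library (the `TableCounter` class is illustrative, not how rows or tablib solve this; it only suggests an import API would need an index or name parameter to pick one table):

```python
# Minimal stdlib sketch: count <table> elements in an HTML document,
# showing why "import the table from this HTML file" is ambiguous.
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1

document = "<html><body><table></table><table></table></body></html>"
parser = TableCounter()
parser.feed(document)
print(parser.count)  # 2
```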