move-coop / parsons

A python library of connectors for the progressive community.
https://www.parsonsproject.org/
Other
259 stars 131 forks source link

[Bug] Parsons should fail gracefully when add_column logic throws exceptions #817

Open talevy42 opened 1 year ago

talevy42 commented 1 year ago

Detailed Description

When you call add_column and pass in a function that may throw an exception on edge case logic, the column is not evaluated until the Table is materialized, utilizing lazy loading. This is fine, but it causes issues if you try to reuse or correct the column's logic, as now any new operation on the Table, including attempts to remove and re-add the failing column, throw that exception. Because of the lazy loading, the Parsons table believes it has successfully added the column.

Ideally, when adding a particular column throws an exception, Parsons should fail gracefully and simply not add the column, or else remove the column before throwing the exception.

To Reproduce

tb = parsons.Table([{"data": 1}, {"data": 2}, {"data": 3}])
def conditional_error(x):
    if x.data == 2:
        raise ValueError("Data can't be 2")
    return "Good"
tb.add_column("new_column", conditional_error)
tb.remove_column("new_column") #This will always fail

Your Environment

Jason94 commented 1 year ago

Thanks for the issue! This does seem strange. A very basic example works as expected:

table = Table([{'x': 1}, {'x': 0}, {'x': 2}])
table.add_column('halfx', lambda r: 2 / r['x'])
table.materialize() # Should raise a ZeroDivisionError

But it does seem like lazily deferring the calculation should be done in a smarter way. In particular things like removing, renaming, etc the column that don't require doing the calculation shouldn't raise the error.

Here is another example that I find counterintuitive. I'd expect that after my except block, the table would have reverted to the original state it was in, like you suggested. In fact the final print statement errors again:

from parsons import Table

table = Table([{"x": 1}, {"x": 0}, {"x": 2}])
print(table)

try:
    table.add_column("two_div_x", lambda r: 2 / r["x"])
    table.materialize()
except Exception as e:
    print(str(e))

print(table) # This raises an error, even though it seems like it shouldn't
sduveen-pivot commented 11 months ago

The problem is that petl creates an onion of transformations, so even when remove_column runs cut() the cut only renders a result that stops showing (but apparently, still runs/renders) the "removed" columns.

In order to fix this, we would have to pretty much rewrite at least a LOT of class ETL to stop the onion-ifying, which might not even be backwards compatible. Even if we tried to be as clinical as possible and only rewrote, say add_column, remove_column, fill_column, and a couple others, subsequent or earlier onion wrappers could lose coherence -- e.g. if I do an add_column then a move_column and then a remove_column, suddenly the move-transformation would fail because we truly removed the column.

I suppose something could 'traverse the onion' and look for all transformations that refer to a specific column, but then we would essentially be implementing insert/delete functions on a linked-list (e.g. traversing we'd see that the MoveFieldView.field relates to a column we're removing, and so then have to 'move' the MoveFieldView.table to the .table value of the subsequent/higher-up wrapper View.

If we want to be clinical, we could allow fixing only right-after-add_column, and then remove_column would have some logic that tests if self.table is of type AddFieldView with a .field == column_to_remove then we just pop self.table to self.table.source.sources (lol, because it's double-wrapped in a StackView())

schuyler1d commented 11 months ago

(we could also consider adding an undo() command that pops the last transformation after the last materialize() was called)