ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org

Can't pickle TableExpr in 3.0.2 #3914

Closed saschahofmann closed 2 years ago

saschahofmann commented 2 years ago

We are caching table expressions in Redis, but table.op().source (the table's backend connection) isn't pickleable. Before 3.0.2 we were able to set it to None

tbl.op().source = None

and restore it when fetching from the cache.

Now the table op (inheriting from Annotable) is immutable and I can't set that attribute:

TypeError: Attribute 'source' cannot be assigned to immutable instance of type <class 'ibis_bigquery.client.BigQueryTable'>

Can I maybe duplicate the instance and change only that one property?

saschahofmann commented 2 years ago

I now cache the schema only and construct the table expr from cached schemas. I'd still be curious whether there is a way to achieve the above?

gforsyth commented 2 years ago

Hey @saschahofmann -- I need to set up a BigQuery account so I can test this myself, but does deleting the source attribute get the same thing done?

del tbl.op().source
pickle.dumps(tbl)

If not, I think this can be worked around by defining a custom Pickler subclass and using its reducer_override hook to skip over the source attribute.
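
A minimal sketch of that approach, assuming BigQueryTable exposes its name and schema as attributes (as its constructor keywords suggest) and that the hypothetical helper below lives at module level so it is importable at unpickle time:

import io
import pickle

from ibis_bigquery.client import BigQueryTable


def _rebuild_without_source(name, schema):
    # Hypothetical helper: recreate the op without the live connection,
    # so a real connection can be reattached after unpickling.
    return BigQueryTable(name=name, schema=schema, source=None)


class SourcelessPickler(pickle.Pickler):
    def reducer_override(self, obj):
        # Pickle the table op without its backend connection.
        if isinstance(obj, BigQueryTable):
            return _rebuild_without_source, (obj.name, obj.schema)
        # Defer to the default pickling behaviour for everything else.
        return NotImplemented


def dumps_without_source(expr):
    buf = io.BytesIO()
    SourcelessPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(expr)
    return buf.getvalue()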

saschahofmann commented 2 years ago

Hm, it seems like pickle.dumps is actually working. The error with pickling happens when I try to cache the table in Redis. I am trying to find out exactly which call is causing it.

saschahofmann commented 2 years ago

I am also struggling to reproduce the original error locally, but the error was definitely that the source wasn't serializable.

How would I recover the BigQueryTable with a connection, since I can't set the source on it?

Right now I am creating it like this, with the schema coming from the cache:

tbl = TableExpr(
    BigQueryTable(
        name=f"{settings.GCP_PROJECT}.{table.bq_dataset}.{table.bq_table}",
        schema=schema,
        source=conn,
    )
)

saschahofmann commented 2 years ago

I would also assume that this happens for any backend, not only for BigQuery.

cpcloud commented 2 years ago

There are a couple options to consider:

  1. You can always call object.__setattr__(op, "source", new_source). This is NOT recommended, but extremely expedient. FYI, source is not included in the __hash__ computation, for reasons similar to those for why source cannot be pickled. (A round-trip sketch for this option follows the example below.)
  2. You can cache UnboundTables and always run execute by calling it as a method on the Backend instance as opposed to on the Expr instance. This may or may not be viable for you. If you can do this, I would recommend it. Example:

    con = ibis.bigquery.connect(...)
    t = ibis.table(dict(a="int64"))
    con.execute(t)
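
For option 1, a minimal round-trip sketch, assuming a Redis client named r and an existing table expression tbl (both placeholders); object.__setattr__ bypasses the immutability check, so this is expedient rather than recommended:

import pickle

op = tbl.op()
conn = op.source

# Detach the unpicklable connection before caching.
object.__setattr__(op, "source", None)
r.set("cached_table", pickle.dumps(tbl))

# Later, after fetching from Redis, reattach the live connection.
cached = pickle.loads(r.get("cached_table"))
object.__setattr__(cached.op(), "source", conn)
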
saschahofmann commented 2 years ago

Thanks @cpcloud! Option 2 is not really viable for us, and I think I will stick to caching the schema for now, since the schema is the main reason we cache the table anyway. I can then recreate the table as mentioned above!

Just out of curiosity: why is the table object immutable? Maybe my use case is too niche, but it'd be nice to have an easy way to create a new object from an existing one with different kwargs?

cpcloud commented 2 years ago

Just out of curiosity: why is the table object immutable?

The main reason is to allow operations to be hashable. We use dictionaries whose keys are ops.Node instances in many places. One important way in which we use them is to avoid unnecessary (re)computation.

saschahofmann commented 2 years ago

yeah ok gotcha. Closing

cpcloud commented 2 years ago

Thanks @saschahofmann, really appreciate your feedback!