Open AndreiPashkin opened 2 months ago
This has nothing to do with joblib or multiprocessing
import polars as pl

df = pl.DataFrame(pl.Series("a", ["1", "2", "3"], dtype=pl.Categorical))
df.filter(pl.col("a") == pl.lit("1", dtype=pl.Categorical))
The error indicates that the literal comes from a different string cache than the original column (see https://docs.pola.rs/user-guide/concepts/data-types/categoricals/).
I suppose we could cast literals to the correct string cache, as we already do for strings. That makes sense to me.
For now, removing the dtype on the literal should work: pl.lit(game_id)
@c-peters, for learning purposes, could you please point me to the place where the right StringCache is fetched, given a string literal?
Thanks.
Ignore my comment, I did not notice you enabled the string cache globally. I am not too familiar with joblib, but I assume that joblib starts an entirely separate process, which receives a new cache. Setting the backend to threading should work.
@c-peters, I was thinking that it is because Polars does not play very well with forking. Are you sure that is not the case? When I set joblib's backend to threading, it indeed works.
If you fork you start completely new processes and a global string cache cannot be global between processes. This is a bug in your query/usage.
Yes, that's true. I just wanted to confirm that.
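The point about forking can be illustrated with a toy sketch using only the standard library. The dict CACHE and the helper register are hypothetical stand-ins for a process-global cache. They are not how Polars implements its string cache.

```python
import multiprocessing as mp
import threading

# Toy stand-in for a process-global cache (NOT Polars' real StringCache).
CACHE = {}


def register(value):
    """Assign the next integer code to `value` in this process's CACHE."""
    CACHE.setdefault(value, len(CACHE))


# A thread shares the parent's memory, so its write is visible afterwards.
t = threading.Thread(target=register, args=("from_thread",))
t.start()
t.join()

# A forked child works on its own copy of CACHE, so its write never
# reaches the parent. The "fork" start method is POSIX-only.
ctx = mp.get_context("fork")
p = ctx.Process(target=register, args=("from_process",))
p.start()
p.join()

print(CACHE)  # only the entry written by the thread is present
```

This is why the threading backend behaves differently from a process-based one: threads see a single cache, while each worker process builds its own.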
Checks
Reproducible example
Log output
Issue description
It seems like Polars has a problem with maintaining the internal representation of Categorical values, either due to multi-processing or due to how joblib serializes and deserializes data. Or maybe I'm just doing something wrong :)
p.s. Switching to Enum from Categorical makes this problem go away.
Expected behavior
I expect no exception to be raised.
Installed versions