Open douglas-raillard-arm opened 1 month ago
Can you show the query? You might need to add first().
The query is not trivial as it involves multiple functions but the dumped JSON for v1.7.0 is:
import io
import polars as pl
plan = r'''{"IR":{"version":1428,"dsl":{"MapFunction":{"input":{"HStack":{"input":{"IR":{"version":1351,"dsl":{"HStack":{"input":{"MapFunction":{"input":{"Select":{"expr":[{"Column":"__timestamp"},{"Function":{"input":[{"Function":{"input":[{"Column":"line"},{"Literal":"Null"}],"function":{"StringExpr":"StripChars"},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"function":{"StringExpr":{"ExtractGroups":{"dtype":{"Struct":[{"name":"data","dtype":"String"},{"name":"event","dtype":"String"}]},"pat":"(?:[\\w@]+):? *(?:data=(?P<data>.+?)(?: +|$)|event=(?P<event>.+?)(?: +|$)|\\w+=\\S+?(?: +|$))+"}}},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"input":{"Filter":{"input":{"MapFunction":{"input":{"Select":{"expr":[{"Column":"__event"},{"Column":"__fields"},{"Column":"__timestamp"},{"Column":"line"}],"input":{"IR":{"version":1269,"dsl":{"HStack":{"input":{"Select":{"expr":[{"Column":"__timestamp"},{"Column":"__event"},{"Column":"__fields"},{"Column":"line"}],"input":{"HStack":{"input":{"DataFrameScan":{"df":{"columns":[{"name":"__event","datatype":"Binary","bit_settings":"","values":[[114,116,97,112,112,95,109,97,105,110],[114,116,97,112,112,95,109,97,105,110],[114,116,97,112,112,95,109,97,105,110]]},{"name":"__fields","datatype":"Binary","bit_settings":"","values":[[101,118,101,110,116,61,115,116,97,114,116],[101,118,101,110,116,61,99,108,111,99,107,95,114,101,102,32,100,97,116,97,61,52,55,49,51,50,52,56,54,48],[101,118,101,110,116,61,101,110,100]]},{"name":"__timestamp","datatype":"Int64","bit_settings":"","values":[471410977940,471412970020,472920141960]},{"name":"line","datatype":"Binary","bit_settings":"","values":[[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,115,116,97,114,116,10],[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,99,108,111,99,107,95,114,101,102,32,100,97,116,97,61,52,55,49,51,50,52,56,54,48
,10],[114,116,97,112,112,95,109,97,105,110,58,32,101,118,101,110,116,61,101,110,100,10]]}]},"schema":{"fields":{"__event":"Binary","__fields":"Binary","__timestamp":"Int64","line":"Binary"}},"output_schema":null,"filter":null}},"exprs":[{"Cast":{"expr":{"DtypeColumn":["Binary"]},"dtype":"String","options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"exprs":[{"Cast":{"expr":{"Column":"__event"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Drop":{"to_drop":[{"Root":{"Column":"__fields"}}],"strict":false}}}},"predicate":{"BinaryExpr":{"left":{"Column":"__event"},"op":"Eq","right":{"Literal":{"String":"rtapp_main"}}}}}},"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Unnest":[{"Root":{"Column":"line"}}]}}},"exprs":[{"Function":{"input":[{"DtypeColumn":["String"]},{"Literal":"Null"}],"function":{"StringExpr":"StripChars"},"options":{"collect_groups":"ElementWise","fmt_str":"","check_lengths":true,"flags":"ALLOW_GROUP_AWARE"}}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}}}},"exprs":[{"Cast":{"expr":{"Column":"__timestamp"},"dtype":"Int64","options":"Strict"}},{"Cast":{"expr":{"Column":"data"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}},{"Cast":{"expr":{"Column":"event"},"dtype":{"Categorical":[null,"Physical"]},"options":"Strict"}}],"options":{"run_parallel":true,"duplicate_check":true,"should_broadcast":true}}},"function":{"Rename":{"existing":["__timestamp"],"new":["Time"]}}}}}}'''
plan = io.StringIO(plan)
df = pl.LazyFrame.deserialize(plan, format='json')
print(df)
print(df.collect())
The code leading to this lazyframe is either:
EDIT: the LazyFrame pretty printed is:
RENAME
WITH_COLUMNS:
[col("data").strict_cast(Categorical(None, Physical)), col("event").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
[col("data").str.strip_chars([null]), col("event").str.strip_chars([null]), col("__timestamp").strict_cast(Int64)]
UNNEST by:[line]
SELECT [col("__timestamp"), col("line").str.strip_chars([null]).str.extract_groups()] FROM
simple π 3/3 ["__event", "__timestamp", ... 1 other column]
FILTER [(col("__event")) == (String(rtapp_main))] FROM
WITH_COLUMNS:
[col("__event").strict_cast(Categorical(None, Physical))]
simple π 3/3 ["__timestamp", "__event", ... 1 other column]
WITH_COLUMNS:
[col("__event").strict_cast(String), col("line").strict_cast(String)]
DF ["__event", "__fields", "__timestamp", "line"]; PROJECT 3/4 COLUMNS; SELECTION: None
EDIT 2: I'm rebuilding to get an up-to-date JSON for v1.8.
EDIT 3: the JSON IR is now refreshed and I updated the issue description to include it, rather than just pointing at the old one in the old issue.
Part of the problem here is that the error happens very late, when collecting, at which point location information is completely lost. Is there a way to make polars validate at every step, as some kind of debug mode?
EDIT: the other part of the problem is that the issue is not reproducible locally, but happens 100% of the time when building our documentation in the readthedocs.org runner.
Can you serialize the result before running? I cannot reproduce this locally; if I can reproduce it, I'll know what it can be. The formatted query plan doesn't seem to have any scalar misuse.
That's where the error comes from. It checks at runtime if the literal is allowed to be broadcasted. This is only allowed if the literal is a scalar.
Can you serialize the result before running?
If you mean serialize to JSON before calling .collect(), this is what is in the reproducer of the current ticket (I updated it 45 min ago). What is puzzling is that even with that JSON I also cannot reproduce locally. It only ever happened on readthedocs, for whatever reason...
Could you compile from source with #18904?
That will try to print the expression that is at fault.
I'll give it a go but this will probably take a while. Are the compiled binaries statically linked and fully portable? I have no idea what libc is used on that runner, and compiling in situ is impossible because of build timeouts.
Ok, will patch tonight with an option to temporarily silence this error. Note that it will become a hard error in the future, but hopefully with the new better error message we can find the culprit.
You can then silence it by setting POLARS_ALLOW_NON_SCALAR_EXP="1". But first watch the improved error message for the expression. ;)
I tried to build a wheel and commit it to a branch to try it out:
# build the wheel with a reasonable size
maturin build -m py-polars/Cargo.toml --strip
The DSO shipped in the .whl file seems to only depend on glibc, but I haven't checked what runs in the CI runner.
>>> ldd target/wheels/polars.abi3.so
linux-vdso.so.1 (0x00007fffa7195000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000760bfb071000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000760bea000000)
/lib64/ld-linux-x86-64.so.2 (0x0000760bfb0bb000)
And then it failed to install in the readthedocs runner for some unknown reason:
ERROR: polars-1.8.1-cp38-abi3-manylinux_2_39_x86_64.whl is not a supported wheel on this platform.
So I just re-ran the job, which installed 1.8.2 and the error stays:
polars.exceptions.InvalidOperationError: Series: line, length 1 doesn't match the DataFrame height of 3
If you want this Series to be broadcasted, ensure it is a scalar (for instance by adding '.first()').
https://readthedocs.org/api/v2/build/25738204.txt
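Regarding the "is not a supported wheel on this platform" rejection above: the manylinux_2_39 tag encodes the glibc version of the build machine, so an older runner image will refuse the wheel. A sketch of how to check this, using the packaging library (which pip vendors), comparing the wheel's tag against the tags the target interpreter accepts:

```python
from packaging.tags import sys_tags

# A wheel installs only if one of its tags is in this set.
# manylinux_2_39_x86_64 roughly means "needs glibc >= 2.39".
supported = {str(t) for t in sys_tags()}
print("cp38-abi3-manylinux_2_39_x86_64" in supported)
```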
I tried with POLARS_ALLOW_NON_SCALAR_EXP=1 just before collect():
import os
os.environ['POLARS_ALLOW_NON_SCALAR_EXP'] = '1'
df = df.collect()
And still get the same error: https://readthedocs.org/api/v2/build/25738487.txt
So this makes me wonder if either the error is coming from another place, or v1.8.2 does not have this code in it somehow. Pip does install 1.8.2 and nothing else according to the log.
EDIT: this makes me realize that setting os.environ is probably useless, as polars will use the Rust code, which is probably looking at the environment vector (directly or via libc). Modifying os.environ does not modify that; it only lives in the Python interpreter's world.
EDIT 2: I used the RTD facility to set an env var for the whole runner before any code runs, and I still get the same error...
Can you set POLARS_PANIC_ON_ERR=1 and RUST_BACKTRACE=1? This will give us the full stacktrace, so we can see where it occurs.
EDIT: this makes me realize that setting os.environ is probably useless as polars will use the Rust code, which is probably looking at the environment vector (directly or via libc). Modifying os.environ does not modify that, it only lives in the Python interpreted world.
You must set it before Polars is imported.
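A sketch of the ordering that works: put the variables into the process environment before polars is first imported. (Note: CPython's os.environ does write through to the process environment via putenv, but configuration read at module-initialization time is only picked up if the variables are set before the first import.)

```python
import os

# Set these before the polars extension module is first imported,
# so the Rust side sees them at startup:
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"

# import polars as pl  # must happen only after the env setup above
```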
Here we go:
Hmm... I don't understand where the parquet writer comes from? Is that somewhere else?
I think this failure is down the line in another place, on another dataframe. So enabling these options "fixed" the issue reported here. I'll make another run and see whether that behavior is stable.
Did another run, same output failing in sink_parquet(): https://readthedocs.org/api/v2/build/25750324.txt
I re-ran the code with the sink_parquet() removed, and the issue is the same as before, now with polars 1.9.0: https://readthedocs.org/api/v2/build/25824188.txt
File "/home/docs/checkouts/readthedocs.org/user_builds/lisa-linux-integrated-system-analysis/envs/test-rtd/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2050, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: Series: line, length 1 doesn't match the DataFrame height of 3
And the verbose backtrace:
Alright, I know where it happens now. I added the same trick, so you should (hopefully) get an error message showing the expression that caused it, and a way to temporarily silence the error. Can you come back with the faulting expression after the next release?
Sounds good, thanks
Looks like I get another issue now with 1.11, related to the type of the column (Time should be u64, not a string): https://readthedocs.org/api/v2/build/26067058.txt
I'll investigate further, with enough luck that can be reproduced locally and it's possibly hiding the original issue we discussed here.
@ritchie46 I tried locally with polars 1.11 both with Python 3.11 and 3.12 and it runs without problems. It only fails in the CI, so I guess this is just the new manifestation of the same underlying issue:
https://readthedocs.org/api/v2/build/26068615.txt
pyo3_runtime.PanicException: unexpected value while building Series of type Int64; found value of type String: "Time"
This happens while converting the following pandas df to polars:
pl.from_pandas(df, include_index=True)
__cpu __pid frequency cpu __comm
Time
0.004580 5 0 450000 0 <idle>
0.004582 5 0 450000 3 <idle>
0.004585 5 0 450000 4 <idle>
0.004587 5 0 450000 5 <idle>
0.050529 0 3807 575000 0 sshd
... ... ... ... ... ...
2.484556 5 0 450000 3 <idle>
2.484558 5 0 450000 4 <idle>
2.484560 5 0 450000 5 <idle>
2.496534 1 5741 950000 1 sshd
2.496535 1 5741 950000 2 sshd
[352 rows x 5 columns]
df.info() shows this:
<class 'pandas.core.frame.DataFrame'>
Index: 352 entries, 0.00457992 to 2.4965352800000002
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 __cpu 352 non-null uint32
1 __pid 352 non-null int32
2 frequency 352 non-null uint32
3 cpu 352 non-null uint32
4 __comm 352 non-null category
dtypes: category(1), int32(1), uint32(3)
memory usage: 9.3 KB
And the index is float64:
Index([ 0.00457992, 0.0045823800000000005, 0.00458452,
0.00458656, 0.05052936, 0.05053342,
0.05053676, 0.05054004, 0.060652920000000006,
0.06065624000000001,
...
2.47257768, 2.47257974, 2.48102298,
2.48102532, 2.4845535620000003, 2.48455628,
2.48455826, 2.4845601, 2.49653368,
2.4965352800000002],
dtype='float64', name='Time', length=352)
EDIT: s/include_index='Time'/include_index=True/
Re-opening https://github.com/pola-rs/polars/issues/18719 as it is still failing in v1.8.1
Checks
Reproducible example
Same reproducer as on https://github.com/pola-rs/polars/issues/18719 but re-ran with 1.8.1:
Log output
Issue description
Collecting that LazyFrame triggers an exception in the readthedocs CI but not locally, even after re-creating the same environment (pip freeze). The only material difference I can think of is some StringCache() state that is different for whatever reason.
Note that this issue only started occurring from polars 1.7.0. Before that, the code was working.
Also note that the JSON plan is only there to make reproduction of the issue easier (both for me to extract the data from the CI log and for that bug report). The issue originally happened without that JSON layer (at least not at this spot). I also ended up trying the reported reproducer verbatim both in the CI and locally, with the same result (fails in the CI, succeeds locally).
Expected behavior
This should work or not work, but consistently everywhere. Most likely work.
Installed versions
Polars upgraded to 1.8.1 compared to initial report.