I am trying to get a join parquet file where the column "from_address" and "to_address" of "valid_trxn_160wk.parquet" match with the respective column in ""reg_sender_160wk.parquet" and ""reg_sender_160wk.parquet", repectively. However, when I try to access the result file, "considered_trxn_160wk.parquet", it shows me "Invalid thrift: protocol error", I tried to check whether all three initial files are corrupted by trying to read it and it works. Also, accessing the three files and make query with
works perfectly find. Except the memory usage is huge and almost consume all of my memory capacity. That's why I sink it.
So, I am not sure what happended here.
Checks
Reproducible example
Log output
No response
Issue description
I am trying to get a join parquet file where the column "from_address" and "to_address" of "valid_trxn_160wk.parquet" match with the respective column in ""reg_sender_160wk.parquet" and ""reg_sender_160wk.parquet", repectively. However, when I try to access the result file, "considered_trxn_160wk.parquet", it shows me "Invalid thrift: protocol error", I tried to check whether all three initial files are corrupted by trying to read it and it works. Also, accessing the three files and make query with
works perfectly find. Except the memory usage is huge and almost consume all of my memory capacity. That's why I sink it. So, I am not sure what happended here.
Error
ComputeError Traceback (most recent call last) Cell In[4], line 1 ----> 1 considered_trxn_r = (pl.scan_parquet("considered_trxn_160wk.parquet") 2 .with_columns(('0x'+ pl.col("transaction_hash").bin.encode("hex")).alias("encoded_th")) 3 .with_columns(('0x'+ pl.col("from_address").bin.encode("hex")).alias("from")) 4 .with_columns(('0x'+ pl.col("to_address").bin.encode("hex")).alias("to")) 5 .select(pl.col("block_number","encoded_th", "value_f64", "gas_used", "from", "to", 6 "n_input_bytes", "n_input_zero_bytes", "n_input_nonzero_bytes")) 7 .collect(streaming=True) 8 )
File ~/jax_env/lib/python3.10/site-packages/polars/_utils/deprecation.py:134, in deprecate_renamed_parameter..decorate..wrapper(*args, kwargs)
129 @wraps(function)
130 def wrapper(*args: P.args, *kwargs: P.kwargs) -> T:
131 _rename_keyword_argument(
132 old_name, new_name, kwargs, function.name, version
133 )
--> 134 return function(args, kwargs)
File ~/jax_env/lib/python3.10/site-packages/polars/_utils/deprecation.py:134, in deprecate_renamed_parameter..decorate..wrapper(*args, kwargs)
129 @wraps(function)
130 def wrapper(*args: P.args, *kwargs: P.kwargs) -> T:
131 _rename_keyword_argument(
132 old_name, new_name, kwargs, function.name, version
133 )
--> 134 return function(args, kwargs)
File ~/jax_env/lib/python3.10/site-packages/polars/io/parquet/functions.py:394, in scan_parquet(source, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, hive_schema, rechunk, low_memory, cache, storage_options, retries) 391 else: 392 source = [normalize_filepath(source) for source in source] --> 394 return _scan_parquet_impl( 395 source, 396 n_rows=n_rows, 397 cache=cache, 398 parallel=parallel, 399 rechunk=rechunk, 400 row_index_name=row_index_name, 401 row_index_offset=row_index_offset, 402 storage_options=storage_options, 403 low_memory=low_memory, 404 use_statistics=use_statistics, 405 hive_partitioning=hive_partitioning, 406 hive_schema=hive_schema, 407 retries=retries, 408 )
File ~/jax_env/lib/python3.10/site-packages/polars/io/parquet/functions.py:454, in _scan_parquet_impl(source, n_rows, cache, parallel, rechunk, row_index_name, row_index_offset, storage_options, low_memory, use_statistics, hive_partitioning, hive_schema, retries) 450 else: 451 # Handle empty dict input 452 storage_options = None --> 454 pylf = PyLazyFrame.new_from_parquet( 455 source, 456 sources, 457 n_rows, 458 cache, 459 parallel, 460 rechunk, 461 parse_row_index_args(row_index_name, row_index_offset), 462 low_memory, 463 cloud_options=storage_options, 464 use_statistics=use_statistics, 465 hive_partitioning=hive_partitioning, 466 hive_schema=hive_schema, 467 retries=retries, 468 ) 469 return wrap_ldf(pylf)
ComputeError: parquet: File out of specification: Invalid thrift: protocol error
Expected behavior
The "considered_trxn_160wk.parquet" can be read and further processed with pl.scan_parquet or pl.read_parquet
Installed versions