GBMsejimenez opened this issue 2 months ago
Can you set `POLARS_PANIC_ON_ERR=1` and `RUST_BACKTRACE=1` and show us the backtrace log?
Hi, I'm not sure if I should start another issue for this, but I'm pretty sure I'm having the same problem. When running inside an AWS Lambda, I can read a CSV and write it to a Parquet file using `read_csv` and `write_parquet`, but I have no such luck with `scan_csv` and `sink_parquet`. I'm getting the same type of error and have tried the same fixes as @GBMsejimenez.
I've gotten the code down to the bare minimum needed to reproduce the error (the CSV file being tested consists of only a header and two lines of data, and the bucket and path in the file name have been edited out).
```python
import json
import os

# Note: plain Python assignments like `POLARS_PANIC_ON_ERR = 1` have no
# effect; these must be environment variables, set before polars is imported.
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"

import polars as pl
import s3fs


# Lambda entry
def lambda_handler(event, context):
    pl.show_versions()
    csv_file = 's3://{BUCKET}/{PATH}/test.csv'
    # parquet_file = 's3://{BUCKET}/{PATH}/test.parquet'
    fs = s3fs.S3FileSystem(anon=False)
    df = pl.scan_csv(csv_file).collect(streaming=True)
    return {
        'statusCode': 200,
        'body': json.dumps("Finished")
    }
```
This gives me the following error (with `{BUCKET}` and `{PATH}` replaced by actual values):
```
[ERROR] ComputeError: failed to allocate 1343 bytes to download uri = s3://{BUCKET}/{PATH}/test.csv
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 40, in lambda_handler
    df = pl.scan_csv(csv_file).collect()
  File "/opt/python/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
```
My polars versions, if needed:
@wjglenn3 I'm experiencing the same issue when using a Docker container-based Lambda.
Hey, we are experiencing the same issue inside Docker on AWS Lambda; we have attempted all the combinations mentioned above. I also tried installing s3fs, which is needed for `read_csv`, but it also breaks with the error `ComputeError: failed to allocate 12345 bytes to download uri = s3://...`
Here's my minimal example that breaks:
```python
import asyncio

import boto3
import polars as pl
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

session = boto3.session.Session(region_name="us-west-2")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}


async def do():
    df = pl.scan_csv(
        "s3://.../*.csv",  # example path
        storage_options=storage_options,
    ).collect()
    print(df)


def lambda_handler(event, context):
    uvloop.run(do())
    return "OK"
```
@alexander-beedie could you please be so kind as to look into this issue?
Thank you for your efforts!
Checks
Reproducible example
Log output
Issue description
I'm new to Polars and attempting to implement an RFM analysis using the library. As part of my proposed architecture, I need to run the code in an AWS Lambda function. I've successfully implemented the RFM analysis and uploaded the code to Lambda using a Docker image.
Despite the code running successfully in my local container, I'm encountering a "failed to allocate 25954093 bytes" error when running it in the Lambda function. I've tried to troubleshoot the issue: I ruled out credential errors, since the `scan_csv` call itself doesn't throw, and I also tried explicitly passing AWS credentials to `scan_csv`.
Attempts to Resolve
I've attempted to apply solutions from issues #7774 and #1777, including:

- Setting `streaming=True` on the `collect` method
- Defining my schema columns explicitly (e.g. as `pl.Utf8` or `pl.Int64`)
Thanks in advance 🤗
Expected behavior
The Polars code should work seamlessly in the Lambda function, just like it does on the local container, without any memory allocation errors.
Installed versions