GBMsejimenez opened this issue 3 months ago
Can you set POLARS_PANIC_ON_ERR=1 and RUST_BACKTRACE=1 and show us the backtrace log?
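For anyone following along, a minimal sketch of one way to set these flags (they can also be configured as environment variables on the Lambda function itself; RUST_BACKTRACE also accepts "full" for more detail):

import os

# Set the flags as process environment variables so the Rust side of
# polars can see them when the query runs.
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"  # or "full" for a more detailed trace

import polars as pl  # import after the environment is configured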
Hi, I'm not sure if I should start another issue for this, but I'm pretty sure I'm having the same issue. When running inside an AWS Lambda, I am able to read a CSV and write it to a Parquet file using read_csv and write_parquet, but I have no such luck with scan_csv and sink_parquet. I'm getting the same type of error and have tried the same methods to solve it as @GBMsejimenez.
I've gotten the code down to the bare minimum needed to reproduce the error (the CSV file being tested consists of only a header and two lines of data, and the bucket and path in the file name have been edited out).
import json
import os

# These must be process environment variables (not plain Python
# variables) for the Rust side of polars to pick them up.
os.environ["POLARS_PANIC_ON_ERR"] = "1"
os.environ["RUST_BACKTRACE"] = "1"

import polars as pl
import s3fs

# Lambda entry
def lambda_handler(event, context):
    pl.show_versions()
    csv_file = 's3://{BUCKET}/{PATH}/test.csv'
    #parquet_file = 's3://{BUCKET}/{PATH}/test.parquet'
    fs = s3fs.S3FileSystem(anon=False)
    df = pl.scan_csv(csv_file).collect(streaming=True)
    return {
        'statusCode': 200,
        'body': json.dumps("Finished")
    }
This gives me the following error (with {BUCKET} and {PATH} replaced by actual values):
[ERROR] ComputeError: failed to allocate 1343 bytes to download uri = s3://{BUCKET}/{PATH}/test.csv
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 40, in lambda_handler
df = pl.scan_csv(csv_file).collect()
File "/opt/python/polars/lazyframe/frame.py", line 2027, in collect
return wrap_df(ldf.collect(callback))
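For comparison, the eager path that does work in the same Lambda looks roughly like this (a sketch; the /tmp output path is my assumption, since that's the only writable local path in Lambda):

import polars as pl
import s3fs

fs = s3fs.S3FileSystem(anon=False)

# read_csv goes through the s3fs file object here rather than polars'
# native downloader, and completes without the allocation error.
with fs.open('s3://{BUCKET}/{PATH}/test.csv', 'rb') as f:
    df = pl.read_csv(f)

df.write_parquet('/tmp/test.parquet')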
Here are my polars versions, if necessary:
@wjglenn3 I'm experiencing the same issue when using a Docker-container-based Lambda.
Hey, we're experiencing the same issue with Docker in AWS Lambda; we've attempted all of the combinations suggested above.
I also tried installing s3fs (which read_csv requires), but it still breaks with the error:
ComputeError : failed to allocate 12345 bytes to download uri = s3://...
Here's my minimal example that breaks:
import asyncio

import boto3
import polars as pl
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

session = boto3.session.Session(region_name="us-west-2")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}

async def do():
    df = pl.scan_csv(
        "s3://.../*.csv",  # example path
        storage_options=storage_options,
    ).collect()
    print(df)

def lambda_handler(event, context):
    uvloop.run(do())
    return "OK"
@alexander-beedie could you please be so kind as to look into this issue?
Thank you for your efforts!
Hi there,
Not sure that it matters, but I'm having the exact same issue using a Docker image in AWS Lambda when collecting my LazyFrame. Hopefully the more people report running into this, the more the fix can be prioritized...
The LazyFrame explain plan is:
WITH_COLUMNS:
......
 SELECT
 ...........
 FROM
  WITH_COLUMNS:
  [false.alias("monthly_export_origin")
  , String(abfs://.../.../../filename.csv).alias("export_filename")
  , String(2024-11-22T10:17:11.599+00:00).str.strptime([String(raise)]).alias("rec_inserted")]
   Csv SCAN [abfs://.../.../../filename.csv]
   PROJECT 65/65 COLUMNS
I have cut the selected columns and some simple with_columns statements out of this execution plan, as well as the exact abfs path and filename, but it is scanning a CSV file from an Azure container. The code runs fine locally, but not within Lambda, with the exact same error message as described above: failed to allocate 122128901 bytes to download uri = abfs://.../.../../filename.csv
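For anyone wanting to capture the same output, the plan above is what LazyFrame.explain() returns; a minimal sketch with a placeholder path and expression:

import polars as pl

lf = pl.scan_csv("abfs://{CONTAINER}/{PATH}/filename.csv").with_columns(
    pl.lit(False).alias("monthly_export_origin")  # placeholder expression
)
print(lf.explain())  # renders the query plan without executing the scan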
Cheers!
Hi,
From my understanding and my attempts, there's a bug that prevents scan_csv from working inside Lambda's Docker environment. Hopefully someone can give more context here.
It might be a bug on Lambda's side rather than polars'? I have premium support there, so I might create a ticket for AWS to investigate from their side. Will let you know in this thread if and when anything comes out of that...
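A small debugging sketch that could help rule that out before the ticket (what to log here is an assumption on my part, not a known diagnosis):

import resource

def lambda_handler(event, context):
    # Log the configured Lambda memory and the process address-space
    # rlimit; the failing allocations reported above are far smaller
    # than either, which would point away from simple memory exhaustion.
    print("memory_limit_in_mb:", context.memory_limit_in_mb)
    print("RLIMIT_AS:", resource.getrlimit(resource.RLIMIT_AS))
    return "OK"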
Checks
Reproducible example
Log output
Issue description
I'm new to Polars and attempting to implement an RFM analysis using the library. As part of my proposed architecture, I need to run the code in an AWS Lambda function. I've successfully implemented the RFM analysis and uploaded the code to Lambda using a Docker image.
Despite the code running successfully in my local container, I'm encountering a "failed to allocate 25954093 bytes" error when running it in the Lambda function. While troubleshooting I've ruled out credential errors, since the scan_csv call itself doesn't throw any errors, and I've also tried explicitly passing AWS credentials to scan_csv.
Attempts to Resolve
I've attempted to apply the solutions from issues #7774 and #1777, including (see the sketch below):
- Setting streaming=True on the collect method
- Defining my schema columns as pl.Utf8 or pl.Int64
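A sketch of the variant with both changes applied (column names and the path are placeholders; older polars releases name the schema parameter dtypes rather than schema_overrides):

import polars as pl

lf = pl.scan_csv(
    "s3://{BUCKET}/{PATH}/data.csv",  # placeholder path
    schema_overrides={"customer_id": pl.Utf8, "amount": pl.Int64},  # placeholder columns
)
# Still fails in Lambda with the same allocation error.
df = lf.collect(streaming=True)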
Thanks in advance 🤗
Expected behavior
The Polars code should work seamlessly in the Lambda function, just like it does on the local container, without any memory allocation errors.
Installed versions