jeppe742 / DeltaLakeReader

Read Delta tables without any Spark
Apache License 2.0
47 stars 14 forks

Cannot load Delta Table from S3 through AWS Lambda #53

Open ISimion opened 10 months ago

ISimion commented 10 months ago

Unfortunately, I cannot use the latest s3fs because the latest delta-lake-reader[aws]==0.2.14 requires s3fs < 2023, and on s3fs==2022.11.0, I am getting a known issue with s3fs.

I do not know why that issue was closed, since it has affected many people, even on one of the more recent versions, i.e. 2023.1.0.

Also, I would like to point out that my lambdas have full policy permissions on all S3 objects and buckets through the IAM role.

Could an update by any chance be released that allows the latest s3fs==2023.10.0, so that I would know to address this as an s3fs issue, please?

[ERROR] PermissionError: Forbidden
Traceback (most recent call last):
  File "/var/task/inquire-data-set.py", line 118, in lambda_handler
    dt = DeltaTable(s3_path, file_system=fs)
  File "/var/lang/lib/python3.10/site-packages/deltalake/deltatable.py", line 40, in __init__
    if not self._is_delta_table():
  File "/var/lang/lib/python3.10/site-packages/deltalake/deltatable.py", line 62, in _is_delta_table
    return self.filesystem.exists(f"{self.log_path}")
  File "/var/lang/lib/python3.10/site-packages/fsspec/asyn.py", line 113, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/var/lang/lib/python3.10/site-packages/fsspec/asyn.py", line 98, in sync
    raise return_result
  File "/var/lang/lib/python3.10/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/var/lang/lib/python3.10/site-packages/s3fs/core.py", line 946, in _exists
    await self._info(path, bucket, key, version_id=version_id)
  File "/var/lang/lib/python3.10/site-packages/s3fs/core.py", line 1210, in _info
    out = await self._call_s3(
  File "/var/lang/lib/python3.10/site-packages/s3fs/core.py", line 339, in _call_s3
    return await _error_wrapper(
  File "/var/lang/lib/python3.10/site-packages/s3fs/core.py", line 139, in _error_wrapper
    raise err
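
The traceback shows the failure happening in the existence check on the table's log path, before any data is read. A minimal sketch for isolating that check, assuming the log lives under `_delta_log` (the Delta Lake convention) and that `fs` is any fsspec-style filesystem with an `exists` method. `probe_delta_log` is a hypothetical helper, not part of delta-lake-reader:

```python
def probe_delta_log(fs, table_path):
    """Report whether the table's _delta_log directory is reachable.

    Hypothetical helper: `fs` is any object with an fsspec-style
    `exists(path)` method, e.g. an s3fs.S3FileSystem instance.
    """
    # Assumption: the transaction log sits directly under the table root.
    log_path = f"{table_path.rstrip('/')}/_delta_log"
    try:
        found = fs.exists(log_path)
    except PermissionError as err:
        # The traceback above shows s3fs raising PermissionError for "Forbidden".
        return f"forbidden: {log_path} ({err})"
    return f"{'found' if found else 'missing'}: {log_path}"
```

Calling `probe_delta_log(fs, s3_path)` inside the handler may help separate a permissions problem on the log prefix from a problem in the reader itself.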

jeppe742 commented 10 months ago

Thanks @ISimion. I will try to have a look tomorrow.

jeppe742 commented 10 months ago

@ISimion I published a new version. Let me know if it fixes your issue.

ISimion commented 10 months ago

@jeppe742 Thank you for upgrading that library. I retried the code with the updated delta-lake-reader[aws]==0.2.16, which now includes the latest s3fs==2023.10.0.

Unfortunately, the code now breaks on another requirement. The latest delta-lake-reader installs botocore==1.31.64; however, if I try to run an AWS Lambda from an image using the official Python 3.10 runtime, the botocore version is forced to 1.29.90, which causes the following failure:

[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda-name-function': No module named 'botocore.compress'

I do not know if there are other corner cases like this, but for sure, a lambda in an official Python 3.10 image won't work with the delta-lake-reader==0.2.16 library at this point.
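
A mismatch like this can be confirmed from inside the function before anything else is imported. The following is a minimal sketch using only the standard library; `installed_version`, `module_available`, and `report` are hypothetical helper names, and the package names are the ones mentioned in this thread:

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_available(module_name):
    """Return True if `module_name` can be located on the current path.

    For a dotted name, the parent package is imported to locate the
    submodule; the submodule itself is not imported.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `botocore`) is itself missing.
        return False


def report():
    # Package names taken from the thread; extend as needed.
    for dist in ("botocore", "boto3", "s3fs"):
        print(f"{dist}: {installed_version(dist)}")
    print("botocore.compress importable:", module_available("botocore.compress"))
```

Calling `report()` at the top of the handler should show which versions the runtime actually sees, which may differ from what was requested at build time.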

Maybe another issue has to be opened?

jeppe742 commented 10 months ago

Hey @ISimion, just so I understand: when you got the initial error, you were using s3fs==2022.11.0?

If I try to find the latest version of s3fs compatible with botocore==1.29.90, I have to go all the way back to s3fs==0.4.2, which was released back in 2020. That seems pretty old to me. Maybe you were just lucky that s3fs==2022.11.0 worked with botocore==1.29.90, despite technically not being compatible? So I'm not sure there is a nice way to handle it from my side, unless I'm missing something.

I have no experience with AWS Lambda, but isn't it possible to define your own dependencies, including botocore?

ISimion commented 10 months ago

Hey @jeppe742

I will answer all your questions in order.

  1. Your assessment is correct. And yes, using s3fs==0.4.2 just to have botocore==1.29.90 is not a solution.

  2. It might be the case that I was lucky.

  3. I do not think you are missing anything. At this point, I have all the reasons to believe that this issue is related more to s3fs, so I will address it there. Since I was using s3fs through the library you provided, I believe it was only fair to ask you first for assistance.

  4. When using AWS Lambda through the officially provided AWS runtime image (i.e. the official VM running Amazon Linux 2 with Python 3.10, botocore, and boto3 installed), it seems that one only gets the botocore and boto3 versions specified by AWS, i.e. botocore==1.29.90 and boto3==1.26.90, even when I tried to force-install other versions on that machine.

Your point is valid: theoretically, I could set up a machine with an operating system and requirements of my choosing and put the AWS Lambda image on it, but that would be too much of a burden just to make a lambda work. Before taking that step, I wanted to rule out any other way of reading Delta tables through your library on the official AWS runtime.
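
One way to investigate why a force-installed version appears to be ignored is to check which copy of a package Python actually resolves. The following is a minimal sketch using only the standard library; `module_origin` and `describe_resolution` are hypothetical helper names. If the reported path points at a directory owned by the runtime rather than the one pip installed into, the runtime copy may be shadowing the newer one. That explanation is an assumption to verify, not something this thread confirms:

```python
import importlib.util
import sys


def module_origin(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # A parent package of a dotted name is missing.
        return None
    return spec.origin if spec is not None else None


def describe_resolution(module_name="botocore"):
    # Print the import search path in priority order, then the copy that wins.
    for index, entry in enumerate(sys.path):
        print(f"{index}: {entry}")
    print(f"{module_name} resolves to: {module_origin(module_name)}")
```

Running `describe_resolution()` in the handler and comparing the output with the install location would show whether the search order is the cause.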

As I said, at this point, it is not a delta-lake-reader issue but an s3fs issue, and I will address it there. Thank you so much for your time and involvement. I consider this issue closed and will let you do the honors.

jeppe742 commented 10 months ago

Thanks @ISimion. Hope you get it to work.

Alternatively, you can also try looking into delta-rs.

ISimion commented 10 months ago

Thank you as well. And yes, delta-rs is what I eventually ended up using.