piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Run smart_open[s3] without installing dependencies for other providers. #840

Open ODudek opened 1 month ago

ODudek commented 1 month ago

Problem description

Steps/code to reproduce the problem

requirements.txt

smart_open[s3]==7.0.4

test.py

from smart_open import open
import boto3

# `config` is a dict of credentials defined elsewhere in the application.
client = boto3.client(service_name='s3',
                      endpoint_url='xxx',
                      aws_access_key_id=config['access_key'],
                      region_name=config['region'],
                      aws_secret_access_key=config['secret_key'])

def open_stream(url: str, mode: str, args: dict):
    # Hand the preconfigured S3 client to smart_open via transport_params.
    return open(url,
        mode=mode,
        transport_params={
            'client': client,
            'client_kwargs': args
        }
    )

All I need to do is run the application, and right after starting, I get the error: pkg_resources.DistributionNotFound: The 'google-cloud-storage' distribution was not found and is required by the application

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
Linux-5.10.225-213.878.amzn2.x86_64-x86_64-with-debian-bullseye-sid
print("Python", sys.version)
Python 3.7.17 (default, Sep 19 2023, 14:13:00) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 7.0.4

ddelange commented 1 month ago

Can you provide the full error traceback?

ODudek commented 1 month ago

sure

  File "test.py", line 1, in <module>
    from smart_open import open
  File "/app/lib/smart_open/__init__.py", line 34, in <module>
    from .smart_open_lib import open, parse_uri, smart_open, register_compressor  # noqa: E402
  File "/app/lib/smart_open/smart_open_lib.py", line 35, in <module>
    from smart_open import doctools
  File "/app/lib/smart_open/doctools.py", line 21, in <module>
    from . import transport
  File "/app/lib/smart_open/transport.py", line 101, in <module>
    register_transport("smart_open.gcs")
  File "/app/lib/smart_open/transport.py", line 49, in register_transport
    submodule = importlib.import_module(submodule)
  File "/opt/sdk/python_3.7.17.3_x86_64/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/app/lib/smart_open/gcs.py", line 15, in <module>
    import google.cloud.storage
  File "/app/lib/google/cloud/storage/__init__.py", line 36, in <module>
    __version__ = get_distribution("google-cloud-storage").version
  File "/opt/sdk/python_3.7.17.3_x86_64/lib/python3.7/site-packages/pkg_resources/__init__.py", line 482, in get_distribution
    dist = get_provider(dist)
  File "/opt/sdk/python_3.7.17.3_x86_64/lib/python3.7/site-packages/pkg_resources/__init__.py", line 358, in get_provider
    return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
  File "/opt/sdk/python_3.7.17.3_x86_64/lib/python3.7/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/opt/sdk/python_3.7.17.3_x86_64/lib/python3.7/site-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'google-cloud-storage' distribution was not found and is required by the application

ddelange commented 1 month ago

It looks like /app/lib/google/cloud/storage/__init__.py is present on your system, but was possibly not installed via pip, or is otherwise not discoverable by pkg_resources.

Can you try running pip uninstall google-cloud-storage? Does that uninstall anything? Is the file still present on your system afterwards? If not, does your snippet start working?
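
The two conditions in question (the module can be found on the import path, and the distribution metadata can be discovered) can also be checked separately with a short script. This is only a sketch: `diagnose` is a hypothetical helper, not part of smart_open, and it uses `importlib.metadata`, which requires Python 3.8+ (on 3.7 the `importlib_metadata` backport would be needed instead).

```python
import importlib.util
from importlib import metadata  # standard library on Python 3.8+


def diagnose(import_name, dist_name):
    """Return (module_found, metadata_found) for a package.

    Hypothetical helper for illustration only.
    """
    try:
        # find_spec imports parent packages, but not the target module itself.
        module_found = importlib.util.find_spec(import_name) is not None
    except (ImportError, ValueError):
        module_found = False

    try:
        metadata.version(dist_name)
        metadata_found = True
    except metadata.PackageNotFoundError:
        metadata_found = False

    return module_found, metadata_found


if __name__ == "__main__":
    print(diagnose("google.cloud.storage", "google-cloud-storage"))
```

A result where the module is found but the metadata is not would be consistent with files that were copied into /app/lib without a proper installation.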

ddelange commented 1 month ago

smart_open only catches ImportError to skip libraries that aren't installed. On your system, google-cloud-storage can apparently be (partially) imported but isn't properly installed: when the library checks its own installation during import, it raises DistributionNotFound rather than ImportError, so the error isn't caught and becomes a hard failure.
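
To illustrate the difference, here is a simplified sketch of an "optional import" guard. It is not smart_open's actual code: `try_register`, `broken_importer` and `FakeDistributionNotFound` are made-up names, with the fake exception standing in for pkg_resources.DistributionNotFound, which does not derive from ImportError.

```python
import importlib


class FakeDistributionNotFound(Exception):
    """Stand-in for pkg_resources.DistributionNotFound (not an ImportError)."""


def try_register(name, importer=importlib.import_module):
    """Import an optional dependency; return False if it is not installed."""
    try:
        importer(name)
    except ImportError:
        # A missing module is expected for optional extras, so skip it quietly.
        return False
    return True


def broken_importer(name):
    """Simulate a module that is importable but fails its own install check."""
    raise FakeDistributionNotFound(name)


if __name__ == "__main__":
    print(try_register("no_such_module_xyz"))  # missing module: skipped
    try:
        try_register("google.cloud.storage", importer=broken_importer)
    except FakeDistributionNotFound as err:
        print("not caught by the guard:", err)
```

Any exception that is not an ImportError passes straight through the guard, which matches the traceback above.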

ODudek commented 1 month ago

I’ll try it later, but I’m curious why smart_open is looking for those packages when I only want to use S3.

ddelange commented 1 month ago

It does so in order to populate smart_open.transport.SUPPORTED_SCHEMES (and the underlying _REGISTRY, both used by get_transport), which has been part of the public API since v1.11.0.
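
Roughly, the idea is that the list of schemes can only be built once every transport module has been imported. The sketch below is a toy registry under that assumption, not the real smart_open implementation: the fake modules, their `SCHEMES` attribute and the `register` helper are hypothetical, and only the names SUPPORTED_SCHEMES, _REGISTRY and get_transport are borrowed from the comment above.

```python
import types

SUPPORTED_SCHEMES = ()
_REGISTRY = {}


def register(module):
    """Record every scheme that a transport module declares."""
    global SUPPORTED_SCHEMES
    for scheme in module.SCHEMES:
        _REGISTRY[scheme] = module
    SUPPORTED_SCHEMES = tuple(sorted(_REGISTRY))


def get_transport(scheme):
    """Look up the module responsible for a URL scheme."""
    try:
        return _REGISTRY[scheme]
    except KeyError:
        raise NotImplementedError(f"scheme {scheme!r} is not supported") from None


def make_fake_transport(name, schemes):
    """Build a throwaway module object standing in for a transport submodule."""
    module = types.ModuleType(name)
    module.SCHEMES = schemes
    return module


# Registration happens eagerly, so the scheme list is complete right away.
register(make_fake_transport("fake_s3", ("s3",)))
register(make_fake_transport("fake_gcs", ("gs",)))
```

With this layout, a scheme whose module was never registered simply does not appear in SUPPORTED_SCHEMES, and get_transport raises for it.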