pdm-project / pdm

A modern Python package and dependency manager supporting the latest PEP standards
https://pdm-project.org
MIT License

Option to use extra sources only when dependency not present in PyPI #1509

Closed: ferminho closed 1 year ago

ferminho commented 1 year ago

Is your feature request related to a problem? Please describe.

pdm lock can be very slow for a large pyproject when one of the libraries requires an extra index-url or find-links source, since it checks every source for every library.

Concrete example that goes from ~30min without extra sources to 3-4h with the extra source: a pyproject reflecting the libraries in Databricks, one of those being PyTorch which requires an external source.

[project]
name = "databricks"
version = "1.0.0"
description = "Databricks DBR11.3 Python environment"
authors = [
    {name = "Test user", email = "test@test.com"},
]

dependencies = [
    "absl-py==1.0.0", 
    "argon2-cffi==20.1.0", 
    "astor==0.8.1", 
    "astunparse==1.6.3", 
    "async-generator==1.10", 
    "attrs==21.2.0", 
    "azure-core==1.22.1", 
    "azure-cosmos==4.2.0", 
    "backcall==0.2.0", 
    "backports.entry-points-selectable==1.1.1", 
    "bcrypt==4.0.0", 
    "black==22.3.0", 
    "bleach==4.0.0", 
    "blis==0.7.8", 
    "boto3==1.21.18", 
    "botocore==1.24.18", 
    "cachetools==5.2.0", 
    "catalogue==2.0.8", 
    "certifi==2021.10.8", 
    "cffi==1.14.6", 
    "chardet==4.0.0", 
    "charset-normalizer==2.0.4", 
    "click==8.0.3", 
    "cloudpickle==2.0.0", 
    "cmdstanpy==0.9.68", 
    "confection==0.0.1", 
    "configparser==5.2.0", 
    "convertdate==2.4.0", 
    "cryptography==3.4.8", 
    "cycler==0.10.0", 
    "cymem==2.0.6", 
    "Cython==0.29.24", 
    "databricks-automl-runtime==0.2.11", 
    "databricks-cli==0.17.3", 
    "dbl-tempo==0.1.12", 
    "dbus-python==1.2.16", 
    "debugpy==1.4.1", 
    "decorator==5.1.0", 
    "defusedxml==0.7.1", 
    "dill==0.3.4", 
    "diskcache==5.4.0", 
    "distlib==0.3.6", 
    "distro==1.4.0", 
    "entrypoints==0.3", 
    "ephem==4.1.3", 
    "facets-overview==1.0.0", 
    "fasttext==0.9.2", 
    "filelock==3.3.1", 
    "Flask==1.1.2", 
    "flatbuffers==1.12", 
    "fsspec==2021.8.1", 
    "future==0.18.2", 
    "gast==0.4.0", 
    "gitdb==4.0.9", 
    "GitPython==3.1.27", 
    "google-auth==2.6.0", 
    "google-auth-oauthlib==0.4.6", 
    "google-pasta==0.2.0", 
    "grpcio==1.44.0", 
    "gunicorn==20.1.0", 
    "gviz-api==1.10.0", 
    "h5py==3.3.0", 
    "hijri-converter==2.2.4", 
    "holidays==0.15", 
    "horovod==0.25.0", 
    "htmlmin==0.1.12", 
    "huggingface-hub==0.9.1", 
    "idna==3.2", 
    "ImageHash==4.3.0", 
    "imbalanced-learn==0.8.1", 
    "importlib-metadata==4.8.1", 
    "ipykernel==6.12.1", 
    "ipython==7.32.0", 
    "ipython-genutils==0.2.0", 
    "ipywidgets==7.7.0", 
    "isodate==0.6.1", 
    "itsdangerous==2.0.1", 
    "jedi==0.18.0", 
    "Jinja2==2.11.3", 
    "jmespath==0.10.0", 
    "joblib==1.0.1", 
    "joblibspark==0.5.0", 
    "jsonschema==3.2.0", 
    "jupyter-client==6.1.12", 
    "jupyter-core==4.8.1", 
    "jupyterlab-pygments==0.1.2", 
    "jupyterlab-widgets==1.0.0", 
    "keras==2.9.0", 
    "Keras-Preprocessing==1.1.2", 
    "kiwisolver==1.3.1", 
    "korean-lunar-calendar==0.3.1", 
    "langcodes==3.3.0", 
    "libclang==14.0.6", 
    "lightgbm==3.3.2", 
    "llvmlite==0.37.0", 
    "LunarCalendar==0.0.9", 
    "Mako==1.2.0", 
    "Markdown==3.3.6", 
    "MarkupSafe==2.0.1", 
    "matplotlib==3.4.3", 
    "matplotlib-inline==0.1.2", 
    "missingno==0.5.1", 
    "mistune==0.8.4", 
    "mleap==0.20.0", 
    "mlflow-skinny==1.29.0", 
    "multimethod==1.9", 
    "murmurhash==1.0.8", 
    "mypy-extensions==0.4.3", 
    "nbclient==0.5.3", 
    "nbconvert==6.1.0", 
    "nbformat==5.1.3", 
    "nest-asyncio==1.5.1", 
    "networkx==2.6.3", 
    "nltk==3.6.5", 
    "notebook==6.4.5", 
    "numba==0.54.1", 
    "numpy==1.20.3", 
    "oauthlib==3.2.0", 
    "opt-einsum==3.3.0", 
    "packaging==21.0", 
    "pandas==1.3.4", 
    "pandas-profiling==3.1.0", 
    "pandocfilters==1.4.3", 
    "paramiko==2.9.2", 
    "parso==0.8.2", 
    "pathspec==0.9.0", 
    "pathy==0.6.2", 
    "patsy==0.5.2", 
    "petastorm==0.11.4", 
    "pexpect==4.8.0", 
    "phik==0.12.2", 
    "pickleshare==0.7.5", 
    "Pillow==8.4.0", 
    "platformdirs==2.5.2", 
    "plotly==5.9.0", 
    "pmdarima==1.8.5", 
    "preshed==3.0.7", 
    "prometheus-client==0.11.0", 
    "prompt-toolkit==3.0.20", 
    "prophet==1.0.1", 
    "protobuf==3.19.4", 
    "psutil==5.8.0", 
    "psycopg2==2.9.3", 
    "ptyprocess==0.7.0", 
    "pyarrow==7.0.0", 
    "pyasn1==0.4.8", 
    "pyasn1-modules==0.2.8", 
    "pybind11==2.10.0", 
    "pycparser==2.20", 
    "pydantic==1.9.2", 
    "Pygments==2.10.0", 
    "PyGObject==3.36.0", 
    "PyJWT==2.5.0", 
    "PyMeeus==0.5.11", 
    "PyNaCl==1.5.0", 
    "pyodbc==4.0.31", 
    "pyparsing==3.0.4", 
    "pyrsistent==0.18.0", 
    "pystan==2.19.1.1", 
    "python-dateutil==2.8.2", 
    "python-editor==1.0.4", 
    "pytz==2021.3", 
    "PyWavelets==1.1.1", 
    "PyYAML==6.0", 
    "pyzmq==22.2.1", 
    "regex==2021.8.3", 
    "requests==2.26.0", 
    "requests-oauthlib==1.3.1", 
    "requests-unixsocket==0.2.0", 
    "rsa==4.9", 
    "s3transfer==0.5.2", 
    "scikit-learn==0.24.2", 
    "scipy==1.7.1", 
    "seaborn==0.11.2", 
    "Send2Trash==1.8.0", 
    "setuptools-git==1.2", 
    "shap==0.41.0", 
    "simplejson==3.17.6", 
    "six==1.16.0", 
    "slicer==0.0.7", 
    "smart-open==5.2.1", 
    "smmap==5.0.0", 
    "spacy==3.4.1", 
    "spacy-legacy==3.0.10", 
    "spacy-loggers==1.0.3", 
    "spark-tensorflow-distributor==1.0.0", 
    "sqlparse==0.4.2", 
    "srsly==2.4.4", 
    "ssh-import-id==5.10", 
    "statsmodels==0.12.2", 
    "tabulate==0.8.9", 
    "tangled-up-in-unicode==0.1.0", 
    "tenacity==8.0.1", 
    "tensorboard==2.9.1", 
    "tensorboard-data-server==0.6.1", 
    "tensorboard-plugin-profile==2.8.0", 
    "tensorboard-plugin-wit==1.8.1", 
    "tensorflow-cpu==2.9.1", 
    "tensorflow-estimator==2.9.0", 
    "tensorflow-io-gcs-filesystem==0.27.0", 
    "termcolor==2.0.1", 
    "terminado==0.9.4", 
    "testpath==0.5.0", 
    "thinc==8.1.2", 
    "threadpoolctl==2.2.0", 
    "tokenize-rt==4.2.1", 
    "tokenizers==0.12.1", 
    "tomli==2.0.1", 
    "torch==1.12.1+cpu", 
    "torchvision==0.13.1+cpu", 
    "tornado==6.1", 
    "tqdm==4.62.3", 
    "traitlets==5.1.0", 
    "transformers==4.21.2", 
    "typer==0.4.2", 
    "typing-extensions==3.10.0.2", 
    "ujson==4.0.2", 
    "urllib3==1.26.7", 
    "virtualenv==20.8.0", 
    "visions==0.7.4", 
    "wasabi==0.10.1", 
    "wcwidth==0.2.5", 
    "webencodings==0.5.1", 
    "websocket-client==1.3.1", 
    "Werkzeug==2.0.2", 
    "widgetsnbextension==3.6.0", 
    "wrapt==1.12.1", 
    "xgboost==1.6.2", 
    "zipp==3.6.0"
]
requires-python = "==3.9.*"
license = {text = "MIT"}

[build-system]
requires = ["pdm-pep517>=1.0.0"]
build-backend = "pdm.pep517.api"

[tool.pdm.overrides]
scikit-learn = "0.24.2" 

[[tool.pdm.source]]
type = "index-url"
url = "https://download.pytorch.org/whl/cpu/torch_stable.html"
name = "pytorch"

The PyTorch index probably expects only PyTorch-related queries: it frequently returns 503s and 504s, which certainly contribute to the slowness (we may be saturating it or exceeding query quotas).

Describe the solution you'd like

A way to tell pdm to rely on PyPI and use extra sources only when a package is not found there (or a way to link specific dependencies to specific sources). Maybe there is already a way to do this that I don't know of.

ferminho commented 1 year ago

Updated to clarify both the problem and the proposal.

AdamJel commented 1 year ago

Hi, have you tried specifying resolution order?

From your description of the problem, this should solve it.

ferminho commented 1 year ago

Hi, yes @AdamJel, thanks for the suggestion. I did try that, but it seems that resolution is performed only after information has been downloaded from all sources, so the resulting times are the same even if the packages are present in the first source.

I did the test with respect-source-order like this:

[[tool.pdm.source]]
url = "https://pypi.org/simple"
name = "pypi"
verify_ssl = true

[[tool.pdm.source]]
type = "index-url"
url = "https://download.pytorch.org/whl/cpu/torch_stable.html"
name = "pytorch"

[tool.pdm.resolution]
respect-source-order = true

frostming commented 1 year ago

@ferminho It is due to the nature of the resolution process and needs non-trivial effort to change it.

The finder collects packages, and the resolver decides whether a particular package matches. So a finder can't decide to stop collecting by itself. A practical example: version 1 is on the private index and version 2 is on the fallback index. If the current dependency set only accepts version 2 but the package finder stops before finding it, the resolution fails.
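
The example above can be sketched in a few lines. This is a toy model, not PDM's actual finder or resolver: the two in-memory indexes, the package name `mypkg`, and the functions `find_first_hit`, `find_all`, and `resolve` are all hypothetical and exist only to show why stopping at the first source that knows a package can break resolution.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory "indexes": package name -> available versions.
PRIVATE_INDEX: Dict[str, List[int]] = {"mypkg": [1]}
FALLBACK_INDEX: Dict[str, List[int]] = {"mypkg": [2]}
SOURCES = [PRIVATE_INDEX, FALLBACK_INDEX]


def find_first_hit(name: str) -> List[int]:
    """Stop at the first source that knows the package (the requested shortcut)."""
    for index in SOURCES:
        if name in index:
            return list(index[name])
    return []


def find_all(name: str) -> List[int]:
    """Collect candidates from every source and leave the choice to the resolver."""
    candidates: List[int] = []
    for index in SOURCES:
        candidates.extend(index.get(name, []))
    return candidates


def resolve(candidates: List[int], required: int) -> Optional[int]:
    """Toy resolver: accept only the exact required version, else fail with None."""
    return required if required in candidates else None


early = resolve(find_first_hit("mypkg"), required=2)
full = resolve(find_all("mypkg"), required=2)
print(early)  # None: the finder stopped at the private index and never saw version 2
print(full)   # 2: all sources were consulted, so the resolver can pick version 2
```

The shortcut saves requests, but only the finder that queries every source gives the resolver enough candidates to succeed in this scenario.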

ferminho commented 1 year ago

Thanks for the insights @frostming, it indeed doesn't seem easy to do what I proposed. What about being able to specify a limited set of packages for a source, so the finder filters the packages to search for?

If you think that would be a viable option, I am willing to help, I can try to do it.
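
To make the idea concrete, here is a minimal sketch of what such a filter could look like. It is only an illustration under assumptions: the `Source` class, its `include` and `exclude` fields, and `sources_for` are hypothetical names, not an existing PDM API.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List


@dataclass
class Source:
    """Hypothetical source entry with glob patterns for the packages it serves."""
    name: str
    include: List[str] = field(default_factory=lambda: ["*"])
    exclude: List[str] = field(default_factory=list)

    def serves(self, package: str) -> bool:
        # Compare lowercased names so "Torch" and "torch" are treated alike.
        name = package.lower()
        included = any(fnmatchcase(name, pattern) for pattern in self.include)
        excluded = any(fnmatchcase(name, pattern) for pattern in self.exclude)
        return included and not excluded


SOURCES = [
    Source(name="pypi", exclude=["torch", "torchvision"]),
    Source(name="pytorch", include=["torch", "torchvision"]),
]


def sources_for(package: str) -> List[str]:
    """Return the names of the sources the finder would query for this package."""
    return [source.name for source in SOURCES if source.serves(package)]


print(sources_for("numpy"))  # ['pypi']
print(sources_for("torch"))  # ['pytorch']
```

With a filter like this, the slow PyTorch source would only be queried for the two packages bound to it, and every other dependency would go to PyPI alone.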

sanmai-NL commented 1 year ago

See also https://github.com/pdm-project/pdm/issues/1645.

martolini commented 1 year ago

I'm facing the same problem, as torch has decided to ship their CPU version on a custom index. Resolving all packages against their index is super-slow. Did you find a workaround? Right now all our CI systems and whatnot are resolving against the super-slow torch index 😭

ferminho commented 1 year ago

We didn't find a proper workaround, so we are (mostly) still stuck with the pip-managed Databricks environment. However, we know this won't work for us when we want to run production code on the platform.

baggiponte commented 1 year ago

Would only like to point out that "use extra sources only when dependency not present in PyPI" is not optimal in terms of security. A malicious actor could upload a package with the same name as your private one to the public PyPI, but with a greater version, and the dependency manager would install it: see here.
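
A rough sketch of the scenario, assuming a toy finder that merges candidates from all sources and prefers the highest version. The package name `internal-lib`, the version numbers, and `best_candidate` are made up for illustration and do not describe how any real tool selects packages.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]
Index = Dict[str, List[Version]]

# Hypothetical indexes: the attacker uploads an inflated version to the public one.
PUBLIC: Index = {"internal-lib": [(99, 0, 0)]}
PRIVATE: Index = {"internal-lib": [(1, 2, 0)]}


def best_candidate(name: str, sources: Dict[str, Index]) -> Tuple[str, Version]:
    """Pick the highest version of a package across the given sources."""
    candidates = [
        (version, source_name)
        for source_name, index in sources.items()
        for version in index.get(name, [])
    ]
    version, source_name = max(candidates)  # tuples compare by version first
    return source_name, version


merged = best_candidate("internal-lib", {"public": PUBLIC, "private": PRIVATE})
pinned = best_candidate("internal-lib", {"private": PRIVATE})
print(merged)  # ('public', (99, 0, 0)): the attacker's upload wins
print(pinned)  # ('private', (1, 2, 0)): binding the package to its source avoids this
```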

ferminho commented 1 year ago

Good point, I agree. Binding specific packages to specific sources, as suggested in #1645, indeed seems like the best solution without these kinds of security concerns, so I'm going to close the issue to focus on the other thread 👍