thoth-station / thoth-application

Thoth-Station ArgoCD Applications
GNU General Public License v3.0

[spike] solver failing due to OOM #2690

Open harshad16 opened 1 year ago

harshad16 commented 1 year ago

Describe the bug

The solver resolves packages based on the index_url (for example, pypi/simple) and then resolves their dependencies. Because the dependencies of many packages on one index_url can be found on a different index_url, the current logic provides all the dependency index URLs for resolution.

Solvers that are given all the dependency index URLs fail to execute and run into OOM.

Screenshot from 2022-10-27 15-19-06 Screenshot from 2022-10-27 15-18-25

To Reproduce

Steps to reproduce the behavior:

  1. Go to thoth-middlertier-stage namespace in cluster
  2. Click on solver pods
  3. See error

Expected behavior

Successful execution of solvers.

Acceptance criteria

harshad16 commented 1 year ago

/priority important-soon /sig devsecops

harshad16 commented 1 year ago

Report on the diagnosis of the OOM failures happening in solvers:

Solver execution extracts various information from a package and its dependencies. One aspect is to check whether the package is installable, which is done via the function at: https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L93

Installability is checked by installing the package via pip in the virtualenv. The command to install the package via pip is generated and executed via thoth-analyzer: https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L110 https://github.com/thoth-station/analyzer/blob/ad12a1ed76ff6aa1606dae3efb47e3bb8d5af61f/thoth/analyzer/command.py#L99

  1. The constructed command has a quotation issue: the index URL is wrapped in literal quote characters. Generated in debug mode: cmd: "Running command 'venv/bin/python3 -m pip install --force-reinstall --no-cache-dir --no-deps torch===1.12.1+cu113 --index-url \"https://download.pytorch.org/whl/cu113\" --trusted-host download.pytorch.org'"

This causes the cmd execution to fail:

WARNING: The index url ""https://download.pytorch.org/whl/cu113"" seems invalid, please provide a scheme.
Looking in indexes: "https://download.pytorch.org/whl/cu113"
WARNING: Location '"https://download.pytorch.org/whl/cu113"/torch/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement torch===1.12.1+cu113 (from versions: none)
ERROR: No matching distribution found for torch===1.12.1+cu113
  2. If the command is invalid, the process fails and the exception is caught. However, if the command is valid, it just does not complete the execution and the delegator keeps waiting. For example, when the solver is executed for package roundup===2.1.0, the cmd "python3 -m pip install --force-reinstall --no-cache-dir --no-deps roundup===2.1.0 --index-url https://pypi.org/simple --trusted-host pypi.org" (extracted from the debug method) does not finish in time, so the solver execution in the cluster is dropped due to timeout.

  3. The solver is able to execute the installation, but the package is large, like torch. The extraction of the _hashes for that package's artifacts consumes all the CPU. (Screenshot from 2022-11-29 14-43-04) Execution of this function consumes all the CPU allotted, i.e. 100m: https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229 Although CPU throttling might not be the reason for the OOM kill, it is one aspect found. It seems the extraction of artifacts for hashes is somehow causing a memory leak.

    Solving these bits might fix the execution of the failed solvers and provide more information on the solvers that fail due to OOM.
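For illustration, a minimal sketch of how points 1 and 2 could be approached, assuming the command can be passed as an argument list instead of a single string. The helpers `build_pip_install_args` and `run_with_timeout` are hypothetical and are not part of thoth-solver or thoth-analyzer; only the pip flags are taken from the debug output above. The last lines show one way literal quotes can end up inside an argument: naive whitespace splitting keeps them, while `shlex.split` consumes them as a POSIX shell would.

```python
import shlex
import subprocess
import sys


def build_pip_install_args(python_bin, requirement, index_url, trusted_host):
    """Hypothetical helper: return the pip install command as an argument list."""
    return [
        python_bin, "-m", "pip", "install",
        "--force-reinstall", "--no-cache-dir", "--no-deps",
        requirement,
        "--index-url", index_url,  # passed verbatim, no quote characters added
        "--trusted-host", trusted_host,
    ]


def run_with_timeout(args, timeout_seconds):
    """Hypothetical helper: run args and return (returncode, timed_out).

    subprocess.run kills the child process when the timeout expires,
    so the caller is never left waiting indefinitely.
    """
    try:
        completed = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return None, True
    return completed.returncode, False


# How literal quotes can leak into an argument when a command string is split.
cmd = 'pip install torch --index-url "https://download.pytorch.org/whl/cu113"'
naive_tokens = cmd.split()            # last token still contains the quotes
shell_like_tokens = shlex.split(cmd)  # quotes are consumed during splitting
```

With an argument list there is no shell-style parsing step, so the index URL reaches pip exactly as given, and a bounded timeout turns a hanging install into an explicit failure that can be logged.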

harshad16 commented 1 year ago

As speculated above: the function fill_hashes https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229 gathers hashes from the artifacts: https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/source.py#L441 which are downloaded to tmp files: https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/artifact.py#L59

As our current memory limit is 768Mi, this memory appears to be consumed by the size of the download. I will verify this by replicating it.
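As a point of comparison, here is a minimal sketch of hashing a downloaded artifact in fixed-size chunks, so that memory use stays roughly constant regardless of artifact size. This is not the code in thoth-python; `sha256_of_file` is a hypothetical helper, and whether the linked code reads artifacts whole has not been confirmed here.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of the file at path, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage on a throwaway file standing in for a downloaded artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024 + 17))
    artifact_path = tmp.name
try:
    streamed_digest = sha256_of_file(artifact_path)
finally:
    os.remove(artifact_path)
```

The chunk size of 1 MiB is an arbitrary choice for the sketch; any value well below the container memory limit keeps the hashing step itself from scaling with the artifact size.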
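One possible way to replicate this locally is to compare the peak Python-level allocation of reading a file whole against reading it in chunks. The sketch below uses the standard `tracemalloc` module; `peak_traced_bytes`, `read_whole`, and `read_in_chunks` are hypothetical names for illustration. Note that `tracemalloc` only tracks allocations made through Python's allocators, not the resident memory the cluster accounts against the 768Mi limit, so this is an approximation.

```python
import os
import tempfile
import tracemalloc


def peak_traced_bytes(func, *args):
    """Return the peak Python-level allocation (in bytes) while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_whole(path):
    """Read the entire file into memory at once."""
    with open(path, "rb") as handle:
        return len(handle.read())


def read_in_chunks(path, chunk_size=64 * 1024):
    """Read the file chunk by chunk, holding only one chunk at a time."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


# A 16 MiB throwaway file stands in for a large downloaded artifact.
size = 16 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * size)
    path = tmp.name
try:
    whole_peak = peak_traced_bytes(read_whole, path)
    chunked_peak = peak_traced_bytes(read_in_chunks, path)
finally:
    os.remove(path)
```

If the whole-file peak tracks the artifact size while the chunked peak stays small, that would support the idea that memory use follows the download size for large packages like torch.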

harshad16 commented 1 year ago

suggestion: