Open harshad16 opened 2 years ago
/priority important-soon /sig devsecops
Report on the diagnosis of the OOM failures happening in solvers:
Solver execution extracts various information from a package and its dependencies. One aspect is checking whether the package is installable, which is done via this function: https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L93
The installation is checked by installing the package via pip in the virtualenv. The command to install the package via pip is generated and executed via thoth-analyzer. https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L110 https://github.com/thoth-station/analyzer/blob/ad12a1ed76ff6aa1606dae3efb47e3bb8d5af61f/thoth/analyzer/command.py#L99
cmd: "Running command 'venv/bin/python3 -m pip install --force-reinstall --no-cache-dir --no-deps torch===1.12.1+cu113 --index-url \"https://download.pytorch.org/whl/cu113\" --trusted-host download.pytorch.org'"
The index URL reaches pip wrapped in literal double quotes, so pip treats it as invalid and the command fails:
WARNING: The index url ""https://download.pytorch.org/whl/cu113"" seems invalid, please provide a scheme.
Looking in indexes: "https://download.pytorch.org/whl/cu113"
WARNING: Location '"https://download.pytorch.org/whl/cu113"/torch/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement torch===1.12.1+cu113 (from versions: none)
ERROR: No matching distribution found for torch===1.12.1+cu113
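The warnings above suggest the quotes around the index URL are passed to pip as part of the argument itself. A minimal sketch of one way to avoid that is to build the command as an argument list instead of a pre-quoted string. The helper name `build_pip_install_args` is hypothetical and is not part of thoth-solver or thoth-analyzer; how the real command is assembled there is not shown here.

```python
import shlex
from urllib.parse import urlparse


def build_pip_install_args(python_bin, package, version, index_url):
    """Build a pip install command as a list of arguments.

    Hypothetical helper for illustration only. Because each argument is a
    separate list element, no quoting is embedded in the index URL.
    """
    trusted_host = urlparse(index_url).hostname
    return [
        python_bin, "-m", "pip", "install",
        "--force-reinstall", "--no-cache-dir", "--no-deps",
        f"{package}==={version}",
        "--index-url", index_url,
        "--trusted-host", trusted_host,
    ]


args = build_pip_install_args(
    "venv/bin/python3",
    "torch",
    "1.12.1+cu113",
    "https://download.pytorch.org/whl/cu113",
)

# shlex.join gives a shell-safe string for logging; the list itself is what
# would be handed to the process runner.
printable = shlex.join(args)
print(printable)
```

Passing a list keeps logging and execution separate: the string form is only for display, so quoting added for readability can never leak into the argument pip receives.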
If the command is invalid, the process fails and the exception is caught.
If the command is valid, however, it may simply never complete, and the delegator keeps waiting.
For example, when the solver is executed for the package roundup===2.1.0:
cmd: "python3 -m pip install --force-reinstall --no-cache-dir --no-deps roundup===2.1.0 --index-url https://pypi.org/simple --trusted-host pypi.org"
This command, extracted while debugging, does not finish in time, so the solver execution is dropped in the cluster due to the timeout.
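To make a hanging install visible rather than waiting until the cluster drops the workload, the command could be run with an explicit timeout. The following is a minimal sketch using only the Python standard library. The function name `run_with_timeout` is hypothetical, and this is not how thoth-analyzer currently runs commands.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a command and return its exit code, or None if it timed out.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires and raises subprocess.TimeoutExpired.
    """
    try:
        completed = subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Stand-in for a command that never finishes in time.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
if run_with_timeout(slow, timeout_seconds=0.5) is None:
    print("command timed out")
```

A `None` result distinguishes "did not finish" from "failed", which would let the solver record the timeout as its own kind of failure.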
A solver that is able to execute the installation can still fail when the package is large, like torch.
Extracting the _hashes for that package's artifacts consumes all the CPU.
Executing this function uses all of the allotted CPU, i.e. 100m:
https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229
Although CPU throttling might not be the reason for the OOM kill, it is one aspect that was found.
It seems that extracting the artifacts for hashes is somehow causing a memory leak.
Solving the following bits might fix the failed solver executions and provide more information on the OOM-failed solvers.
As speculated above: the function fill_hashes https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229 gathers hashes from the artifacts: https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/source.py#L441 which are downloaded to tmp files https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/artifact.py#L59
As our current memory limit is 768Mi, this memory is likely consumed by the size of the download. I will verify this by replicating it.
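If replication confirms that memory grows with artifact size, one option is to hash a downloaded file in fixed-size chunks so that memory use does not depend on how large the artifact is. This is a minimal sketch under that assumption. The function name `sha256_of_file` and the 1 MiB chunk size are illustrative, and it does not reflect how thoth-python currently computes hashes.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 of a file while holding one chunk in memory.

    Hypothetical helper: reads chunk_size bytes at a time until EOF.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage with a temporary file standing in for a downloaded artifact.
payload = b"example artifact content" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    artifact_path = tmp.name

try:
    print(sha256_of_file(artifact_path))
finally:
    os.remove(artifact_path)
```

If the download itself is also streamed to disk in chunks, peak memory would be bounded by the chunk size rather than by the wheel size, which matters for artifacts like torch under a 768Mi limit.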
suggestion:
Describe the bug
The solver solves packages based on the index_url (for example pypi/simple) and further resolves the dependencies. As the dependencies of many packages present in one index_url can be found on a different index_url, the logic currently provides all the dependency index URLs for resolution.
Solvers run with all the dependency index URLs fail to execute and run into OOM.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Successful execution of solvers.
Acceptance criteria