Closed pacospace closed 2 years ago
/priority important-soon /assign @codificat /triage accepted
any update on this?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
In the last integration test runs for aws-prod there are errors in some of the ps-* tests (ps-cv-{pytorch,tensorflow} and ps-nlp-tensorflow) due to timeouts:
2022-05-02 03:41:12,899 thoth.adviser.run ERROR: Child exited with exit code 10
2022-05-02 03:25:01,696 thoth.adviser.run ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded
Other related integration tests succeeded.
In the last run of integration tests for smaug-prod, ps-* tests failed with HTTP 400 codes (bad request), e.g.
Then I ask for an advise for the cloned application for runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis (52.758s)
Error Message
Traceback (most recent call last):
File "/opt/app-root/lib64/python3.8/site-packages/behave/model.py", line 1329, in run
match.run(runner.context)
File "/opt/app-root/lib64/python3.8/site-packages/behave/matchers.py", line 98, in run
self.func(context, *args, **kwargs)
File "features/steps/advise.py", line 248, in step_impl
results = advise_using_config(
File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 397, in advise_using_config
return advise(
File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 118, in wrapper
result = func(api_client, *args, **kwargs)
File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 583, in advise
response = _retrieve_analysis_result(
File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 276, in _retrieve_analysis_result
return retrieve_func(analysis_id)
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 53, in get_advise_python
(data) = self.get_advise_python_with_http_info(analysis_id, **kwargs) # noqa: E501
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 112, in get_advise_python_with_http_info
return self.api_client.call_api(
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 316, in call_api
return self.__call_api(resource_path, method,
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 148, in __call_api
response_data = self.request(
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 338, in request
return self.rest_client.GET(url,
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 228, in GET
return self.request("GET", url,
File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 222, in request
raise ApiException(http_resp=r)
thamos.swagger_client.rest.ApiException: (400)
Reason: BAD REQUEST
HTTP response headers: HTTPHeaderDict({'server': 'gunicorn', 'date': 'Thu, 28 Apr 2022 01:05:55 GMT', 'content-type': 'application/json', 'content-length': '272', 'x-thoth-version': '0.34.14', 'x-user-api-service-version': '0.34.14+messaging.0.16.0.storages.0.71.1.common.0.36.0.python.0.16.9', 'x-thoth-search-ui-url': 'https://thoth-station.ninja/search/', 'access-control-allow-origin': '*', 'set-cookie': '829f3dbab311aaac0d90f580d731991c=d36e665b294c43e30415dbb1b2323809; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: b'{\n "error": "Analysis was not successful",\n "parameters": {\n "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n },\n "status": {\n "finished_at": "2022-04-28T01:05:48Z",\n "reason": null,\n "started_at": "2022-04-28T01:05:03Z",\n "state": "error"\n }\n}\n'
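Note that the 400 here wraps the adviser status rather than signalling a malformed request: the body reports an analysis that ran and ended in the error state. A minimal sketch for pulling the useful fields out of such a body is below. The function name is hypothetical, and how the body is obtained from the exception (swagger-generated clients usually expose it as an attribute of ApiException) is an assumption not verified against thamos here; the bytes are copied from the response above.

```python
import json
from datetime import datetime

# Response body copied verbatim from the failing request above.
RAW_BODY = b'{\n "error": "Analysis was not successful",\n "parameters": {\n "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n },\n "status": {\n "finished_at": "2022-04-28T01:05:48Z",\n "reason": null,\n "started_at": "2022-04-28T01:05:03Z",\n "state": "error"\n }\n}\n'


def summarize_failed_advise(raw_body: bytes) -> dict:
    """Extract the analysis id, final state and run time from an error body.

    Hypothetical helper, not part of thamos.
    """
    payload = json.loads(raw_body)
    status = payload["status"]
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    started = datetime.strptime(status["started_at"], fmt)
    finished = datetime.strptime(status["finished_at"], fmt)
    return {
        "analysis_id": payload["parameters"]["analysis_id"],
        "state": status["state"],
        "reason": status["reason"],  # null in the response above
        "duration_seconds": (finished - started).total_seconds(),
    }


summary = summarize_failed_advise(RAW_BODY)
print(summary)
```

The analysis id is the piece to keep: it is what the adviser logs for the failed run are keyed by.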
/milestone OKR review Q2 2022 /sig user-experience
/remove-sig user-experience /sig stack-guidance
because there are issues resolving the stacks here
The last integration-tests report (Integration tests update for ocp4-stage (2022-05-03 version 0.11.2)) has the following scenarios failing:
ps-nlp-tensorflow
ps-nlp-pytorch
ps-cv-tensorflow
All of them use the latest recommendation type. The predictor used in the adviser implementation in those cases uses "hops": it randomly takes some path in the resolution process if the latest versions alone cannot be resolved. It might be that this implementation is not perfect in these cases and it would be better to provide an implementation that uses backtracking (similar to pip, but offline, using the dependency information from the database - see https://github.com/thoth-station/adviser/issues/2329).
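To make the backtracking idea concrete, here is a minimal sketch of a resolver that tries the newest version first and steps back when a later requirement conflicts with an earlier choice. It is not the adviser's implementation or pip's: the dependency data, package names and the resolve function are all made up for illustration, and version constraints are simplified to explicit lists of allowed versions.

```python
from typing import Dict, List, Optional, Tuple

Requirement = Tuple[str, List[str]]  # (package name, allowed versions, newest first)

# Hypothetical offline dependency data: package -> version -> direct requirements.
# None of this is real Thoth data.
DEPENDENCY_DB: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "app": {"1.0": {"lib-a": ["2.0", "1.0"], "lib-b": ["1.0"]}},
    "lib-a": {"2.0": {"lib-c": ["2.0"]}, "1.0": {"lib-c": ["1.0"]}},
    "lib-b": {"1.0": {"lib-c": ["1.0"]}},
    "lib-c": {"2.0": {}, "1.0": {}},
}


def resolve(
    requirements: List[Requirement],
    pinned: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return a package -> version mapping satisfying all requirements, or None."""
    pinned = dict(pinned or {})
    if not requirements:
        return pinned
    (name, allowed), rest = requirements[0], requirements[1:]
    if name in pinned:
        # Already chosen: the earlier choice must satisfy this requirement too.
        return resolve(rest, pinned) if pinned[name] in allowed else None
    for version in allowed:  # newest first, mimicking a "latest" preference
        if version not in DEPENDENCY_DB.get(name, {}):
            continue  # no dependency information for this version
        deps = list(DEPENDENCY_DB[name][version].items())
        result = resolve(rest + deps, {**pinned, name: version})
        if result is not None:
            return result
        # Otherwise fall through and try the next (older) version: backtrack.
    return None


print(resolve([("app", ["1.0"])]))
```

In this toy data lib-a 2.0 needs lib-c 2.0 while lib-b 1.0 needs lib-c 1.0, so the resolver has to step back from lib-a 2.0 to lib-a 1.0. A random hop with no way back could miss that combination.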
These issues are also consistent with the solving error described in https://github.com/thoth-station/integration-tests/issues/266#issuecomment-1060822639. Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard, so it fails to obtain dependency information (this was observed for some versions). This behaviour is not very nice, but Python packaging supports it. It supports the first paragraph: adviser might be failing to find suitable versions when the latest recommendation type is used.
To introspect what is happening here, we might:
- try the stable recommendation type, which uses a resolution algorithm based on reinforcement learning, and see if it finds a resolution
- also try using user stack scoring and see how the resolver behaves with specific versions of libraries, to narrow down the possible culprit
Tested with stable recommendation type:
Tested with latest recommendation type without jupyter-tensorboard package in the stack:
Tested with latest recommendation type and jupyter-tensorboard==0.1.1 (solvable using our solver):
Possible fixes:
> Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.
This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right? Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?
> > Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.
>
> This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?
Generally, no - we are not able to solve libraries that have hard requirements on the environment that are not met in our solvers. Ideally, jupyter-tensorboard should not depend on the environment and execute code during the installation process - or at least should not make it a hard requirement (if the registration fails, the installed package can still be present).
This might get better over time as Python packaging evolves (and provides static wheel metadata).
> Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?
The versions that were removed are listed in the stack info provided to the user:
"The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' were removed due to installation issues in the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"
Thoth also suggested using it, for example in the first successful resolution with stable recommendation type:
- ps-nlp-tensorflow succeeded - see results
And for others, it looks like it failed as it did not find any resolution in the allocated time.
ack, thanks for the explanation.
/remove-label human_intervention_required
@codificat: The label(s) /remove-label human_intervention_required cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected
The integration tests in stage are suffering from cluster issues that have been going on for a while and are expected to take some more time to fix.
Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml). Other recommendation types had a few failures.
The recommendations that are failing fail with the following message:
Resolver did not find any stack that would satisfy requirements and stack characteristics given the time allocated - see https://thoth-station.ninja/j/no_stack
Below is the current status with each stack.
overlay | type | result | advise ID | time |
---|---|---|---|---|
ps-nlp | latest | success | adviser-220613143911-3483a20bdb243903 | 49s |
ps-nlp-tensorflow | latest | success | adviser-220613144121-b52e7ffa560ce6c6 | 1m 47s |
ps-nlp-tensorflow-gpu | latest | success | adviser-220613144345-97b1d4046d403a80 | 2m 4s |
ps-nlp-pytorch | latest | success | adviser-220613115118-b406e285a0ae8618 | 2m 2s |
ps-nlp | stable | success | adviser-220614064546-327778099d10c008 | 34m 35s |
ps-nlp-tensorflow | stable | failure | adviser-220614075651-ef920679375c9a8f | 26m 7s |
ps-nlp-tensorflow-gpu | stable | success | adviser-220614160559-2f1a80945ee22db5 | 26m 30s |
ps-nlp-pytorch | stable | success | adviser-220614165324-519f4729bc0fbb77 | 26m 30s |
ps-nlp | security | success | adviser-220614110005-7b9c92d2284d37fc | 16m 40s |
ps-nlp-tensorflow | security | success | adviser-220614084155-3ca5965c25ece6d8 | 19m 33s |
ps-nlp-tensorflow-gpu | security | success | adviser-220614172749-ea72fedac3595022 | 17m 55s |
ps-nlp-pytorch | security | success | adviser-220614120043-6e7ea342826ad597 | 25m 23s |
ps-nlp | performance | success | adviser-220614072310-4fe6535e416419a8 | 26m 22s |
ps-nlp-tensorflow | performance | success | adviser-220614134333-d76be662fa33319a | 26m 29s |
ps-nlp-tensorflow-gpu | performance | failure | adviser-220614153715-1fc19007f995c727 | 26m 9s |
ps-nlp-pytorch | performance | failure | adviser-220614150618-eba5b4e3183fd0f2 | 26m 14s |
overlay | type | result | advise ID | time |
---|---|---|---|---|
ps-cv-ocr | latest | success | adviser-220613144932-7c569b4d4585fd54 | 22s |
ps-cv-tensorflow | latest | success | adviser-220613145241-6d5d7bdc27ac3a3a | 1m 18s |
ps-cv-pytorch | latest | success | adviser-220613145031-fcb0a951d8adb577 | 1m 44s |
ps-cv-ocr | stable | success | adviser-220613184700-1f90f53ccf4159b1 | 2m 24s |
ps-cv-tensorflow | stable | failure | adviser-220613162944-6d3bf6b86a373e6d | 22m 49s |
ps-cv-pytorch | stable | failure | adviser-220613180701-d348b4ef9c9b3e87 | 26m 17s |
ps-cv-ocr | performance | success | adviser-220613185358-fb8309ed55dd32d9 | 2m 7s |
ps-cv-tensorflow | performance | failure | adviser-220613210546-7078358b1944fd96 | 26m 8s |
ps-cv-pytorch | performance | failure | adviser-220613192843-85146a8c21afc8ee | 27m 3s |
ps-cv-ocr | security | success | adviser-220614183752-2991f946845d4af | 27s |
ps-cv-tensorflow | security | failure | adviser-220614180129-becf1431eab85efe | 27m 46s |
ps-cv-pytorch | security | failure | adviser-220614183855-da15ed8545b4868 | 26m 13s |
overlay | type | result | advise ID | time |
---|---|---|---|---|
ps-ip-ifd | latest | success | adviser-220613145447-b8f73428af85d2bc | 31s |
ps-ip-ifd | stable | success | adviser-220613160047-b5ae72918b1150b7 | 20m 56s |
ps-ip-ifd | performance | success | adviser-220613185732-2c9fc59a216df36a | 23m 17s |
ps-ip-ifd | security | success | adviser-220614174631-d062f02ad3a261bf | 54s |
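To get a per-type summary out of tables like these, a small tally script can help. The sketch below is only an illustration: the tally function is a hypothetical helper, and the rows are the ps-cv table above copied as-is.

```python
from collections import Counter

# Rows copied from the ps-cv table above.
PS_CV_ROWS = """\
ps-cv-ocr | latest | success | adviser-220613144932-7c569b4d4585fd54 | 22s |
ps-cv-tensorflow | latest | success | adviser-220613145241-6d5d7bdc27ac3a3a | 1m 18s |
ps-cv-pytorch | latest | success | adviser-220613145031-fcb0a951d8adb577 | 1m 44s |
ps-cv-ocr | stable | success | adviser-220613184700-1f90f53ccf4159b1 | 2m 24s |
ps-cv-tensorflow | stable | failure | adviser-220613162944-6d3bf6b86a373e6d | 22m 49s |
ps-cv-pytorch | stable | failure | adviser-220613180701-d348b4ef9c9b3e87 | 26m 17s |
ps-cv-ocr | performance | success | adviser-220613185358-fb8309ed55dd32d9 | 2m 7s |
ps-cv-tensorflow | performance | failure | adviser-220613210546-7078358b1944fd96 | 26m 8s |
ps-cv-pytorch | performance | failure | adviser-220613192843-85146a8c21afc8ee | 27m 3s |
ps-cv-ocr | security | success | adviser-220614183752-2991f946845d4af | 27s |
ps-cv-tensorflow | security | failure | adviser-220614180129-becf1431eab85efe | 27m 46s |
ps-cv-pytorch | security | failure | adviser-220614183855-da15ed8545b4868 | 26m 13s |
"""


def tally(rows: str) -> Counter:
    """Count (recommendation type, result) pairs in pipe-separated rows."""
    counts: Counter = Counter()
    for line in rows.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        rec_type, result = cells[1], cells[2]
        counts[(rec_type, result)] += 1
    return counts


counts = tally(PS_CV_ROWS)
for (rec_type, result), n in sorted(counts.items()):
    print(f"{rec_type:12} {result:8} {n}")
```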
> Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml).
Based on this, I believe we can /close this one as complete.
We still need to ensure that the integration tests, which include checks for successful advice on the predictable stacks, run successfully (e.g. https://github.com/thoth-station/integration-tests/issues/324), and possibly review the justifications for the failures on some combinations of stack/type.
These are tracked in separate issues as appropriate.
@codificat: Closing this issue.
Describe the bug As a user of Thoth PS images,
I want to have continuous updates on software stacks to be maintained by Thoth services.
To Reproduce Steps to reproduce the behavior:
Expected behavior All ps-* stacks can be advised by Thoth (all integration tests are green for ps-stacks: https://github.com/thoth-station/integration-tests/issues/204)
Screenshots
Additional context ps-*: