[5pt] Make sure ps-stacks can receive recommendation from Thoth

pacospace commented 3 years ago

Describe the bug As a user of Thoth PS images,

I want to have continuous updates on software stacks to be maintained by Thoth services.

To Reproduce Steps to reproduce the behavior:

Run thamos advise on all ps-stacks

Expected behavior All ps-* stacks can be advised by Thoth (all integration tests are green for ps-stacks: https://github.com/thoth-station/integration-tests/issues/204)

Screenshots

Additional context ps-*:

goern commented 3 years ago

/priority important-soon /assign @codificat /triage accepted

goern commented 2 years ago

any update on this?

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

codificat commented 2 years ago

/remove-lifecycle stale

codificat commented 2 years ago

In the last integration test runs for aws-prod, some of the ps-* tests (ps-cv-{pytorch,tensorflow} and ps-nlp-tensorflow) fail due to timeouts:

2022-05-02 03:41:12,899 thoth.adviser.run           ERROR: Child exited with exit code 10
2022-05-02 03:25:01,696 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Other related integration tests succeeded.

In the last run of integration tests for smaug-prod, ps-* tests failed with HTTP 400 codes (bad request), e.g.

Then I ask for an advise for the cloned application for runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis (52.758s) 
Error Message

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/behave/model.py", line 1329, in run
    match.run(runner.context)
  File "/opt/app-root/lib64/python3.8/site-packages/behave/matchers.py", line 98, in run
    self.func(context, *args, **kwargs)
  File "features/steps/advise.py", line 248, in step_impl
    results = advise_using_config(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 397, in advise_using_config
    return advise(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 118, in wrapper
    result = func(api_client, *args, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 583, in advise
    response = _retrieve_analysis_result(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 276, in _retrieve_analysis_result
    return retrieve_func(analysis_id)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 53, in get_advise_python
    (data) = self.get_advise_python_with_http_info(analysis_id, **kwargs)  # noqa: E501
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 112, in get_advise_python_with_http_info
    return self.api_client.call_api(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 316, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 148, in __call_api
    response_data = self.request(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 338, in request
    return self.rest_client.GET(url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 228, in GET
    return self.request("GET", url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
thamos.swagger_client.rest.ApiException: (400)
Reason: BAD REQUEST
HTTP response headers: HTTPHeaderDict({'server': 'gunicorn', 'date': 'Thu, 28 Apr 2022 01:05:55 GMT', 'content-type': 'application/json', 'content-length': '272', 'x-thoth-version': '0.34.14', 'x-user-api-service-version': '0.34.14+messaging.0.16.0.storages.0.71.1.common.0.36.0.python.0.16.9', 'x-thoth-search-ui-url': 'https://thoth-station.ninja/search/', 'access-control-allow-origin': '*', 'set-cookie': '829f3dbab311aaac0d90f580d731991c=d36e665b294c43e30415dbb1b2323809; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'
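
For triage, the response body above can be decoded to pull out the analysis ID, the final state and the run time. This is a minimal sketch using only the standard library; `summarize_failure` is a hypothetical helper written for illustration and is not part of thamos, and the body is copied from the traceback:

```python
import json
from datetime import datetime

# Response body copied verbatim from the ApiException above.
body = b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'


def summarize_failure(raw: bytes) -> dict:
    """Extract the analysis id, final state and run time from an error body."""
    payload = json.loads(raw)
    status = payload["status"]
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    started = datetime.strptime(status["started_at"], fmt)
    finished = datetime.strptime(status["finished_at"], fmt)
    return {
        "analysis_id": payload["parameters"]["analysis_id"],
        "state": status["state"],
        "reason": status["reason"],  # JSON null becomes None
        "seconds": int((finished - started).total_seconds()),
    }


summary = summarize_failure(body)
print(summary)
```

The body reports state "error" with a null reason, so the 400 here appears to wrap a failed adviser run rather than a malformed request.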

codificat commented 2 years ago

/milestone OKR review Q2 2022 /sig user-experience

codificat commented 2 years ago

/remove-sig user-experience /sig stack-guidance

because there are issues resolving the stacks here

fridex commented 2 years ago

The last integration-tests report (Integration tests update for ocp4-stage (2022-05-03 version 0.11.2)) has the following scenarios failing:

ps-nlp-tensorflow
ps-nlp-pytorch
ps-cv-tensorflow

All of them use the latest recommendation type. The predictor that adviser uses in those cases relies on "hops": it randomly takes some path in the resolution process if the latest versions alone cannot be resolved. This implementation might not be ideal in these cases, and it would be better to provide one that uses backtracking (similar to pip, but offline, using the dependency information from the database - see https://github.com/thoth-station/adviser/issues/2329).
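
To illustrate the difference between taking a random path and backtracking, here is a toy sketch of backtracking resolution over offline dependency data. The `INDEX` contents and the `resolve` function are hypothetical; this is neither adviser's predictor nor pip's resolver. The point is only that on a conflict the search returns to the most recent choice and tries the next candidate version.

```python
from typing import Dict, List, Optional, Tuple

# A requirement is a package name plus the versions it accepts.
Requirement = Tuple[str, List[str]]

# Hypothetical offline index: package -> version (newest first) -> dependencies.
INDEX: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "app": {"1.0": {"lib": ["2.0", "1.0"], "tool": ["1.0"]}},
    "lib": {"2.0": {"core": ["2.0"]}, "1.0": {"core": ["1.0"]}},
    "tool": {"1.0": {"core": ["1.0"]}},
    "core": {"2.0": {}, "1.0": {}},
}


def resolve(
    todo: List[Requirement], pinned: Dict[str, str]
) -> Optional[Dict[str, str]]:
    """Depth-first resolution that prefers newest versions and backtracks on conflict."""
    if not todo:
        return pinned
    (name, allowed), rest = todo[0], todo[1:]
    if name in pinned:
        # Already chosen earlier: that choice must satisfy this requirement too.
        return resolve(rest, pinned) if pinned[name] in allowed else None
    for version, deps in INDEX.get(name, {}).items():  # newest first
        if version not in allowed:
            continue
        result = resolve(list(deps.items()) + rest, {**pinned, name: version})
        if result is not None:
            return result
    return None  # no candidate works; the caller tries its next version


stack = resolve([("app", ["1.0"])], {})
print(stack)
```

In this toy index, lib 2.0 is tried first, conflicts with tool on core, and the search falls back to lib 1.0 instead of giving up.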

These failures are also consistent with the solving error described in https://github.com/thoth-station/integration-tests/issues/266#issuecomment-1060822639. Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard, so it fails to obtain dependency information (this was observed for some versions). This behaviour is not very nice, but Python packaging allows it. It supports the point in the first paragraph: adviser might be failing to find suitable versions when the latest recommendation type is used.

To investigate what is happening here, we might:

try removing jupyter-tensorboard from the requirements and asking for advice using the latest recommendation type
try running adviser with a different recommendation type, such as stable, which uses a resolution algorithm based on reinforcement learning, and see if it finds a resolution
try manually pinning the version of jupyter-tensorboard (an older version that is solvable by thoth-solver) and see if the resolution process finds a solution even for the latest recommendation type

Also, we can try user stack scoring to see how the resolver behaves with specific versions of libraries and narrow down the likely culprit.

fridex commented 2 years ago

Tested with stable recommendation type:

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch failed - see results
ps-cv-tensorflow failed - see results

fridex commented 2 years ago

Tested with latest recommendation type without jupyter-tensorboard package in the stack:

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch succeeded - see results
ps-cv-tensorflow succeeded - see results

fridex commented 2 years ago

Tested with latest recommendation type and jupyter-tensorboard==0.1.1 (solvable using our solver):

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch succeeded - see results
ps-cv-tensorflow succeeded - see results

fridex commented 2 years ago

Possible fixes:

use jupyter-tensorboard==0.1.1 in all the stacks that use it
remove jupyter-tensorboard (if it is not used)
contact jupyter-tensorboard upstream for a possible fix - so that it does not have hard requirements on packages to be present in the environment during installation
patch jupyter-tensorboard ourselves and host a patched version on our Pulp Python Package Index

harshad16 commented 2 years ago

> Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right? Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

fridex commented 2 years ago

> Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.
>
> This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?

Generally, no - we are not able to solve libraries whose hard requirements on the environment are not met in our solvers. Ideally, jupyter-tensorboard should not depend on the environment or execute code during the installation process, or at least should not make that a hard requirement (if that step fails, the package can still be installed).

This might get better over time as Python packaging evolves (and provides static wheel metadata).

> Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

The stack info provided to the user lists the versions that were removed:

"The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' were removed due to installation issues in the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"

Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:

ps-nlp-tensorflow succeeded - see results
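
The removal message quoted above can be parsed to list the affected versions. This is a minimal sketch that assumes the wording of this one message; the pattern and the `parse_removed` helper are made up for illustration, and the message format is not a documented interface:

```python
import re
from typing import List, Optional, Tuple

# Message copied from the stack info quoted above.
message = (
    "The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' "
    "were removed due to installation issues in the target environment: "
    "0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, "
    "0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"
)

# Assumed pattern, derived only from the single message above.
PATTERN = re.compile(
    r"The following versions of '(?P<package>[^']+)' from '(?P<index>[^']+)' "
    r"were removed due to installation issues in the target environment: "
    r"(?P<versions>.+)"
)


def parse_removed(text: str) -> Optional[Tuple[str, str, List[str]]]:
    """Return (package, index URL, removed versions), or None if the text differs."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    versions = [v.strip() for v in match.group("versions").split(",")]
    return match.group("package"), match.group("index"), versions


package, index_url, removed = parse_removed(message)
print(package, len(removed), "0.1.1" in removed)
```

Version 0.1.1 is not in the removed list, which is consistent with the successful runs using jupyter-tensorboard==0.1.1.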

fridex commented 2 years ago

> Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:
>
> ps-nlp-tensorflow succeeded - see results

And for others, it looks like it failed as it did not find any resolution in the allocated time.

harshad16 commented 2 years ago

ack, thanks for the explanation.

codificat commented 2 years ago

/remove-label human_intervention_required

sesheta commented 2 years ago

@codificat: The label(s) /remove-label human_intervention_required cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected

In response to [this](https://github.com/thoth-station/core/issues/326#issuecomment-1119471640):

> /remove-label human_intervention_required

codificat commented 2 years ago

The integration tests in stage are suffering from cluster issues that have been going on for a while and are expected to take some more time to fix.

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml). Other recommendation types had a few failures.

The failing recommendations report the following message:

Resolver did not find any stack that would satisfy requirements and stack characteristics given the time allocated - see https://thoth-station.ninja/j/no_stack

Below is the current status with each stack.

ps-nlp

overlay	type	result	advise ID	time
ps-nlp	latest	success	adviser-220613143911-3483a20bdb243903	49s
ps-nlp-tensorflow	latest	success	adviser-220613144121-b52e7ffa560ce6c6	1m 47s
ps-nlp-tensorflow-gpu	latest	success	adviser-220613144345-97b1d4046d403a80	2m 4s
ps-nlp-pytorch	latest	success	adviser-220613115118-b406e285a0ae8618	2m 2s
ps-nlp	stable	success	adviser-220614064546-327778099d10c008	34m 35s
ps-nlp-tensorflow	stable	failure	adviser-220614075651-ef920679375c9a8f	26m 7s
ps-nlp-tensorflow-gpu	stable	success	adviser-220614160559-2f1a80945ee22db5	26m 30s
ps-nlp-pytorch	stable	success	adviser-220614165324-519f4729bc0fbb77	26m 30s
ps-nlp	security	success	adviser-220614110005-7b9c92d2284d37fc	16m 40s
ps-nlp-tensorflow	security	success	adviser-220614084155-3ca5965c25ece6d8	19m 33s
ps-nlp-tensorflow-gpu	security	success	adviser-220614172749-ea72fedac3595022	17m 55s
ps-nlp-pytorch	security	success	adviser-220614120043-6e7ea342826ad597	25m 23s
ps-nlp	performance	success	adviser-220614072310-4fe6535e416419a8	26m 22s
ps-nlp-tensorflow	performance	success	adviser-220614134333-d76be662fa33319a	26m 29s
ps-nlp-tensorflow-gpu	performance	failure	adviser-220614153715-1fc19007f995c727	26m 9s
ps-nlp-pytorch	performance	failure	adviser-220614150618-eba5b4e3183fd0f2	26m 14s

ps-cv

overlay	type	result	advise ID	time
ps-cv-ocr	latest	success	adviser-220613144932-7c569b4d4585fd54	22s
ps-cv-tensorflow	latest	success	adviser-220613145241-6d5d7bdc27ac3a3a	1m 18s
ps-cv-pytorch	latest	success	adviser-220613145031-fcb0a951d8adb577	1m 44s
ps-cv-ocr	stable	success	adviser-220613184700-1f90f53ccf4159b1	2m 24s
ps-cv-tensorflow	stable	failure	adviser-220613162944-6d3bf6b86a373e6d	22m 49s
ps-cv-pytorch	stable	failure	adviser-220613180701-d348b4ef9c9b3e87	26m 17s
ps-cv-ocr	performance	success	adviser-220613185358-fb8309ed55dd32d9	2m 7s
ps-cv-tensorflow	performance	failure	adviser-220613210546-7078358b1944fd96	26m 8s
ps-cv-pytorch	performance	failure	adviser-220613192843-85146a8c21afc8ee	27m 3s
ps-cv-ocr	security	success	adviser-220614183752-2991f946845d4af	27s
ps-cv-tensorflow	security	failure	adviser-220614180129-becf1431eab85efe	27m 46s
ps-cv-pytorch	security	failure	adviser-220614183855-da15ed8545b4868	26m 13s

ps-ip

overlay	type	result	advise ID	time
ps-ip-ifd	latest	success	adviser-220613145447-b8f73428af85d2bc	31s
ps-ip-ifd	stable	success	adviser-220613160047-b5ae72918b1150b7	20m 56s
ps-ip-ifd	performance	success	adviser-220613185732-2c9fc59a216df36a	23m 17s
ps-ip-ifd	security	success	adviser-220614174631-d062f02ad3a261bf	54s
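
The per-type results in the tables above can be tallied with a short script. This is a sketch with the overlay names and failures transcribed by hand from the tables; the variable names are arbitrary:

```python
from collections import Counter

TYPES = ["latest", "stable", "security", "performance"]

OVERLAYS = [
    "ps-nlp", "ps-nlp-tensorflow", "ps-nlp-tensorflow-gpu", "ps-nlp-pytorch",
    "ps-cv-ocr", "ps-cv-tensorflow", "ps-cv-pytorch",
    "ps-ip-ifd",
]

# (overlay, recommendation type) pairs reported as "failure" in the tables.
FAILURES = {
    ("ps-nlp-tensorflow", "stable"),
    ("ps-nlp-tensorflow-gpu", "performance"),
    ("ps-nlp-pytorch", "performance"),
    ("ps-cv-tensorflow", "stable"),
    ("ps-cv-tensorflow", "performance"),
    ("ps-cv-tensorflow", "security"),
    ("ps-cv-pytorch", "stable"),
    ("ps-cv-pytorch", "performance"),
    ("ps-cv-pytorch", "security"),
}

# Count failures per recommendation type, keeping types with zero failures.
failures_by_type = Counter({rtype: 0 for rtype in TYPES})
for _, rtype in FAILURES:
    failures_by_type[rtype] += 1

# Overlays that received advice for every recommendation type.
clean = [o for o in OVERLAYS if all((o, t) not in FAILURES for t in TYPES)]

for rtype in TYPES:
    print(f"{rtype}: {failures_by_type[rtype]} of {len(OVERLAYS)} failed")
print("successful for all types:", ", ".join(clean))
```

Every failure in these runs falls on an overlay that includes tensorflow or pytorch, and none occurs with the latest recommendation type.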

codificat commented 2 years ago

> Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml).

Based on this, I believe we can /close this one as complete.

We still need to ensure that the integration tests, which include checks for successful advice on the predictable stacks, run successfully (e.g. https://github.com/thoth-station/integration-tests/issues/324), and possibly review the justifications related to the failures for some stack/type combinations.

These are tracked in separate issues as appropriate.

sesheta commented 2 years ago

@codificat: Closing this issue.

In response to [this](https://github.com/thoth-station/core/issues/326#issuecomment-1161778695).

thoth-station / core