thoth-station / core

Using Artificial Intelligence to analyse and recommend Software Stacks for Artificial Intelligence applications.
https://thoth-station.github.io/
GNU General Public License v3.0

[5pt] Make sure ps-stacks can receive recommendation from Thoth #326

Closed pacospace closed 2 years ago

pacospace commented 3 years ago

Describe the bug As a user of Thoth PS images,

I want the software stacks to receive continuous updates and to be maintained by Thoth services.

To Reproduce Steps to reproduce the behavior:

  1. Run thamos advise on all ps-stacks
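A rough way to script this step is to call the thamos CLI once per overlay. The sketch below only builds the command lines unless `dry_run` is disabled. The overlay names come from the status tables later in this issue; the helper names are hypothetical, and the `--runtime-environment` and `--recommendation-type` options are assumptions that should be checked against `thamos advise --help` for the installed version.

```python
import subprocess

# Overlay names as they appear in the status tables in this issue.
PS_OVERLAYS = [
    "ps-nlp", "ps-nlp-tensorflow", "ps-nlp-tensorflow-gpu", "ps-nlp-pytorch",
    "ps-cv-ocr", "ps-cv-tensorflow", "ps-cv-pytorch",
    "ps-ip-ifd",
]


def build_advise_command(overlay, recommendation_type="latest"):
    """Return the argv list for one advise call (flag names are assumptions)."""
    return [
        "thamos", "advise",
        "--runtime-environment", overlay,
        "--recommendation-type", recommendation_type,
    ]


def advise_all(overlays=PS_OVERLAYS, recommendation_type="latest", dry_run=True):
    """Collect the commands (dry run) or run them and collect exit codes."""
    results = {}
    for overlay in overlays:
        cmd = build_advise_command(overlay, recommendation_type)
        if dry_run:
            results[overlay] = cmd
        else:
            # check=False: keep going so every overlay gets a result.
            results[overlay] = subprocess.run(cmd, check=False).returncode
    return results
```

With `dry_run=True` this is safe to run anywhere; switching it off requires a checkout that contains the overlays and a configured `.thoth.yaml`.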

Expected behavior All ps-* stacks can be advised by Thoth (all integration tests are green for ps-stacks: https://github.com/thoth-station/integration-tests/issues/204)

Screenshots

Additional context ps-*:

goern commented 3 years ago

/priority important-soon /assign @codificat /triage accepted

goern commented 2 years ago

any update on this?

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

codificat commented 2 years ago

/remove-lifecycle stale

codificat commented 2 years ago

In the last integration test runs for aws-prod there are errors in some of the ps-* tests: ps-cv-{pytorch,tensorflow} and ps-nlp-tensorflow fail due to timeouts:

2022-05-02 03:41:12,899 thoth.adviser.run           ERROR: Child exited with exit code 10
2022-05-02 03:25:01,696 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Other related integration tests succeeded.

In the last run of integration tests for smaug-prod, ps-* tests failed with HTTP 400 codes (bad request), e.g.

Then I ask for an advise for the cloned application for runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis (52.758s) 
Error Message

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/behave/model.py", line 1329, in run
    match.run(runner.context)
  File "/opt/app-root/lib64/python3.8/site-packages/behave/matchers.py", line 98, in run
    self.func(context, *args, **kwargs)
  File "features/steps/advise.py", line 248, in step_impl
    results = advise_using_config(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 397, in advise_using_config
    return advise(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 118, in wrapper
    result = func(api_client, *args, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 583, in advise
    response = _retrieve_analysis_result(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 276, in _retrieve_analysis_result
    return retrieve_func(analysis_id)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 53, in get_advise_python
    (data) = self.get_advise_python_with_http_info(analysis_id, **kwargs)  # noqa: E501
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 112, in get_advise_python_with_http_info
    return self.api_client.call_api(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 316, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 148, in __call_api
    response_data = self.request(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 338, in request
    return self.rest_client.GET(url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 228, in GET
    return self.request("GET", url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
thamos.swagger_client.rest.ApiException: (400)
Reason: BAD REQUEST
HTTP response headers: HTTPHeaderDict({'server': 'gunicorn', 'date': 'Thu, 28 Apr 2022 01:05:55 GMT', 'content-type': 'application/json', 'content-length': '272', 'x-thoth-version': '0.34.14', 'x-user-api-service-version': '0.34.14+messaging.0.16.0.storages.0.71.1.common.0.36.0.python.0.16.9', 'x-thoth-search-ui-url': 'https://thoth-station.ninja/search/', 'access-control-allow-origin': '*', 'set-cookie': '829f3dbab311aaac0d90f580d731991c=d36e665b294c43e30415dbb1b2323809; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'
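For triage, the response body above can be decoded to pull out the analysis ID and how long the adviser ran before it reported an error. This is a minimal sketch using only the standard library and the body exactly as logged; the helper name is hypothetical.

```python
import json
from datetime import datetime

# Response body copied from the ApiException above.
RESPONSE_BODY = (
    b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n'
    b'    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n'
    b'  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n'
    b'    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n'
    b'    "state": "error"\n  }\n}\n'
)

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def summarize_failed_analysis(body):
    """Extract the ID, state, reason and run time from an error response body."""
    payload = json.loads(body)
    status = payload["status"]
    started = datetime.strptime(status["started_at"], TIMESTAMP_FORMAT)
    finished = datetime.strptime(status["finished_at"], TIMESTAMP_FORMAT)
    return {
        "analysis_id": payload["parameters"]["analysis_id"],
        "state": status["state"],
        "reason": status["reason"],
        "duration_seconds": (finished - started).total_seconds(),
    }
```

Note that `reason` is `null` in this response, so the body alone does not say why the analysis failed; the analysis ID is what ties it back to the adviser logs.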

codificat commented 2 years ago

/milestone OKR review Q2 2022 /sig user-experience

codificat commented 2 years ago

/remove-sig user-experience /sig stack-guidance

because there are issues resolving the stacks here

fridex commented 2 years ago

The last integration-tests report (Integration tests update for ocp4-stage (2022-05-03 version 0.11.2)) has the following scenarios failing:

All of them use the latest recommendation type. In those cases the predictor used in the adviser implementation uses "hops": if the latest versions alone cannot be resolved, it randomly takes some other path in the resolution process. This implementation might not be ideal for these cases, and it would be better to provide one that uses backtracking (similar to pip, but offline, using the dependency information from the database - see https://github.com/thoth-station/adviser/issues/2329).
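To illustrate the difference, here is a toy backtracking resolver over an offline dependency index. It is not the adviser implementation and not necessarily what the linked issue proposes; the index contents and function names are invented for the example. Candidates are tried newest first, and a choice is undone as soon as it leads to a conflict.

```python
from typing import Dict, List, Optional, Tuple

Requirement = Tuple[str, List[str]]  # (package name, allowed versions, newest first)

# Hypothetical offline index: package -> version -> {dependency: allowed versions}.
INDEX: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "app": {"1.0": {"a": ["2.0", "1.0"], "b": ["1.0"]}},
    "a": {"2.0": {"c": ["2.0"]}, "1.0": {"c": ["1.0"]}},
    "b": {"1.0": {"c": ["1.0"]}},
    "c": {"2.0": {}, "1.0": {}},
}


def resolve(todo: List[Requirement], pinned: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Depth-first search with backtracking; returns a full pin set or None."""
    if not todo:
        return pinned
    (name, allowed), rest = todo[0], todo[1:]
    if name in pinned:
        # Already chosen earlier: the existing pin must satisfy this requirement too.
        return resolve(rest, pinned) if pinned[name] in allowed else None
    for version in allowed:
        if version not in INDEX.get(name, {}):
            continue  # version unknown to the index, skip it
        deps = list(INDEX[name][version].items())
        result = resolve(rest + deps, {**pinned, name: version})
        if result is not None:
            return result
        # Otherwise fall through and try the next (older) candidate.
    return None


solution = resolve([("app", ["1.0"])], {})
```

In this toy index the newest `a` (2.0) needs `c` 2.0 while `b` needs `c` 1.0, so a latest-only pass fails; backtracking falls back to `a` 1.0 and finds a consistent stack.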

These issues are also consistent with the solving error described in https://github.com/thoth-station/integration-tests/issues/266#issuecomment-1060822639. Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard, so it fails to obtain dependency information (this was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This supports the first paragraph: adviser might be failing to find suitable versions when the latest recommendation type is used.

To introspect what is happening here, we might:

  1. try removing jupyter-tensorboard from the requirements and asking for advice using the latest recommendation type
  2. try running adviser with a different recommendation type, such as stable, which uses a resolution algorithm based on reinforcement learning, and see if it finds a resolution
  3. try manually pinning jupyter-tensorboard to an older version that is solvable by thoth-solver and see if the resolution process finds a solution even for the latest recommendation type

Also, we can try user stack scoring and see how the resolver behaves with specific versions of libraries, to narrow down the likely culprit.
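The requirement variants for experiments 1 and 3 can be generated mechanically. A small sketch, assuming the direct requirements are available as a plain list of requirement strings; the example stack and helper names are invented, and the parsing is deliberately simple (it only looks at the leading package name).

```python
import re
from typing import List


def _req_name(requirement: str) -> str:
    """Return the normalized package name at the start of a requirement string."""
    match = re.match(r"[A-Za-z0-9._-]+", requirement.strip())
    name = match.group(0) if match else ""
    return re.sub(r"[-_.]+", "-", name).lower()


def without_package(requirements: List[str], name: str) -> List[str]:
    """Variant for experiment 1: drop one package from the requirements."""
    return [r for r in requirements if _req_name(r) != _req_name(name)]


def pin_package(requirements: List[str], name: str, version: str) -> List[str]:
    """Variant for experiment 3: replace any spec for `name` with an exact pin."""
    return [
        f"{name}=={version}" if _req_name(r) == _req_name(name) else r
        for r in requirements
    ]


# Hypothetical direct requirements of one overlay.
stack = ["jupyterlab", "jupyter-tensorboard", "tensorflow"]
variant_removed = without_package(stack, "jupyter-tensorboard")
variant_pinned = pin_package(stack, "jupyter-tensorboard", "0.1.1")
```

Each variant can then be written back to the overlay's requirements and submitted for advice with the recommendation type under test.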

fridex commented 2 years ago

Tested with stable recommendation type:

fridex commented 2 years ago

Tested with latest recommendation type without jupyter-tensorboard package in the stack:

fridex commented 2 years ago

Tested with latest recommendation type and jupyter-tensorboard==0.1.1 (solvable using our solver):

fridex commented 2 years ago

Possible fixes:

  1. use jupyter-tensorboard==0.1.1 in all the stacks that use it
  2. remove jupyter-tensorboard (if it is not used)
  3. contact jupyter-tensorboard upstream for a possible fix - so that it does not have hard requirements on packages to be present in the environment during installation
  4. patch jupyter-tensorboard ourselves and host a patched version on our Pulp Python Package Index

harshad16 commented 2 years ago

> Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right? Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

fridex commented 2 years ago

> Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

> This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?

Generally, no - we are not able to solve libraries that have hard requirements on the environment that are not met in our solvers. Ideally, jupyter-tensorboard should not depend on the environment or execute code during the installation process - or at least it should not make that a hard requirement (if the step fails, the package can still be installed).

This might get better over time as Python packaging evolves (and provides static wheel metadata).
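To show what static metadata buys us: a wheel is a zip archive whose `*.dist-info/METADATA` file lists dependencies as `Requires-Dist` headers, so they can be read without running any installation code. The sketch below is only an illustration, not how thoth-solver works; the demo archive and its contents are invented so the example is self-contained.

```python
import io
import zipfile
from email.parser import Parser
from typing import List


def read_requires_dist(wheel_bytes: bytes) -> List[str]:
    """Read Requires-Dist entries from a wheel without installing it."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it.
    return Parser().parsestr(text).get_all("Requires-Dist") or []


def make_demo_wheel() -> bytes:
    """Build a tiny in-memory archive that mimics a wheel (invented contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        wheel.writestr(
            "demo_pkg-0.1.dist-info/METADATA",
            "Metadata-Version: 2.1\n"
            "Name: demo-pkg\n"
            "Version: 0.1\n"
            "Requires-Dist: jupyterlab\n"
            "Requires-Dist: tensorboard>=2.0\n",
        )
    return buffer.getvalue()


requirements = read_requires_dist(make_demo_wheel())
```

This only helps for releases that ship wheels with complete metadata; packages that run code at install time to discover or register things stay problematic.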

> Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

The stack info provided to the user lists the versions that were removed:

"The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' were removed due to installation issues in the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"

Thoth also suggested using it, for example in the first successful resolution with stable recommendation type:

fridex commented 2 years ago

> Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:

And for others, it looks like it failed as it did not find any resolution in the allocated time.

harshad16 commented 2 years ago

ack, thanks for the explanation.

codificat commented 2 years ago

/remove-label human_intervention_required

sesheta commented 2 years ago

@codificat: The label(s) /remove-label human_intervention_required cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected

codificat commented 2 years ago

The integration tests in stage are suffering from cluster issues that have been going on for a while and are expected to take some more time to fix.

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml). Other recommendation types had a few failures.

The recommendations that are failing fail with the following message:

Resolver did not find any stack that would satisfy requirements and stack characteristics given the time allocated - see https://thoth-station.ninja/j/no_stack

Below is the current status with each stack.

ps-nlp

| overlay | type | result | advise ID | time |
| --- | --- | --- | --- | --- |
| ps-nlp | latest | success | adviser-220613143911-3483a20bdb243903 | 49s |
| ps-nlp-tensorflow | latest | success | adviser-220613144121-b52e7ffa560ce6c6 | 1m 47s |
| ps-nlp-tensorflow-gpu | latest | success | adviser-220613144345-97b1d4046d403a80 | 2m 4s |
| ps-nlp-pytorch | latest | success | adviser-220613115118-b406e285a0ae8618 | 2m 2s |
| ps-nlp | stable | success | adviser-220614064546-327778099d10c008 | 34m 35s |
| ps-nlp-tensorflow | stable | failure | adviser-220614075651-ef920679375c9a8f | 26m 7s |
| ps-nlp-tensorflow-gpu | stable | success | adviser-220614160559-2f1a80945ee22db5 | 26m 30s |
| ps-nlp-pytorch | stable | success | adviser-220614165324-519f4729bc0fbb77 | 26m 30s |
| ps-nlp | security | success | adviser-220614110005-7b9c92d2284d37fc | 16m 40s |
| ps-nlp-tensorflow | security | success | adviser-220614084155-3ca5965c25ece6d8 | 19m 33s |
| ps-nlp-tensorflow-gpu | security | success | adviser-220614172749-ea72fedac3595022 | 17m 55s |
| ps-nlp-pytorch | security | success | adviser-220614120043-6e7ea342826ad597 | 25m 23s |
| ps-nlp | performance | success | adviser-220614072310-4fe6535e416419a8 | 26m 22s |
| ps-nlp-tensorflow | performance | success | adviser-220614134333-d76be662fa33319a | 26m 29s |
| ps-nlp-tensorflow-gpu | performance | failure | adviser-220614153715-1fc19007f995c727 | 26m 9s |
| ps-nlp-pytorch | performance | failure | adviser-220614150618-eba5b4e3183fd0f2 | 26m 14s |

ps-cv

| overlay | type | result | advise ID | time |
| --- | --- | --- | --- | --- |
| ps-cv-ocr | latest | success | adviser-220613144932-7c569b4d4585fd54 | 22s |
| ps-cv-tensorflow | latest | success | adviser-220613145241-6d5d7bdc27ac3a3a | 1m 18s |
| ps-cv-pytorch | latest | success | adviser-220613145031-fcb0a951d8adb577 | 1m 44s |
| ps-cv-ocr | stable | success | adviser-220613184700-1f90f53ccf4159b1 | 2m 24s |
| ps-cv-tensorflow | stable | failure | adviser-220613162944-6d3bf6b86a373e6d | 22m 49s |
| ps-cv-pytorch | stable | failure | adviser-220613180701-d348b4ef9c9b3e87 | 26m 17s |
| ps-cv-ocr | performance | success | adviser-220613185358-fb8309ed55dd32d9 | 2m 7s |
| ps-cv-tensorflow | performance | failure | adviser-220613210546-7078358b1944fd96 | 26m 8s |
| ps-cv-pytorch | performance | failure | adviser-220613192843-85146a8c21afc8ee | 27m 3s |
| ps-cv-ocr | security | success | adviser-220614183752-2991f946845d4af | 27s |
| ps-cv-tensorflow | security | failure | adviser-220614180129-becf1431eab85efe | 27m 46s |
| ps-cv-pytorch | security | failure | adviser-220614183855-da15ed8545b4868 | 26m 13s |

ps-ip

| overlay | type | result | advise ID | time |
| --- | --- | --- | --- | --- |
| ps-ip-ifd | latest | success | adviser-220613145447-b8f73428af85d2bc | 31s |
| ps-ip-ifd | stable | success | adviser-220613160047-b5ae72918b1150b7 | 20m 56s |
| ps-ip-ifd | performance | success | adviser-220613185732-2c9fc59a216df36a | 23m 17s |
| ps-ip-ifd | security | success | adviser-220614174631-d062f02ad3a261bf | 54s |
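To see the pattern across recommendation types at a glance, the results above can be tallied per type. A small sketch; the rows are transcribed from the three tables (overlay, type, result only), and the helper name is hypothetical.

```python
from collections import Counter, defaultdict

# (overlay, type, result) rows transcribed from the tables above.
ROWS = """
ps-nlp latest success
ps-nlp-tensorflow latest success
ps-nlp-tensorflow-gpu latest success
ps-nlp-pytorch latest success
ps-nlp stable success
ps-nlp-tensorflow stable failure
ps-nlp-tensorflow-gpu stable success
ps-nlp-pytorch stable success
ps-nlp security success
ps-nlp-tensorflow security success
ps-nlp-tensorflow-gpu security success
ps-nlp-pytorch security success
ps-nlp performance success
ps-nlp-tensorflow performance success
ps-nlp-tensorflow-gpu performance failure
ps-nlp-pytorch performance failure
ps-cv-ocr latest success
ps-cv-tensorflow latest success
ps-cv-pytorch latest success
ps-cv-ocr stable success
ps-cv-tensorflow stable failure
ps-cv-pytorch stable failure
ps-cv-ocr performance success
ps-cv-tensorflow performance failure
ps-cv-pytorch performance failure
ps-cv-ocr security success
ps-cv-tensorflow security failure
ps-cv-pytorch security failure
ps-ip-ifd latest success
ps-ip-ifd stable success
ps-ip-ifd performance success
ps-ip-ifd security success
"""


def tally_by_type(rows_text):
    """Count success/failure results per recommendation type."""
    tally = defaultdict(Counter)
    for line in rows_text.strip().splitlines():
        _overlay, rec_type, result = line.split()
        tally[rec_type][result] += 1
    return tally


tally = tally_by_type(ROWS)
for rec_type, counts in tally.items():
    print(rec_type, dict(counts))
```

The failures are concentrated in the tensorflow and pytorch overlays with the non-latest recommendation types, most of them after roughly 26 minutes of run time.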

codificat commented 2 years ago

> Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml).

Based on this, I believe we can /close this one as complete.

We still need to ensure that the integration tests, which include checks for successful advice on the predictable stacks, run successfully (e.g. https://github.com/thoth-station/integration-tests/issues/324), and possibly review the justifications for the failures on some stack/type combinations.

These are tracked in separate issues as appropriate.

sesheta commented 2 years ago

@codificat: Closing this issue.
