thoth-station / thoth-application

Thoth-Station ArgoCD Applications
GNU General Public License v3.0

[spike][3pt] Document is not available for a few packages #2604

Closed (harshad16 closed this 1 year ago)

harshad16 commented 2 years ago

Describe the bug: The document-sync job is not syncing all the documents.

Acceptance criteria

Additional notes:

codificat commented 2 years ago

Steps to reproduce the behavior:

This is also reproducible directly via the API, e.g.:

$ http 'https://khemenu.thoth-station.ninja/api/v1/python/package/version/metadata?name=pandas&version=1.4.2&index=https%3A%2F%2Fpypi.org%2Fsimple&os_name=fedora&os_version=34&python_version=3.9'
HTTP/1.1 500 INTERNAL SERVER ERROR
access-control-allow-origin: *
content-length: 399
content-type: application/json
date: Thu, 07 Jul 2022 10:45:50 GMT
server: gunicorn
set-cookie: 99770cb82864be05282857f803e02327=71283d09b941169601817f89b67f6781; path=/; HttpOnly; Secure; SameSite=None
x-thoth-search-ui-url: https://thoth-station.ninja/search/
x-thoth-version: 0.35.2
x-user-api-service-version: 0.35.2+messaging.0.16.1.storages.0.72.1.common.0.36.2.python.0.16.10

{
    "error": "Solver document not found - solver documents are not in sync with database records, please contact administrator with the provided information: solver-fedora-34-py39-220403215308-caada45da50e5009",
    "parameters": {
        "index": "https://pypi.org/simple",
        "name": "pandas",
        "os_name": "fedora",
        "os_version": "34",
        "python_version": "3.9",
        "version": "1.4.2"
    }
}
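
The same request can be rebuilt programmatically, and the solver document ID pulled out of the error message for the administrator. This is a minimal sketch using only the Python standard library; the helper names `build_metadata_url` and `extract_document_id` are made up for illustration, and the regular expression assumes the error message format shown above.

```python
import re
from urllib.parse import urlencode

# Base endpoint taken from the request above.
API_BASE = "https://khemenu.thoth-station.ninja/api/v1"


def build_metadata_url(name, version, index, os_name, os_version, python_version):
    """Build the package metadata URL queried in the example above."""
    query = urlencode(
        {
            "name": name,
            "version": version,
            "index": index,
            "os_name": os_name,
            "os_version": os_version,
            "python_version": python_version,
        }
    )
    return f"{API_BASE}/python/package/version/metadata?{query}"


def extract_document_id(error_message):
    """Return the solver document ID named in the error, or None if absent."""
    # Assumption: IDs start with "solver-" and contain only word chars, dots, hyphens.
    match = re.search(r"\bsolver-[\w.-]+", error_message)
    return match.group(0) if match else None
```

Calling `extract_document_id` on the `error` field of the response above would return the document ID that the message asks to be passed on.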

harshad16 commented 2 years ago

/triage accepted
/sig devsecops
/priority important-soon

VannTen commented 2 years ago

The job is indeed still running, and previous instances appear to have the same issue (they did not complete properly):

$ oc get jobs -n thoth-frontend-stage | grep document-sync
JOB                       Completed   Duration   Age
document-sync-27596160    0/1         34d        34d
document-sync-27619200    1/1         38m        17d
document-sync-27620640    0/1         17d        17d
document-sync-27640800    0/1         3d5h       3d5h
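
To pick out the stuck runs without reading the table by eye, the listing can be filtered on its completion column. This is only a sketch: `incomplete_jobs` is a hypothetical helper, and it assumes the whitespace-separated layout pasted above, with the completion count in the second column.

```python
def incomplete_jobs(listing):
    """Return names of jobs whose completion count is below the target."""
    pending = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (no "done/total" field).
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        done, total = (int(part) for part in fields[1].split("/"))
        if done < total:
            pending.append(fields[0])
    return pending


# Listing copied from the output above.
LISTING = """\
JOB                       Completed   Duration   Age
document-sync-27596160    0/1         34d        34d
document-sync-27619200    1/1         38m        17d
document-sync-27620640    0/1         17d        17d
document-sync-27640800    0/1         3d5h       3d5h
"""

print(incomplete_jobs(LISTING))
# ['document-sync-27596160', 'document-sync-27620640', 'document-sync-27640800']
```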

Moreover, the currently running pod had its container SIGKILLed this morning:

...
Containers:
  document-sync-job:
    Container ID:   cri-o://ff6bafd4184f07ea754324f90dccb3630fdd1102c18b06d1f3e3a287a4499d0e
    Image:          quay.io/thoth-station/document-sync-job:v0.1.0
    Image ID:       quay.io/thoth-station/document-sync-job@sha256:8661512d23891a5abe4596540dbd85a422166b09f2b4ea87ecc9a53231971ad9
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 25 Jul 2022 05:10:32 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 22 Jul 2022 05:09:54 +0200
      Finished:     Mon, 25 Jul 2022 05:10:32 +0200
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Liveness:  tcp-socket :80 delay=259200s timeout=1s period=10s #success=1 #failure=1
...

And the only log lines all look like this:

$ oc logs document-sync-27640800--1-6cq4w -p --tail=10
2022-07-25 03:10:20,254   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181556-e84a86006fca22f5' is already present
2022-07-25 03:10:21,520   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181557-583d00fd0fea564a' is already present
2022-07-25 03:10:22,820   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181557-7b01e5ca8aacdf9' is already present
2022-07-25 03:10:24,178   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181557-8b6a1a2404d79458' is already present
2022-07-25 03:10:25,471   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181558-20547678769a6ba2' is already present
2022-07-25 03:10:26,753   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181558-58d6308a85a04fa5' is already present
2022-07-25 03:10:28,009   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181558-70676d958fdd91fa' is already present
2022-07-25 03:10:29,274   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181558-cc891b9496ed520c' is already present
2022-07-25 03:10:30,547   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181559-33a40b62136dc964' is already present
2022-07-25 03:10:31,777   1 INFO     thoth.document_sync:71: Document 'solver-fedora-34-py39-211112181559-858a32d827a67580' is already present

(this goes on for 3 days)
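
For a sense of scale, the timestamps in the excerpt above give a rough per-document check time. This is back-of-the-envelope arithmetic, assuming the ten lines shown are representative of the whole run; `seconds_per_document` is a hypothetical helper name.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# First and last timestamps of the ten log lines above.
FIRST = "2022-07-25 03:10:20,254"
LAST = "2022-07-25 03:10:31,777"
LINES = 10


def seconds_per_document(first, last, lines):
    """Average gap between consecutive log lines, in seconds."""
    elapsed = datetime.strptime(last, LOG_TIME_FORMAT) - datetime.strptime(first, LOG_TIME_FORMAT)
    return elapsed.total_seconds() / (lines - 1)


per_doc = seconds_per_document(FIRST, LAST, LINES)
three_days = 3 * 24 * 60 * 60

print(f"{per_doc:.2f} s per document")
print(f"about {three_days / per_doc:,.0f} checks in 3 days")
```

At roughly 1.3 s per check, three days covers on the order of 200,000 "already present" checks, provided the rate holds for the whole run.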

So it seems the job is busy doing nothing but checking already-present documents, then gets killed (I'd say OOM-killed given the signal, but we need some metrics to confirm) and goes back to step one.
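
The exit code and timestamps from the pod description above can be checked against that hypothesis. The sketch below relies on the common convention that a container killed by signal N exits with code 128 + N; the helper names are made up for illustration, and `signal.Signals` only resolves SIGKILL on Unix-like systems.

```python
import signal
from datetime import datetime

# Values copied from the pod description above.
EXIT_CODE = 137
LIVENESS_INITIAL_DELAY_S = 259200
STARTED = "Fri, 22 Jul 2022 05:09:54 +0200"
FINISHED = "Mon, 25 Jul 2022 05:10:32 +0200"
TIME_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def signal_from_exit_code(exit_code):
    """Name the signal behind a 128+N exit code, or None for a normal exit."""
    if exit_code <= 128:
        return None
    return signal.Signals(exit_code - 128).name


def runtime_seconds(started, finished):
    """Seconds between two timestamps in the format used by the pod description."""
    delta = datetime.strptime(finished, TIME_FORMAT) - datetime.strptime(started, TIME_FORMAT)
    return int(delta.total_seconds())


runtime = runtime_seconds(STARTED, FINISHED)
print(signal_from_exit_code(EXIT_CODE))      # SIGKILL
print(runtime)                               # 259238
print(runtime - LIVENESS_INITIAL_DELAY_S)    # 38
```

The exit code confirms a SIGKILL but does not say who sent it. The previous container ran for 259238 s, within a minute of the liveness probe's 259200 s initial delay, so a failed `tcp-socket :80` probe may be worth ruling out alongside the OOM hypothesis. Memory metrics would still be needed to settle it either way.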

harshad16 commented 1 year ago

After the work and review, we feel it should be 5pt.