oda-hub / dispatcher-plugin-nb2workflow


Workflow execution failure during dispatcher PR #72

Open burnout87 opened 12 months ago

burnout87 commented 12 months ago

During the development of https://github.com/oda-hub/dispatcher-app/pull/585 , the following workflow failed:

https://github.com/oda-hub/dispatcher-app/actions/runs/6149352313/job/16685039350

Please feel free to move the issue somewhere else more suitable if needed

volodymyrss commented 12 months ago

It's a duplicate of https://github.com/oda-hub/dispatcher-plugin-nb2workflow/issues/72

volodymyrss commented 12 months ago

Actually, I'll keep it open to confirm it's a duplicate.

dsavchenko commented 12 months ago

This is not a duplicate: tests for both plugins were failing. The complication is that it's not always reproducible, and it seems the causes (or at least the manifestations in the logs) were probably different.

burnout87 commented 11 months ago

I think the same error appeared again:

https://github.com/oda-hub/dispatcher-app/actions/runs/6484236816/job/17607602126

dsavchenko commented 11 months ago

Yes, it's the same error, and I can't reproduce it locally. I see lots of timeouts while polling the dispatcher with oda-api in test_full_stack. The dispatcher is functional, though, and replies correctly, but only after the request has already timed out.

We discussed a bit that we want to profile and reduce latency if possible. But for the time being, as it only appears in the pipeline, I propose https://github.com/oda-hub/dispatcher-plugin-nb2workflow/pull/73
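
To illustrate the failure mode above, here is a minimal polling sketch (not the actual test_full_stack code; the query_status/"submitted" convention and the parameter values are assumptions) in which a too-tight per-request timeout fires even though the dispatcher eventually replies:

```python
import time

import requests


def poll_dispatcher(url, params, per_request_timeout=30.0,
                    overall_deadline=600.0, poll_interval=5.0):
    """Poll the dispatcher until the job leaves the 'submitted' state.

    If per_request_timeout is shorter than the dispatcher's response time,
    requests raises a timeout even though the service eventually answers,
    which is the symptom seen in the CI logs.
    """
    start = time.monotonic()
    while time.monotonic() - start < overall_deadline:
        resp = requests.get(url, params=params, timeout=per_request_timeout)
        resp.raise_for_status()
        data = resp.json()
        if data.get("query_status") != "submitted":
            return data
        time.sleep(poll_interval)
    raise TimeoutError("dispatcher did not finish within the overall deadline")
```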

burnout87 commented 11 months ago

I re-ran the workflow, and this time it completed.

dsavchenko commented 11 months ago

Still, let's keep this issue open for some time

burnout87 commented 9 months ago

I think I encountered the same issue again; is it related? A TimeoutError is mentioned.

https://github.com/oda-hub/dispatcher-app/actions/runs/7087471000/job/19287806505?pr=626

dsavchenko commented 9 months ago

That's a different kind of timeout

TimeoutError: The provided start pattern Serving Flask app could not be matched within the specified time interval of 30 seconds

It's related to the live_nb2service fixture, which starts nb2service as a separate process via the xprocess library and waits for 'Serving Flask app' to appear in its stdout.

Didn't we have changes to nb2workflow which could, e.g., affect the verbosity?

This xprocess setup sometimes used to cause issues when debugging tests locally: a test that is not cleanly terminated can leave the process running, making it impossible to start another one because the port is still in use. But in CI it has always worked well. I will investigate further.
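
For reference, a minimal sketch of what such an xprocess-based fixture can look like; the command line, port, and timeout value are assumptions, not the actual live_nb2service implementation:

```python
import pytest
from xprocess import ProcessStarter


@pytest.fixture
def live_nb2service(xprocess):
    port = 9393  # hypothetical port; the real fixture may choose it differently

    class Starter(ProcessStarter):
        # Line that must appear in the child's stdout before the fixture returns;
        # the TimeoutError quoted above is raised when it is not seen in time.
        pattern = "Serving Flask app"
        # Seconds to wait for the pattern (assumed to match the 30 s in the error).
        timeout = 30
        # Illustrative command line; the real tests may start nb2service differently.
        args = ["nb2service", "--port", str(port), "tests/example_notebooks"]

    xprocess.ensure("nb2service", Starter)
    yield f"http://127.0.0.1:{port}"
    # Terminate the child even if a test fails, so the port is freed and the
    # "port already in use" problem described above is avoided on the next run.
    xprocess.getinfo("nb2service").terminate()
```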

dsavchenko commented 9 months ago

It was a transient issue; not sure what the cause was. I wasn't able to reproduce it locally. Then I restarted the CI job and it passed.

burnout87 commented 9 months ago

ok, it did the same for me

volodymyrss commented 9 months ago

It would be good to have more robust process starting/tracking behavior to avoid this. Though I suspect this particular issue will not happen in production, since if the service is not starting, the pod will be recreated.

volodymyrss commented 9 months ago

Still, let's keep this open for tracking.

dsavchenko commented 9 months ago

> Though I suspect this particular issue will not happen in production, since if the service is not starting, the pod will be recreated.

Exactly, this mechanism is only used in tests

volodymyrss commented 9 months ago

> Though I suspect this particular issue will not happen in production, since if the service is not starting, the pod will be recreated.

> Exactly, this mechanism is only used in tests

Well, it might be that the server is not starting for some reason; that would then be an issue for nb2workflow itself. A port already being in use is indeed a common issue; in the dispatcher I made a custom xprocess analog which tries to deal with this. I think we might be able to adapt xprocess to behave better, but let's leave it for now.
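
Not the actual dispatcher helper, but a sketch of the kind of check such an analog can perform before starting the service, so a leftover process on the port is detected (or a free ephemeral port is picked instead):

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for an ephemeral port that is free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
```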

burnout87 commented 9 months ago

looks like it happened again:

https://github.com/oda-hub/dispatcher-app/actions/runs/7130587275/job/19417305982

volodymyrss commented 9 months ago

> looks like it happened again:

> https://github.com/oda-hub/dispatcher-app/actions/runs/7130587275/job/19417305982

Production was down for some 15 minutes; is something there calling it?

burnout87 commented 9 months ago

Now that I notice it, I did see some crashes elsewhere.