octue / octue-sdk-python

The python SDK for @Octue services and digital twins.
https://octue.com

Service creating and publishing to a topic created for a different version of the service #597

Closed nvn-nil closed 1 year ago

nvn-nil commented 1 year ago

Bug report

What is the current behavior?

Versions of the power loss service and their specs:

power loss service@0.16.6
    octue: 0.43.5
    wake-service: 0.9.4
        octue: 0.46.2
    calculation status: working
power loss service@0.16.10
    octue: 0.47.0
    wake-service: 0.9.7
        octue: 0.46.2
    calculation status: working until the release of 0.16.11 (this could be due to another cause, but the release is the most obvious one)
power loss service@0.16.11
    change: Updated to use poetry, automatically select revision
    octue: 0.47.1
    wake-service: auto (0.9.8 was latest at the time using octue 0.46.2)
    calculation status: doesn't work. Fails with a monitor message schema error 'CannotDetermineSpecification'. See exception block in https://api.windquest.app/admin/db/projects/powerlossquestion/493519e9-e0dd-404e-8702-d38c775312ac/change/

Note: Unless stated otherwise, versions starting with 0.16.x are power loss service versions and versions starting with 0.9.x are wake service versions.

The last known working state was 0.16.10 before the release of 0.16.11 last night.

Today, I got reports of calculations that never completed (they remain in progress forever), which usually happens when the wake service crashes. I checked the Cloud Run (CR) logs and found the CannotDetermineSpecification error.

So I reverted to 0.16.10 (setting it as the default service revision in WQ), but retriggered questions still never completed. I checked the Cloud Run logs and saw a number of details = "Resource not found (resource=octue.services.windpioneers.wake-service.0-9-8.answers.uuid)" errors (note the 0.9.8 in the topic name). This was odd because 0.16.10 hardcodes wake service 0.9.7. This was when I raised the issue in the work chat group.

In the WQ logs, I see that the right version of the wake service is being called.

INFO service.py:335 <Service('windpioneers/power-loss-service:0.16.10')> asked a question 'fef885e2-194e-420b-97dc-3792e3d1d28c' to service 'windpioneers/wake-service:0-9-7'.

But the Cloud Run logs show that it's looking for the 0.9.8 answer topic.

[ERROR | google.cloud.pubsub_v1.publisher._batch.thread] [analysis-fef885e2-194e-420b-97dc-3792e3d1d28c] Failed to publish 1 messages.
Traceback (most recent call last): File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable return callable_(*args, **kwargs) File "/usr/local/lib/python3.9/site-packages/grpc/_c
status = StatusCode.NOT_FOUND
details = "Resource not found (resource=octue.services.windpioneers.wake-service.0-9-8.answers.fef885e2-194e-420b-97dc-3792e3d1d28c)."

Either wake service 0.9.7 is looking for the wrong topic, or power loss 0.16.10 is calling wake service 0.9.8 instead of 0.9.7 as it claims. The latter felt like the more likely reason (at the time) because of the latest service revision feature.
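To make the mismatch concrete, here is a minimal Python sketch that compares the revision named in the WQ log with the revision embedded in the topic from the Cloud Run error. The topic naming pattern is an assumption inferred only from the two log lines above, not taken from the octue source, and the helper names `expected_answer_topic` and `revision_tag_in_topic` are hypothetical.

```python
import re

# Assumed topic layout, inferred from the log lines in this report only:
# octue.services.<namespace>.<name>.<revision tag>.answers.<question uuid>
TOPIC_PATTERN = re.compile(
    r"^octue\.services\.(?P<namespace>[^.]+)\.(?P<name>[^.]+)"
    r"\.(?P<tag>[^.]+)\.answers\.(?P<question_uuid>[0-9a-f-]+)$"
)


def expected_answer_topic(service_revision, question_uuid):
    """Build the answer topic we would expect for a revision like 'namespace/name:tag'."""
    namespace_and_name, tag = service_revision.split(":")
    namespace, name = namespace_and_name.split("/")
    return f"octue.services.{namespace}.{name}.{tag}.answers.{question_uuid}"


def revision_tag_in_topic(topic):
    """Extract the revision tag from an answer topic name."""
    match = TOPIC_PATTERN.match(topic)
    if match is None:
        raise ValueError(f"Unrecognised topic name: {topic!r}")
    return match.group("tag")


# Values copied from the WQ log and the Cloud Run error above.
asked = "windpioneers/wake-service:0-9-7"
question_uuid = "fef885e2-194e-420b-97dc-3792e3d1d28c"
observed = (
    "octue.services.windpioneers.wake-service.0-9-8.answers."
    "fef885e2-194e-420b-97dc-3792e3d1d28c"
)

expected = expected_answer_topic(asked, question_uuid)
print("expected:", expected)
print("observed:", observed)
print("revision asked:", asked.split(":")[1])
print("revision in failing topic:", revision_tag_in_topic(observed))
```

The question UUID is identical in both names and only the revision tag differs, which is consistent with either hypothesis: the question reached 0.9.7 but the answer topic name was built with the wrong tag, or the question was actually handled by 0.9.8.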

I figured there was some issue with picking the right revision from the service registry, so I deleted the 0.9.8 service revision registered in WQ (leaving no wake service versions registered). I retriggered a question and it showed the same error message (looking for the 0.9.8 topic).

So I created a service revision 0.9.7 in WQ and set it as the default, in case the latest revision was always being fetched. I retriggered a question but the wake service gave the same wrong-topic error.

I tried removing the 0.9.8 tag from the released revision on Cloud Run. I retriggered a question and got the same error.

At this point, I came to the sad realization that I was in over my head and reverted to version 0.16.6 of power loss (0.16.8 and 0.16.9 are broken, and 0.16.7 was never registered). I knew this version had worked previously, and it uses an older version of octue without the service registry feature. This works, but the service is missing some important updates that were made in newer versions.

I recreated the service revision in WQ for wake service 0.9.8 (without making it the default) and added the 0.9.8 tag back to the Cloud Run revision.

thclark commented 1 year ago

Thank you @nvn-nil :)

cortadocodes commented 1 year ago

Are we able to close this now?

thclark commented 1 year ago

Yup, some of the issues that came out of this were fixed immediately and others have been made into separate issues, so closing now.