golemfactory / golem-core-python

Fix agreement termination and/or activity destroying #19

Open johny-b opened 1 year ago

johny-b commented 1 year ago

This is hard to reproduce, and I've only seen it on the public-beta subnet.

I know there is a problem because yacat.py sometimes leaves activities in the "stopping" state forever. The state is changed here:

activity_data[activity]['status'] = 'stopping'
await activity.parent.close_all()
activity_data[activity]['status'] = 'Dead [weak worker]'

So Agreement.close_all() sometimes hangs forever, and I don't know why or where.
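Until the root cause is found, one possible stopgap is to bound the wait on the caller's side so the status never stays at 'stopping' forever. Below is a minimal sketch using `asyncio.wait_for` from the standard library. The helper name `close_with_timeout` and the two stand-in coroutines are hypothetical and only illustrate the pattern; nothing here is part of golem-core-python.

```python
import asyncio


async def close_with_timeout(close, timeout: float) -> bool:
    """Await close() but give up after `timeout` seconds.

    `close` is any zero-argument callable returning an awaitable
    (hypothetical stand-in for something like activity.parent.close_all).
    Returns True if it finished in time, False if it timed out.
    """
    try:
        await asyncio.wait_for(close(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending close() call here.
        return False


async def _demo() -> dict:
    async def hanging_close():
        await asyncio.sleep(3600)  # simulates a close that never returns

    async def quick_close():
        await asyncio.sleep(0)  # simulates a close that returns promptly

    return {
        "hanging": await close_with_timeout(hanging_close, timeout=0.05),
        "quick": await close_with_timeout(quick_close, timeout=0.05),
    }


demo_result = asyncio.run(_demo())
```

In yacat.py this would presumably look like `await close_with_timeout(activity.parent.close_all, 30)`, followed by setting a distinct status when it returns False. Note that `wait_for` cancels the pending call on timeout; whether close_all() tolerates cancellation cleanly is an open question, and a timeout only hides the hang rather than explaining it.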

There is also a wider topic to consider here. @prekucki and I discussed some time ago that agreement termination should just always succeed, provided that the request reached yagna (i.e. yagna should notice our termination request and ensure the agreement is terminated as soon as possible). This is pretty important: we shouldn't be left with a hanging agreement because of some weird error (e.g. on the provider side). Disclaimer: I don't really know how yagna handles this now.

Edit: a little more data in https://github.com/golemfactory/golem-core-python/issues/47

Edit 2: current solution (Agreement.close_all()) is probably not very good - we can be stuck forever in a loop where yagna says

[2023-01-03T10:25:40.435+0100 ERROR ya_activity::error] Activity API server error: GSB error: Remote service at `/net/0x379c1f8c7f55929c7e5c491b08894159b8c96f15/activity/DestroyActivity` error: Bad request: endpoint address not found

in every iteration.
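One way to avoid spinning forever on an error like the one above is to cap the number of attempts and back off between them. The following is only a sketch of that idea, not the current close_all() implementation: `destroy_with_retries` and `FakeProvider` are hypothetical names, and the fake simply raises an error on each failed call to mimic a provider whose endpoint is not found.

```python
import asyncio


async def destroy_with_retries(destroy, max_attempts: int = 5, base_delay: float = 0.01):
    """Call destroy() until it succeeds or max_attempts is reached.

    Returns a tuple (succeeded, attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await destroy()
            return True, attempt
        except Exception:
            if attempt == max_attempts:
                break
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    return False, max_attempts


class FakeProvider:
    """Hypothetical stand-in that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int):
        self.failures_left = failures_before_success
        self.calls = 0

    async def destroy(self):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("endpoint address not found")


unreachable = FakeProvider(failures_before_success=10**9)
unreachable_result = asyncio.run(
    destroy_with_retries(unreachable.destroy, max_attempts=4, base_delay=0.001)
)

flaky = FakeProvider(failures_before_success=2)
flaky_result = asyncio.run(
    destroy_with_retries(flaky.destroy, max_attempts=4, base_delay=0.001)
)
```

When the attempts run out, the caller still has to decide what to do with the activity (e.g. mark it as abandoned locally), since the remote side may never confirm anything.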

Edit 3: There might not be a perfect solution available here. There is simply no way to ensure that the agreement was terminated (or the activity destroyed), because the provider might never respond.