oVirt / ovirt-imageio

The oVirt image transfer daemon and clients.
GNU General Public License v2.0

Attribute error during timeout cancelling transfer #208

Closed: jsattelb closed this issue 8 months ago

jsattelb commented 9 months ago

Recently our backup appliance kept receiving "status=500" responses from RemoveImageTicketVDS. I believe the response should actually be status=409 (Conflict). Changing the reference from TicketCancelTimeout to TransferCancelTimeout in _internal/handlers/tickets.py seems to produce the desired event.
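
For reference, a sketch of the one-line fix in the delete handler of _internal/handlers/tickets.py. The handler structure is reconstructed from the traceback below, and the body of the except clause (including http.CONFLICT) is an assumption based on the 409 response in the "after" event log, so treat everything except the exception name as illustrative:

def delete(self, req, resp, ticket_id):
    try:
        self.auth.remove(ticket_id)
    # Before: "except errors.TicketCancelTimeout as e:" - the class was
    # renamed, so this lookup raises AttributeError at request time.
    except errors.TransferCancelTimeout as e:
        # Assumed body: report 409 Conflict so the caller knows the
        # transfer still has ongoing operations and can retry later.
        raise http.Error(http.CONFLICT, str(e)) from None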

oVirt Event Log (before):

VDSM <FQDN> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error, error=Server failed to perform the request, check logs\n'

oVirt Event Log (after):

VDSM <FQDN> command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185\n'

File: /var/log/ovirt-imageio/daemon.log (before)

2023-12-26 20:59:06,957 INFO    (Thread-19) [http] OPEN connection=19 client=local
2023-12-26 20:59:06,958 INFO    (Thread-19) [tickets] [local] REMOVE transfer=7ef54af5-bc50-4705-948e-81d7a606ec35
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [auth] Cancelling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [auth] Canceling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35 ongoing operations
2023-12-26 20:59:06,958 DEBUG   (Thread-19) [ops] Cancelling operation <Read size=53687091200 offset=0 done=0 at 0x139773690712528>
2023-12-26 20:59:06,958 INFO    (Thread-19) [auth] Waiting until transfer 7ef54af5-bc50-4705-948e-81d7a606ec35 ongoing operations finish
2023-12-26 21:00:06,958 ERROR   (Thread-19) [http] Server error
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/tickets.py", line 99, in delete
    self.auth.remove(ticket_id)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 413, in remove
    if ticket.cancel(self._config.control.remove_timeout):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 324, in cancel
    raise errors.TransferCancelTimeout(self.transfer_id)
ovirt_imageio._internal.errors.TransferCancelTimeout: Timeout cancelling transfer 7ef54af5-bc50-4705-948e-81d7a606ec35

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/tickets.py", line 100, in delete
    except errors.TicketCancelTimeout as e:
AttributeError: module 'ovirt_imageio._internal.errors' has no attribute 'TicketCancelTimeout'
2023-12-26 21:00:06,961 DEBUG   (Thread-19) [http] Empty request line, client disconnected

File: /var/log/ovirt-imageio/daemon.log (after)

2023-12-26 23:19:51,490 INFO    (Thread-60) [http] OPEN connection=60 client=local
2023-12-26 23:19:51,490 INFO    (Thread-60) [tickets] [local] REMOVE transfer=a6bd976e-85ed-431d-bd4d-f4fe26fd7185
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [auth] Cancelling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [auth] Canceling transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185 ongoing operations
2023-12-26 23:19:51,490 DEBUG   (Thread-60) [ops] Cancelling operation <Read size=25769803776 offset=0 done=0 at 0x140330263172432>
2023-12-26 23:19:51,490 INFO    (Thread-60) [auth] Waiting until transfer a6bd976e-85ed-431d-bd4d-f4fe26fd7185 ongoing operations finish
nirs commented 9 months ago

@jsattelb Thanks for reporting! Do you want to try sending a simple PR? This looks like a trivial error.

Adding a test to cover this case would be nice; I think we already have tests in this area.
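
A minimal regression test could guard the exception name itself. This pytest-style sketch (hypothetical test name) assumes nothing beyond what the traceback shows:

from ovirt_imageio._internal import errors

def test_transfer_cancel_timeout_renamed():
    # Guard against the rename regression: handlers must reference an
    # exception class that actually exists in the errors module.
    assert hasattr(errors, "TransferCancelTimeout")
    assert not hasattr(errors, "TicketCancelTimeout")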

nirs commented 9 months ago

The bug was introduced by 03e2f602ad191da83b7254d7003639c07c9250c2, which renamed TicketCancelTimeout to TransferCancelTimeout without replacing all usages of the old name.
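
A quick way to find such leftover usages, using only the standard library (the site-packages path matches the tracebacks above):

import pathlib

# Scan the installed package for references to the removed class name.
pkg = pathlib.Path("/usr/lib64/python3.6/site-packages/ovirt_imageio")
for path in sorted(pkg.rglob("*.py")):
    for lineno, line in enumerate(path.read_text().splitlines(), 1):
        if "TicketCancelTimeout" in line:
            print(f"{path}:{lineno}: {line.strip()}")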

humnab commented 9 months ago

After changing TicketCancelTimeout to TransferCancelTimeout I still get an error when backing up a VM (incremental), but now it's a Conflict error:

command RemoveImageTicketVDS failed: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97\n'

This is the /var/log/ovirt-imageio/daemon.log of the transfer on the node:

2024-01-12 16:47:56,175 INFO    (Thread-138) [http] OPEN connection=138 client=local
2024-01-12 16:47:56,176 INFO    (Thread-138) [tickets] [local] ADD transfer={'dirty': True, 'ops': ['read'], 'size': 21474836480, 'sparse': False, 'inactivity_timeout': 3600, 'transfer_id': 'b1d9c8fd-ed54-4935-968f-200d8cde3c97', 'uuid': '13e7674a-7171-4df2-ad2d-1a8cf633296a', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/b1d9c8fd-ed54-4935-968f-200d8cde3c97.sock'}
2024-01-12 16:47:56,176 INFO    (Thread-138) [http] CLOSE connection=138 client=local [connection 1 ops, 0.000856 s] [dispatch 1 ops, 0.000394 s]
2024-01-12 16:47:56,281 INFO    (Thread-139) [http] OPEN connection=139 client=::ffff:192.168.63.223
2024-01-12 16:47:56,307 INFO    (Thread-140) [http] OPEN connection=140 client=::ffff:192.168.63.223
2024-01-12 16:47:56,308 INFO    (Thread-140) [extents] [::ffff:192.168.63.223] EXTENTS transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97 context=dirty
2024-01-12 16:47:56,639 INFO    (Thread-141) [http] OPEN connection=141 client=local
2024-01-12 16:47:56,640 INFO    (Thread-141) [http] CLOSE connection=141 client=local [connection 1 ops, 0.000736 s] [dispatch 1 ops, 0.000203 s]
2024-01-12 16:47:57,005 INFO    (Thread-142) [http] OPEN connection=142 client=::ffff:192.168.63.223
2024-01-12 16:47:57,009 INFO    (Thread-143) [http] OPEN connection=143 client=::ffff:192.168.63.223
2024-01-12 16:47:57,013 INFO    (Thread-144) [http] OPEN connection=144 client=::ffff:192.168.63.223
2024-01-12 16:47:58,675 INFO    (Thread-145) [http] OPEN connection=145 client=local
2024-01-12 16:47:58,676 INFO    (Thread-145) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:47:58,676 INFO    (Thread-145) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:48:58,677 ERROR   (Thread-145) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:48:58,678 INFO    (Thread-145) [http] CLOSE connection=145 client=local [connection 1 ops, 60.001998 s] [dispatch 1 ops, 60.001119 s]
2024-01-12 16:49:02,796 INFO    (Thread-146) [http] OPEN connection=146 client=local
2024-01-12 16:49:02,797 INFO    (Thread-146) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:49:02,797 INFO    (Thread-146) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:49:11,292 INFO    (Thread-140) [http] CLOSE connection=140 client=::ffff:192.168.63.223 [connection 1 ops, 74.984123 s] [dispatch 10 ops, 0.035233 s] [extents 1 ops, 0.014602 s] [read 9 ops, 0.017504 s, 3.02 MiB, 172.42 MiB/s] [read.read 9 ops, 0.012131 s, 3.02 MiB, 248.78 MiB/s] [read.write 9 ops, 0.005051 s, 3.02 MiB, 597.50 MiB/s]
2024-01-12 16:49:11,294 INFO    (Thread-142) [http] CLOSE connection=142 client=::ffff:192.168.63.223 [connection 1 ops, 74.288057 s] [dispatch 2 ops, 0.011759 s] [read 2 ops, 0.010456 s, 2.00 MiB, 191.29 MiB/s] [read.read 2 ops, 0.007767 s, 2.00 MiB, 257.51 MiB/s] [read.write 2 ops, 0.002590 s, 2.00 MiB, 772.21 MiB/s]
2024-01-12 16:49:11,294 INFO    (Thread-144) [http] CLOSE connection=144 client=::ffff:192.168.63.223 [connection 1 ops, 74.280532 s] [dispatch 1 ops, 0.008919 s] [read 1 ops, 0.006612 s, 1.00 MiB, 151.24 MiB/s] [read.read 1 ops, 0.005209 s, 1.00 MiB, 191.98 MiB/s] [read.write 1 ops, 0.001357 s, 1.00 MiB, 736.89 MiB/s]
2024-01-12 16:49:11,295 INFO    (Thread-143) [http] CLOSE connection=143 client=::ffff:192.168.63.223 [connection 1 ops, 74.284946 s] [dispatch 2 ops, 0.014787 s] [read 2 ops, 0.013882 s, 2.00 MiB, 144.07 MiB/s] [read.read 2 ops, 0.009852 s, 2.00 MiB, 203.00 MiB/s] [read.write 2 ops, 0.003917 s, 2.00 MiB, 510.60 MiB/s]
2024-01-12 16:50:02,797 ERROR   (Thread-146) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:50:02,799 INFO    (Thread-146) [http] CLOSE connection=146 client=local [connection 1 ops, 60.001764 s] [dispatch 1 ops, 60.000980 s]
2024-01-12 16:50:10,829 INFO    (Thread-147) [http] OPEN connection=147 client=local
2024-01-12 16:50:10,829 INFO    (Thread-147) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
2024-01-12 16:50:10,829 INFO    (Thread-147) [auth] Waiting until transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 ongoing operations finish
2024-01-12 16:50:16,605 ERROR   (Thread-139) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 80, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/images.py", line 146, in get
    raise http.Error(http.FORBIDDEN, str(e)) from None
ovirt_imageio._internal.http.Error: You are not allowed to access this resource: Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled
2024-01-12 16:50:16,607 INFO    (Thread-139) [http] CLOSE connection=139 client=::ffff:192.168.63.223 [connection 1 ops, 140.324541 s] [dispatch 1 ops, 140.320217 s] [read 0 ops, 140.318857 s] [read.read 1 ops, 0.013382 s, 8.00 MiB, 597.81 MiB/s] [read.write 0 ops, 140.305374 s]
2024-01-12 16:50:16,653 INFO    (Thread-147) [auth] Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled
2024-01-12 16:50:16,654 INFO    (Thread-147) [http] CLOSE connection=147 client=local [connection 1 ops, 5.825172 s] [dispatch 1 ops, 5.824329 s]

The problem started with the last update of our backup software (Veeam), and they say this error does not occur on RHV. Can someone check whether https://github.com/oVirt/ovirt-imageio/commit/03e2f602ad191da83b7254d7003639c07c9250c2 is also in the official RHV packages?

jsattelb commented 9 months ago

I still receive multiple status=409 Conflict messages throughout the backup process in the oVirt UI, but the backups seem to complete successfully. I haven't seen the backups stuck in a "finalizing" state for nearly three weeks. The response is used to inform the caller that there is still an active connection (e.g., a backup) in progress and that it should try again later. That is my understanding of it, at least, from https://github.com/oVirt/ovirt-imageio/commit/acf69ee00411e3b1cdd1c13c584759eccb939411.
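
To illustrate the "try again later" semantics, here is a minimal sketch of a client-side helper that retries ticket removal while the daemon answers 409. The /tickets/{id} path, the plain-HTTP transport, and the helper itself are assumptions for illustration, not the actual engine or Veeam code:

import time
import http.client

def remove_ticket_with_retry(host, ticket_id, retries=30, delay=2.0):
    # Hypothetical helper: DELETE the ticket, retrying while the daemon
    # answers 409 Conflict (the transfer still has ongoing operations).
    for _ in range(retries):
        con = http.client.HTTPConnection(host)
        try:
            con.request("DELETE", "/tickets/" + ticket_id)
            res = con.getresponse()
            res.read()  # drain the body so the connection closes cleanly
            if res.status != 409:
                return res.status
        finally:
            con.close()
        time.sleep(delay)
    raise TimeoutError("Timed out removing ticket " + ticket_id)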

I also receive the same backtrace referenced in the debug log when the backup request no longer exists. It's not reported in the oVirt UI, so I ignored it for now.

nirs commented 9 months ago

What we see in the log (with the fix) is no error on the imageio side. If there is an issue, it is in the engine or in the client code.

  1. Ticket was added:
2024-01-12 16:47:56,176 INFO    (Thread-138) [tickets] [local] ADD transfer={'dirty': True, 'ops': ['read'], 'size': 21474836480, 'sparse': False, 'inactivity_timeout': 3600, 'transfer_id': 'b1d9c8fd-ed54-4935-968f-200d8cde3c97', 'uuid': '13e7674a-7171-4df2-ad2d-1a8cf633296a', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/b1d9c8fd-ed54-4935-968f-200d8cde3c97.sock'}
  2. Client getting the dirty extents (which extents changed since the last backup):
2024-01-12 16:47:56,308 INFO    (Thread-140) [extents] [::ffff:192.168.63.223] EXTENTS transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97 context=dirty
  3. Ticket removed 2 seconds later, meaning that on the engine side the client finished the transfer:
2024-01-12 16:47:58,676 INFO    (Thread-145) [tickets] [local] REMOVE transfer=b1d9c8fd-ed54-4935-968f-200d8cde3c97
  4. Timeout removing the ticket, ending the transfer:
2024-01-12 16:48:58,677 ERROR   (Thread-145) [http] Client error: Timeout cancelling transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97

This means that there are active client connections, probably blocked in recv() on the client socket, waiting for the next request. In other words, the client did not close all connections to the image before ending the transfer; this is a client error.

  5. Request aborted while sending the response:
2024-01-12 16:50:16,605 ERROR   (Thread-139) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 796, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 841, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 80, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/handlers/images.py", line 146, in get
    raise http.Error(http.FORBIDDEN, str(e)) from None
ovirt_imageio._internal.http.Error: You are not allowed to access this resource: Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled

This means that the client closed the connection without reading the entire response. I'm not sure how this happens if the backup is successful, but this is again a client error.

  6. Transfer finally removed:
2024-01-12 16:50:16,607 INFO    (Thread-139) [http] CLOSE connection=139 client=::ffff:192.168.63.223 [connection 1 ops, 140.324541 s] [dispatch 1 ops, 140.320217 s] [read 0 ops, 140.318857 s] [read.read 1 ops, 0.013382 s, 8.00 MiB, 597.81 MiB/s] [read.write 0 ops, 140.305374 s]
2024-01-12 16:50:16,653 INFO    (Thread-147) [auth] Transfer b1d9c8fd-ed54-4935-968f-200d8cde3c97 was canceled

I don't know what the client code accessing the image looks like, but I guess it does this:

  1. Start transfer
  2. Get extents to find the dirty extents to backup
  3. Open multiple connections for reading the data
  4. Copy the interesting data
  5. End the transfer before closing all connections
  6. Wait until transfer ends
  7. Close all imageio connections

What the client should do instead (see the sketch after the list):

  1. Start transfer
  2. Get extents to find the dirty extents to backup
  3. Open multiple connections for reading the data
  4. Copy the interesting data
  5. Close all imageio connections
  6. End the transfer
  7. Wait until transfer ends
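
In code, the corrected ordering looks roughly like this. Every function below is a hypothetical placeholder for the backup client's own wrappers; only the ordering of steps 5-7 is the point:

def backup_disk(connection, disk):
    # start_transfer, get_dirty_extents, open_image_connection,
    # copy_extents, finalize_transfer and wait_for_transfer_end are
    # hypothetical placeholders, not real SDK or imageio names.
    transfer = start_transfer(connection, disk)        # 1. start transfer
    readers = []
    try:
        extents = get_dirty_extents(transfer)          # 2. changed blocks
        readers = [open_image_connection(transfer)     # 3. parallel readers
                   for _ in range(4)]
        copy_extents(readers, extents)                 # 4. copy the data
    finally:
        for reader in readers:                         # 5. close ALL imageio
            reader.close()                             #    connections first
    finalize_transfer(connection, transfer)            # 6. then end transfer
    wait_for_transfer_end(connection, transfer)        # 7. removal succeeds
                                                       #    immediately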

With this, the ticket will be removed immediately and the transfer will finish very quickly.

The error in imageio should be fixed, but it is not the root cause.

The latest RHV contains the change that introduced the AttributeError, but maybe Veeam did not test with an older RHV version?

nirs commented 9 months ago

@jsattelb Regarding RHV: the issue was introduced in imageio 2.4.1, released in the last oVirt 4.4.z version.

oVirt 4.5.x is called RHV 4.4 SPy (I'm not sure about the exact numbers). However, the imageio version is exactly the same in oVirt and RHV. You can ask Veeam which imageio version was tested.

humnab commented 9 months ago

@nirs Thanks for the detailed analysis and explanation! You wrote:

Latest RHV contains the fix that introduced the AttributeError, but maybe Veeam did not test with an older RHV version?

Does that mean that the latest official RHV (4.4 SP1 BU3?) has TransferCancelTimeout in tickets.py, while the latest oVirt (4.5.5) tickets.py has TicketCancelTimeout?

Veeam Support told me that 4.4 SP1 is required, but 4.4 SP1 without a Batch Update is quite old, so if this was fixed or changed in 4.4 SP1 Batch Update 2 or 3, I think the Veeam system requirements are incorrect and they tested only with a recent version.

@jsattelb I had stuck and failed backups with the first backups after the update to 12.1, but after some days those were gone, even without the fix. They still produce these errors and need more time than with 12.0, perhaps a result of the incorrect sequence they now use.

humnab commented 9 months ago

I compared ovirt_imageio/_internal/handlers/tickets.py in https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm with https://cbs.centos.org/kojifiles/packages/ovirt-imageio/2.5.0/1.el8/src/ovirt-imageio-2.5.0-1.el8.src.rpm; they are the same.

humnab commented 9 months ago

I compiled and downgraded a node to https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm, followed by a reboot. I get the same error as before: command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error, error=Server failed to perform the request, check logs\n'

nirs commented 9 months ago

I compiled and downgraded a Node to ... ovirt-imageio-2.4.8-1 ... I get the same error as before: command RemoveImageTicketVDS failed: Image daemon request failed: 'status=500, reason=Internal Server Error.

Since this issue has existed since ovirt-imageio 2.4.1, getting "500 Internal Server Error" is not surprising.

humnab commented 9 months ago

Do you know if https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm is the same version as in the official RHV packages? If yes, Veeam must have the same errors in their test environment. If not, do you have access to the official packages, and if so, can you look for a difference? Thanks; I found no way to get a subscription or trial of RHV myself.

nirs commented 9 months ago

Do you know if: https://ftp.redhat.com/redhat/linux/enterprise/8Base/en/RHEV/SRPMS/ovirt-imageio-2.4.8-1.el8ev.src.rpm is the same version as in the official RHV packages?

I think what you find on ftp.redhat.com is the official package. But more importantly, ovirt-imageio from GitHub contains exactly the same code you get in oVirt RPMs or RHV RPMs. There is no special code added in RHV.