nextcloud / recognize

👁 👂 Smart media tagging for Nextcloud: recognizes faces, objects, landscapes, music genres
https://apps.nextcloud.com/apps/recognize
GNU Affero General Public License v3.0

Background jobs stuck for many hours after database goes down #1042

Closed phirestalker closed 11 months ago

phirestalker commented 11 months ago

Which version of recognize are you using?

5.0.3

Enabled Modes

Object recognition, Face recognition

TensorFlow mode

WASM mode

Downstream App

Memories App

Which Nextcloud version do you have installed?

27.1.4

Which Operating system do you have installed?

Ubuntu 22.04

Which database are you running Nextcloud on?

MariaDB 10.5

Which Docker container are you using to run Nextcloud? (if applicable)

27.1.4

How much RAM does your server have?

32GiB

What processor Architecture does your CPU have?

x86_64

Describe the Bug

I take down all my containers with a script so that I can do a nightly backup. The script just uses a list of strings, ordered so that the databases come up first; because of this, the database also gets taken down before Nextcloud. It seems that after the database is taken down, recognize retries 3 times within a few minutes and then doesn't try again for many hours.

This is odd since Nextcloud should have some kind of initialization when it starts that would kick recognize back into gear.
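One way to avoid the situation described above is to stop containers in the reverse of their start order, so the database goes down last and comes up first. The following is only a minimal sketch, not the reporter's actual script: the container names in `START_ORDER` are hypothetical, and it assumes the containers are managed with plain `docker stop` / `docker start`.

```python
import subprocess

# Hypothetical container names, listed in start order (databases first).
START_ORDER = ["db", "redis", "nextcloud"]


def stop_order(start_order):
    """Return the containers in reverse start order, so dependencies stop last."""
    return list(reversed(start_order))


def run_all(action, names, runner=subprocess.run):
    """Run `docker <action> <name>` for each container, in the given order."""
    for name in names:
        runner(["docker", action, name], check=True)


if __name__ == "__main__":
    # Nextcloud stops before the database it depends on.
    run_all("stop", stop_order(START_ORDER))
    # ... run the nightly backup here ...
    # The database starts before Nextcloud.
    run_all("start", START_ORDER)
```

With this ordering, Nextcloud is already stopped by the time the database disappears, so background jobs should not get the chance to fail against a missing database.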

[Screenshots: 2023-11-29 at 8:20:20 AM and 2023-11-29 at 8:23:05 AM]

Expected Behavior

Recognize should check every 5 minutes for database access after a long failure.

To Reproduce

Start your database container. Start Nextcloud. Start a reindex and wait for background jobs to be scheduled. Take down the database. Wait for an hour. Start the database again.

Debug log


Level | App | Message |   | Time
-- | -- | -- | -- | --
Warning | recognize | OC\DB\Exceptions\DbalException: Failed to connect to the database: An exception occurred in the driver: SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo for db failed: Name or service not known |   | 10 hours ago
Warning | recognize | OC\DB\Exceptions\DbalException: Failed to connect to the database: An exception occurred in the driver: SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo for db failed: Name or service not known |   | 10 hours ago
Warning | recognize | OC\DB\Exceptions\DbalException: An exception occurred while executing a query: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away |   | 10 hours ago

github-actions[bot] commented 11 months ago

Hello :wave:

Thank you for taking the time to open this issue with recognize. I know it's frustrating when software causes problems. You have made the right choice to come here and open an issue to make sure your problem gets looked at and if possible solved. I try to answer all issues and if possible fix all bugs here, but it sometimes takes a while until I get to it. Until then, please be patient. Note also that GitHub is a place where people meet to make software better together. Nobody here is under any obligation to help you, solve your problems or deliver on any expectations or demands you may have, but if enough people come together we can collaborate to make this software better. For everyone. Thus, if you can, you could also look at other issues to see whether you can help other people with your knowledge and experience. If you have coding experience it would also be awesome if you could step up to dive into the code and try to fix the odd bug yourself. Everyone will be thankful for extra helping hands! One last word: If you feel, at any point, like you need to vent, this is not the place for it; you can go to the forum, to twitter or somewhere else. But this is a technical issue tracker, so please make sure to focus on the tech and keep your opinions to yourself. (Also see our Code of Conduct. Really.)

I look forward to working with you on this issue. Cheers :blue_heart:

marcelklehr commented 11 months ago

Hello @phirestalker

This is a known issue. The background is that Nextcloud's core background job management defers jobs for 12h if they fail. There is currently nothing we can do in recognize to rectify that.

There is a discussion about this over here already: https://github.com/nextcloud/recognize/discussions/821
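To put the 12h deferral in concrete terms, here is a toy model of the arithmetic. It is not Nextcloud's scheduling code: it simply assumes a fixed 12-hour deferral after a failure, as described above, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a failed job is deferred by a fixed 12 hours, as described above.
DEFERRAL = timedelta(hours=12)


def next_attempt(failed_at, deferral=DEFERRAL):
    """Earliest time the deferred job would be attempted again."""
    return failed_at + deferral


def idle_after_recovery(failed_at, db_back_at, deferral=DEFERRAL):
    """How long the job stays idle after the database is reachable again."""
    return max(next_attempt(failed_at, deferral) - db_back_at, timedelta(0))


# Hypothetical timeline: the job fails at 02:00, the database is back at 03:00.
failed_at = datetime(2023, 11, 29, 2, 0)
db_back_at = datetime(2023, 11, 29, 3, 0)

print(next_attempt(failed_at))                    # 2023-11-29 14:00:00
print(idle_after_recovery(failed_at, db_back_at))  # 11:00:00
```

Under that assumption, a one-hour database outage still leaves the job idle for another eleven hours, which is consistent with the "many hours" gap reported in this issue.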