grafana-toolbox / grafana-wtf

Grep through all Grafana entities in the spirit of git-wtf.
GNU Affero General Public License v3.0

Database is locked in thread ... error #111

Open JensRichnow opened 7 months ago

JensRichnow commented 7 months ago

I'm using v0.17.0. When running grafana-wtf --drop-cache find xyz I get a DB lock error:

2023-11-22 14:11:14,368 [grafana_wtf.core                    ] INFO   : Fetching dashboards in parallel with 5 concurrent requests                                             | 0/578 [00:00<?, ?it/s]
2023-11-22 14:11:19,952 [requests_cache.backends.sqlite      ] WARNING: Database is locked in thread 6272069632; retrying (1/3)
2023-11-22 14:11:25,210 [requests_cache.backends.sqlite      ] WARNING: Database is locked in thread 6204764160; retrying (1/3)
2023-11-22 14:11:30,478 [requests_cache.backends.sqlite      ] WARNING: Database is locked in thread 6221590528; retrying (1/3)
2023-11-22 14:11:35,742 [requests_cache.backends.sqlite      ] WARNING: Database is locked in thread 6238416896; retrying (1/3)
2023-11-22 14:11:41,009 [requests_cache.backends.sqlite      ] WARNING: Database is locked in thread 6255243264; retrying (1/3)

Running on Apple M2 Max.

How can I recover from that? Cheers

JensRichnow commented 7 months ago

Update on the above:

Segmentation fault: 11                                                                                                                                                 | 27/578 [00:02<00:38, 14.16it/s]
/Users/jens/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
amotl commented 7 months ago

Dear Jens,

thank you for writing in, and apologies for the late reply.

grafana-wtf was conceived as a quick tool to fulfill some of our needs and to exercise grafana-client. It looks like we did not get the code right when combining concurrent HTTP downloads with the SQLite-based request cache -- apparently it "just worked" for us and others until now, when you reported this flaw.

We will have to investigate where corresponding locks are missing that would prevent concurrent access to shared resources. In the meantime, you may have success with the --concurrency= command-line option to decrease download concurrency. Please let us know how this goes.

With kind regards, Andreas.

amotl commented 7 months ago

There are a few discussions on the issue tracker of the requests-cache package about this topic.

In general, they indicate that SQLite-based caching has limits in high-concurrency situations, but there are also specific recommendations which may need to be applied to the grafana-wtf code base. If you have the capacity to dive into the details, we would be more than happy.
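The failure mode itself can be reproduced with the Python standard library alone, independent of requests-cache (an illustration, not grafana-wtf code): one connection holding a write transaction makes a second writer fail immediately when its lock timeout is zero.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache.sqlite")

# isolation_level=None means autocommit; transactions are controlled explicitly.
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
writer.execute("CREATE TABLE responses (key TEXT, value BLOB)")
writer.execute("BEGIN IMMEDIATE")  # acquire and hold the write lock

other = sqlite3.connect(path, timeout=0, isolation_level=None)
msg = ""
try:
    other.execute("BEGIN IMMEDIATE")  # second writer cannot get the lock
except sqlite3.OperationalError as exc:
    msg = str(exc)
print(msg)  # database is locked
```

requests-cache retries in this situation instead of failing outright, which is where the "retrying (1/3)" warnings above come from.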

That it already trips at a concurrency of 5 indicates that either the grafana-wtf code needs to be improved, as mentioned above, or that you have discovered something that makes this much more likely to trip on M2 machines, even at low concurrency. In that case, it may be worth reporting to the requests-cache developers as well.

Another recommendation I've picked up from quickly scanning those discussions would be to use a different cache backend. Please let us know if you have such needs; we think it will be easy to prepare grafana-wtf to offer different cache backends on behalf of requests-cache.

amotl commented 7 months ago

There is another option we may want to consider.

grafana-wtf --drop-cache

Some people are also using that option in order to emulate a --no-cache behavior.

Maybe it is the right time to finally offer such an option, so that grafana-wtf can also be used more easily as a library, without imposing too much, such as making request caching obligatory.

In this spirit, a --no-cache option, once implemented, would turn off caching entirely, which might resolve your issue in a different manner.
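A hypothetical sketch of such a flag (option name and wiring are assumptions, not the shipped CLI):

```python
import argparse

parser = argparse.ArgumentParser(prog="grafana-wtf")
# Hypothetical flag: disable request caching entirely,
# instead of only dropping the cache before each run.
parser.add_argument("--no-cache", action="store_true",
                    help="disable the requests-cache layer")

args = parser.parse_args(["--no-cache"])
if args.no_cache:
    # Library use case: skip cache setup entirely.
    print("caching disabled")
```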

amotl commented 7 months ago

Hi again,

taking all this into consideration, specifically @JWCook's comment at https://github.com/requests-cache/requests-cache/issues/870#issuecomment-1716370826, ...

The solution is to use CachedSession instead, which is thread-safe.

... and reviewing grafana-wtf's code at the spots where cache enablement and the concurrent downloads happen, we can see it does not use CachedSession at all.

So, we figure a good first measure would be to actually use requests-cache's CachedSession, which may improve the situation right away, if we are lucky.

With kind regards, Andreas.

JWCook commented 7 months ago

It looks like the actual requests are sent by the grafana-client library, so I'm guessing that's why install_cache() is being used here. That's a convenient way to patch an external library you don't control, but as you found, it's not thread-safe.

I don't think you need to switch backends; SQLite should be more than able to handle this. For reference, though, there are some recommendations on choosing a different backend here.

Thankfully, there is a requests.Session object stored on GrafanaClient.s. You can replace that with a requests_cache.CachedSession object, which would look something like this:

from grafana_client import GrafanaApi
from requests_cache import CachedSession

# Swap the plain requests.Session for a thread-safe CachedSession.
api = GrafanaApi(...)
session = CachedSession(cache_name=__appname__, expire_after=expire_after, use_cache_dir=True)
session.headers['User-Agent'] = api.client.user_agent
api.client.s = session

Good luck! Feel free to open an issue or discussion on requests-cache if you get stuck.

JensRichnow commented 7 months ago

Thanks @amotl! Makes perfect sense, and setting --concurrency=1 for the time being worked nicely!

amotl commented 7 months ago

Hi there,

thank you both for your excellent replies. Jordan, the suggestion to patch GrafanaClient accordingly sounds good, and at the same time, I am happy to hear that Jens has been able to use --concurrency=1 as a workaround.

Cheers, Andreas.

amotl commented 3 months ago

Hi again. There is a patch now, which may improve the situation.

amotl commented 3 months ago

Dear Jens and Jordan,

grafana-wtf 0.19.0 has been released, including corresponding patches to improve the situation. Can you tell us if it works better for you now? Thanks for your support!

With kind regards, Andreas.

edgarasg commented 2 months ago

Issue still persists with 0.19.0

amotl commented 2 months ago

Dear Edgaras,

thank you for writing in. Do you think grafana-wtf should start providing an alternative cache backend, which is less problematic in concurrent access situations?

@JWCook and @Ousret: Can you spot any obvious problems with the current implementation, after it was improved based on your feedback? Maybe we are still doing something unfortunate or even completely wrong?

With kind regards, Andreas.

JWCook commented 2 months ago

@amotl I looked over your changes and they look good to me. This issue could potentially be fixable with some backend settings, for example SQLite write-ahead logging. This comes with a few tradeoffs, but notably it allows read operations to not block writes. This can be enabled by passing wal=True to CachedSession.
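For context, write-ahead logging is a plain SQLite journal mode; a minimal standard-library sketch (not grafana-wtf code) of what `wal=True` enables under the hood:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache.sqlite")
conn = sqlite3.connect(path)
# Switch from the default rollback journal to write-ahead logging;
# with WAL, readers do not block the writer and vice versa.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal
conn.close()
```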

It's also worth mentioning that I can't make any guarantees about niquests' compatibility with requests-cache. From a quick look at its Session class, it should behave the same as requests.Session, but it may or may not remain that way.

I can help debug further, but I'll need some more information. @edgarasg can you provide the exact error message you're getting, and post a complete example that reproduces this issue? Are you getting exceptions, or just log warnings? And @amotl, roughly how many threads are making concurrent requests? Is there a user-configurable thread limit, automatic based on the CPU, or other?

amotl commented 2 months ago

Dear Jordan,

thanks for your swift reply. We used your suggestion about wal=True right away.

grafana-wtf 0.19.1 includes this improvement. May we ask you to use that version for proceeding into further tests, @edgarasg?

It's also worth mentioning that I can't make any guarantees about niquests' compatibility with requests-cache. From a quick look at its Session class, it should behave the same as requests.Session, but it may or may not remain that way.

This is right on the spot. There could be something fishy, but until further notice, I am also assuming it will "just work". Do you have any idea why this could go south in concurrency situations, for example through unfortunate side effects of Niquests being async, @Ousret?

I can help debug further, but I'll need some more information.

Actually, I didn't mean to bother you too much with that issue, so I appreciate your reply and offer very much. Let's hear back what @edgarasg might report about 0.19.1 before conducting another round of investigations.

How many threads are making concurrent requests? Is there a user-configurable thread limit, automatic based on the CPU, or other?

The default value for the concurrency setting is 5. It can be adjusted by using the --concurrency command-line argument.

https://github.com/panodata/grafana-wtf/blob/c28d961d04e2e81219770d0220ac902690fd6465/grafana_wtf/commands.py#L55
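In sketch form (function and variable names are hypothetical, not the actual grafana-wtf code), the download loop amounts to a bounded thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_dashboard(uid):
    # Placeholder for the real per-dashboard HTTP request.
    return f"dashboard-{uid}"

concurrency = 5  # default; adjustable via --concurrency
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    results = list(pool.map(fetch_dashboard, ["a", "b", "c"]))
print(results)  # ['dashboard-a', 'dashboard-b', 'dashboard-c']
```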

With kind regards, Andreas.

Ousret commented 2 months ago

It's also worth mentioning that I can't make any guarantees about niquests' compatibility with requests-cache. From a quick look at its Session class, it should behave the same as requests.Session, but it may or may not remain that way.

Niquests advertises itself as a drop-in replacement. It would lose its purpose if this were broken or if it diverged too much. We are making sure it works.

This is right on the spot. There could be something fishy, but up until further notice, I am also assuming "it will just work". Do you have any idea why that could go south in concurrency situations, because of unfortunate side effects because Niquests is async, @Ousret?

Niquests is thread-safe in a synchronous context and task-safe in an asynchronous one, so there is nothing to be concerned about there. I believe grafana-wtf does not do async, so the thread-safety aspect applies. And yes, it should "just work". We will keep in close contact in case there is a need to improve.

regards,

edgarasg commented 2 months ago

Thanks for the swift response. Now I'm getting this error:

❯ grafana-wtf find Phj18H0nk --drop-cache
[grafana_wtf.core                    ] INFO    : Fetching dashboards in parallel with 5 concurrent requests                                                           | 0/479 [00:00<?, ?it/s]
[1]    9213 segmentation fault  grafana-wtf find Phj18H0nk --drop-cache
/Users/edgarasg/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
amotl commented 2 months ago

Sorry to hear that, and thanks for reporting. Please keep using --concurrency=1 as a workaround until this problem is resolved properly.