HybridCache failed to flush async write

imazen / imageflow-dotnet-server

A super-fast image server to speed up your site - deploy as a microservice, serverless, or embeddable.

https://docs.imageflow.io

GNU Affero General Public License v3.0

246 stars 33 forks source link

HybridCache failed to flush async write #47

Closed edmacdonald closed 3 years ago

edmacdonald commented 3 years ago

I'm seeing tons of these errors... There is free disk space on the instance, but the cache size limit may have been reached. Any thoughts?

HybridCache failed to flush async write, "System.InvalidOperationException: Startup() can only be called once
   at Imazen.HybridCache.MetaStore.WriteLog.Startup()
   at Imazen.HybridCache.MetaStore.Shard.GetLoadedDict()
   at Imazen.HybridCache.MetaStore.Shard.CreateRecordIfSpace(String relativePath, String contentType, Int64 recordDiskSpace, DateTime createdDate, Int32 accessCountKey, Int64 diskSpaceLimit)
   at Imazen.HybridCache.CleanupManager.TryReserveSpace(CacheEntry cacheEntry, String contentType, Int32 byteCount, Boolean allowEviction, AsyncLockProvider writeLocks, CancellationToken cancellationToken)

my config...

            services.AddImageflowHybridCache(
                new HybridCacheOptions(cacheFolder)
                {
                    // How long after a file is created before it can be deleted
                    MinAgeToDelete = TimeSpan.FromSeconds(10),
                    // How much RAM to use for the write queue before switching to synchronous writes
                    QueueSizeLimitInBytes = 100 * 1000 * 1000,
                    // The maximum size of the cache (16GB)
                    CacheSizeLimitInBytes = 16L * 1024 * 1024 * 1024,
                });

lilith commented 3 years ago

This is truly bizarre. That error message should be impossible, since Startup is only called from within an exclusive lock in GetLoadedDict.

The only entrance to that code is here: https://github.com/imazen/imageflow-dotnet-server/blob/5b5d32fbb6e8532915bbf1ef9ddfc5f9b4b90d31/src/Imazen.HybridCache/MetaStore/Shard.cs#L30-L40

And the lock primitive code is here: https://github.com/imazen/imageflow-dotnet-server/blob/main/src/Imazen.Common/Concurrency/BasicAsyncLock.cs

It's pretty simple code and I can't see any code path that would permit that error to occur. Can you?

lilith commented 3 years ago

BasicAsyncLock is identical to what is described here: https://devblogs.microsoft.com/pfxteam/building-async-coordination-primitives-part-6-asynclock/

lilith commented 3 years ago

BasicAsyncLock is identical to what is described here: https://devblogs.microsoft.com/pfxteam/building-async-coordination-primitives-part-6-asynclock/

Actually it's not. Their AsyncLock is based on lock(), not SemaphoreSlim. https://devblogs.microsoft.com/pfxteam/building-async-coordination-primitives-part-5-asyncsemaphore/

I suppose I could rebuild on top of AsyncSemaphore from https://github.com/microsoft/vs-threading/blob/main/src/Microsoft.VisualStudio.Threading/AsyncSemaphore.cs

... but the implementation I'm using seems very popular and I'd like to understand why it doesn't work.

lilith commented 3 years ago

Ah, I got it from here: https://www.hanselman.com/blog/comparing-two-techniques-in-net-asynchronous-coordination-primitives

edmacdonald commented 3 years ago

I have one instance that has been up since March 5th. It started seeing these errors 3am on March 9th, continued for about 24 hours and then suddenly stopped. I suspect this is happening when the cache reaches the 16GB limit I configured, and the sudden stop may be temporary due to some clean up having happened.

Scanning the logs, it looks like every error was preceded by a message indicating a cache hit

2021-03-10 02:38:12.831 +00:00 [DBG] Serving from HybridCacheService /remote/aHR0cHM6Ly9tLm1lZGlhLWFtYXpvbi5jb20vaW1hZ2VzL0kvNTF3ckZBRmtrT0wuanBn.scri_ibCwpc.jpg?maxwidth=330&mode=max&scale=downscaleonly

And every cache hit appears to be followed by the error. Conversely, all cache-misses seemed to work fine.

lilith commented 3 years ago

That code path should be called for every async write (and sync writes if configured), regardless of the size of the cache. Until Startup() is called, the app doesn't even know the size of the cache.

It's bizarre that this only started happening after 4 days and then resolved.

lilith commented 3 years ago

What is your CLR version and OS?

lilith commented 3 years ago

What's truly bizarre is that even if the async lock was defective, you should have only gotten a handful of errors, since there's a if (dict != null) return dict; call prior to the lock. And there's no way that it ran for 4 days without initializing dict.

lilith commented 3 years ago

What version of Imageflow.Server were you running?

lilith commented 3 years ago

Could you send me your entire log file? You can email it to lilith@imazen.io

lilith commented 3 years ago

So a particularly good piece of Merlot BellaVitano triggered an epiphany. If the Startup() routine persistently throws an exception, it could cause this issue. However, that error should also appear in the logs, and it doesn't explain why the issue didn't occur until day 4. Hopefully the logs will shed some light on this.

edmacdonald commented 3 years ago

Hi Lilith,

Sorry it's taken so long to reply. Here is the info, I'll send you logs shortly, but upon closer examination, it appears that a sudden restart is both the cause and the cure. The server is recycling at 3am, and on the 4th day, that put it into the error state until 3am the next day when the restart cleared the issue.

ImageFlowServer 0.5.6 NetCore 3.1.8 Windows 2019 Datacenter 10.0.17763 N/A Build 17763

Thanks, -Ed

lilith commented 3 years ago

This is fixed in 0.5.8

The logs revealed that the previous error was a file contention problem. A previous process was still locking the files when HybridCache started, preventing it from loading the cache database. This fix allows it to retry even if it fails the first time. Presumably the locks were only temporary and any errors will be intermittent and only affect cache efficiency briefly.