Closed: edmacdonald closed this issue 3 years ago
This is truly bizarre. That error message should be impossible, since Startup is only called from within an exclusive lock in GetLoadedDict.
The only entrance to that code is here: https://github.com/imazen/imageflow-dotnet-server/blob/5b5d32fbb6e8532915bbf1ef9ddfc5f9b4b90d31/src/Imazen.HybridCache/MetaStore/Shard.cs#L30-L40
And the lock primitive code is here: https://github.com/imazen/imageflow-dotnet-server/blob/main/src/Imazen.Common/Concurrency/BasicAsyncLock.cs
It's pretty simple code and I can't see any code path that would permit that error to occur. Can you?
BasicAsyncLock is identical to what is described here: https://devblogs.microsoft.com/pfxteam/building-async-coordination-primitives-part-6-asynclock/
Actually, it's not. Their AsyncLock is built on their AsyncSemaphore, which uses lock() internally rather than SemaphoreSlim: https://devblogs.microsoft.com/pfxteam/building-async-coordination-primitives-part-5-asyncsemaphore/
I suppose I could rebuild on top of AsyncSemaphore from https://github.com/microsoft/vs-threading/blob/main/src/Microsoft.VisualStudio.Threading/AsyncSemaphore.cs
... but the implementation I'm using seems very popular and I'd like to understand why it doesn't work.
I have one instance that has been up since March 5th. It started seeing these errors at 3am on March 9th, continued for about 24 hours, and then suddenly stopped. I suspect this is happening when the cache reaches the 16GB limit I configured, and the sudden stop may be temporary due to some cleanup having happened.
Scanning the logs, it looks like every error was preceded by a message indicating a cache hit:
2021-03-10 02:38:12.831 +00:00 [DBG] Serving from HybridCacheService /remote/aHR0cHM6Ly9tLm1lZGlhLWFtYXpvbi5jb20vaW1hZ2VzL0kvNTF3ckZBRmtrT0wuanBn.scri_ibCwpc.jpg?maxwidth=330&mode=max&scale=downscaleonly
And every cache hit appears to be followed by the error. Conversely, all cache-misses seemed to work fine.
That code path should be called for every async write (and sync writes if configured), regardless of the size of the cache. Until Startup() is called, the app doesn't even know the size of the cache.
It's bizarre that this only started happening after 4 days and then resolved.
What is your CLR version and OS?
What's truly bizarre is that even if the async lock was defective, you should have only gotten a handful of errors, since there's an if (dict != null) return dict; check prior to the lock. And there's no way that it ran for 4 days without initializing dict.
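For readers following along, the check-then-lock pattern described above can be sketched in a few lines. This is a minimal Python illustration and not the actual C# code in Shard.cs: the class name, the `_startup` helper, the `startup_calls` counter, and the re-check inside the lock are all assumptions made for the example, and `asyncio.Lock` stands in for BasicAsyncLock.

```python
import asyncio


class Shard:
    """Hypothetical stand-in for a metadata shard; all names are illustrative."""

    def __init__(self):
        self._dict = None            # loaded lazily on first use
        self._lock = asyncio.Lock()  # plays the role of the async lock
        self.startup_calls = 0       # instrumentation for this demo only

    async def _startup(self):
        self.startup_calls += 1
        await asyncio.sleep(0.01)    # simulate reading the cache database
        return {"loaded": True}

    async def get_loaded_dict(self):
        # Fast path: once loaded, callers never touch the lock.
        if self._dict is not None:
            return self._dict
        async with self._lock:
            # Re-check: another caller may have finished loading
            # while this one was waiting for the lock.
            if self._dict is None:
                self._dict = await self._startup()
            return self._dict


async def demo():
    shard = Shard()
    # 50 concurrent callers all arrive before the first load completes.
    results = await asyncio.gather(*(shard.get_loaded_dict() for _ in range(50)))
    return shard.startup_calls, all(r is results[0] for r in results)


startup_calls, same_object = asyncio.run(demo())
print(startup_calls, same_object)
```

With a working lock, only the first caller runs the startup routine, and every later caller returns through the fast path. That is why a defective lock alone should only produce errors during the brief window before the first successful load.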
What version of Imageflow.Server were you running?
Could you send me your entire log file? You can email it to lilith@imazen.io
So a particularly good piece of Merlot BellaVitano triggered an epiphany. If the Startup() routine persistently throws an exception, it could cause this issue. However, that error should also appear in the logs, and it doesn't explain why the issue didn't occur until day 4. Hopefully the logs will shed some light on this.
Hi Lilith,
Sorry it's taken so long to reply. Here is the info; I'll send you logs shortly. Upon closer examination, it appears that a sudden restart is both the cause and the cure. The server is recycling at 3am, and on the 4th day, that put it into the error state until 3am the next day when the restart cleared the issue.
ImageFlowServer 0.5.6
NetCore 3.1.8
Windows 2019 Datacenter 10.0.17763 N/A Build 17763
Thanks, -Ed
This is fixed in 0.5.8
The logs revealed that the previous error was a file contention problem. A previous process was still locking the files when HybridCache started, preventing it from loading the cache database. This fix allows it to retry even if it fails the first time. Presumably the locks were only temporary and any errors will be intermittent and only affect cache efficiency briefly.
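To illustrate the retry behavior in general terms, here is a small Python sketch. It is not the actual 0.5.8 change. It assumes a hypothetical `Shard` whose startup fails a fixed number of times (simulating another process still holding the file locks) and shows one way to let later callers retry: assign the loaded state only after a successful load, so a failure leaves nothing behind.

```python
import asyncio


class Shard:
    """Hypothetical shard whose startup fails a set number of times."""

    def __init__(self, failures_before_success):
        self._dict = None
        self._lock = asyncio.Lock()
        self._remaining_failures = failures_before_success
        self.startup_attempts = 0    # instrumentation for this demo only

    async def _startup(self):
        self.startup_attempts += 1
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            # Simulated failure: another process still holds the file locks.
            raise OSError("cache database is locked by another process")
        return {"loaded": True}

    async def get_loaded_dict(self):
        if self._dict is not None:
            return self._dict
        async with self._lock:
            if self._dict is None:
                # Assign only after a successful load, so a failure leaves
                # no partial state and the next caller simply retries.
                self._dict = await self._startup()
            return self._dict


async def demo():
    shard = Shard(failures_before_success=2)
    outcomes = []
    for _ in range(4):
        try:
            await shard.get_loaded_dict()
            outcomes.append("ok")
        except OSError:
            outcomes.append("failed")
    return outcomes, shard.startup_attempts


outcomes, startup_attempts = asyncio.run(demo())
print(outcomes, startup_attempts)
```

In this sketch the first two requests fail while the simulated lock is held, the third loads the database, and the fourth is served from the fast path without another startup attempt. This matches the expectation that errors are intermittent and only affect cache efficiency briefly.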
I'm seeing tons of these errors... There is free disk space on the instance, but the cache size limit may have been reached. Any thoughts?
my config...