opral / inlang-message-sdk


File lock contention fails on large files - lock is held while events propagate. #54

Open jldec opened 1 month ago

jldec commented 1 month ago

MESDK-92

Context

Observed with the 10k load test, with machine translate running in another process.

The watcher process (which doesn't write) held the lock for so long that the CLI translate process overrode it after hitting max retries.

Screenshot 2024-05-10 at 08.56.59.png

Proposal

jldec commented 1 month ago

repro'ed again and discussed 1:1 with @martin.lysk1

  1. Contention does not appear to be happening during file writes, which are faster than the lock timeout. It only happens during the follow-up loads. This suggests there is no danger of file corruption. 👍
  2. During a load, a process with Solid effect subscribers holds the lock while propagating all events, e.g. to getAll() subscribers, and this can take a long time.

@martin.lysk1: Question: could we release the lock earlier, say right after the load has read the file, without waiting for all the events? This would reduce the exposure to lock contention.
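A minimal sketch of that ordering; `acquireLock`, `readFile`, `parseMessages`, and `emitEvents` are hypothetical names for illustration, not the SDK's actual API:

```ts
// Hypothetical helpers, not the SDK's real API.
declare function acquireLock(path: string): Promise<{ release: () => Promise<void> }>
declare function readFile(path: string): Promise<string>
declare function parseMessages(raw: string): unknown[]
declare function emitEvents(messages: unknown[]): void

async function loadWithEarlyRelease(path: string): Promise<void> {
	const lock = await acquireLock(path)
	let raw: string
	try {
		// The file read is the only step that actually needs the lock.
		raw = await readFile(path)
	} finally {
		// Release before any reactivity work, so slow subscribers
		// can no longer extend the lock hold time.
		await lock.release()
	}
	// Event propagation (e.g. to getAll() subscribers) runs lock-free.
	emitEvents(parseMessages(raw))
}
```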

martin-lysk commented 1 month ago

As long as reactivityMap is in the core of the SDK, one option would be to use a Solid batch to reduce the number of signal updates produced during message loading. This should reduce $KEYS and $VALUES triggers in ReactivityMap. If we go that route, we should also switch back to await 0 instead of setTimeout(0…) to avoid side effects from the UI kicking in via the macrotask loop.
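For illustration, a minimal sketch of that batching using solid-js's `batch` helper; the setter and loader names are hypothetical stand-ins for the code that writes loaded messages into reactivityMap:

```ts
import { batch } from "solid-js"

// Hypothetical stand-ins for the SDK's message-loading internals.
declare function setMessage(id: string, message: unknown): void
declare function loadedMessages(): Map<string, unknown>

function applyLoadedMessages(): void {
	// batch() defers signal notifications until the callback returns,
	// so $KEYS / $VALUES subscribers are notified once instead of once
	// per message.
	batch(() => {
		for (const [id, message] of loadedMessages()) {
			setMessage(id, message)
		}
	})
}
```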

martin-lysk commented 1 month ago

As an alternative, I would suggest the following as a next step towards MESDK-67:

Decouple Solid and introduce a messageStore that propagates changes using the delegate pattern to a reactiveMessageQuery, which would then serve as our thin layer to the UI. This would allow us to remove reactivityMap from messageStore, and we could use the delegate to propagate batched changes.
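A rough sketch of that decoupling; all names and shapes here are assumptions for illustration, not the final MESDK-67 design:

```ts
type Message = { id: string }

// Hypothetical delegate interface: the store pushes batched changes
// through it instead of writing signals directly.
interface MessageStoreDelegate {
	onMessagesChanged(changed: Message[]): void
}

class MessageStore {
	private messages = new Map<string, Message>()
	private delegate?: MessageStoreDelegate

	setDelegate(delegate: MessageStoreDelegate): void {
		this.delegate = delegate
	}

	// A load applies all changes first, then notifies the delegate once
	// with the whole batch.
	applyLoad(loaded: Message[]): void {
		for (const m of loaded) this.messages.set(m.id, m)
		this.delegate?.onMessagesChanged(loaded)
	}
}

// The thin reactive layer towards the UI: the only place that would
// touch Solid (e.g. wrapping signal updates in batch()).
class ReactiveMessageQuery implements MessageStoreDelegate {
	onMessagesChanged(changed: Message[]): void {
		// update Solid signals/stores from the batched changes here
	}
}
```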

jldec commented 1 month ago

@martin.lysk1, you're right - this lock problem is tightly coupled with the reactivity implementation.

The question is whether we fix it now for the old plugins, or focus on implementing experimental persistence for our v2 message structure first.

I would prefer to do the latter, but I'm open to discussion if anyone disagrees.

martin-lysk commented 1 month ago

The root cause of this particular issue is the missing stale detection for the lock; read performance should be tackled in a separate ticket. I briefly discussed this with Jan and we decided to add the 'ax' flag option to the memoryFS and to the nodishFileSystem type. That allows us to use a lockfile instead of a lockfolder, and we will also be able to detect stale lock files (compare: LIX-63). Keeping this in the backlog for now.
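For reference, a minimal sketch of lockfile acquisition via an exclusive-create flag plus stale detection, assuming a node-like fs; the timeout value and the steal logic are illustrative (the stat-then-rm steal is not atomic, so a real implementation would need more care):

```ts
import fs from "node:fs/promises"

// Illustrative staleness threshold, not a value from the SDK.
const STALE_AFTER_MS = 10_000

async function acquireLockfile(lockPath: string): Promise<void> {
	try {
		// "ax" fails with EEXIST if the file already exists, which makes
		// lock acquisition a single atomic operation (no lockfolder needed).
		const handle = await fs.open(lockPath, "ax")
		await handle.close()
	} catch (error) {
		if ((error as NodeJS.ErrnoException).code !== "EEXIST") throw error
		// Lock exists: treat it as stale if the file is old enough.
		const { mtimeMs } = await fs.stat(lockPath)
		if (Date.now() - mtimeMs > STALE_AFTER_MS) {
			await fs.rm(lockPath, { force: true }) // steal the stale lock
			return acquireLockfile(lockPath)
		}
		throw new Error(`lock is held: ${lockPath}`)
	}
}

async function releaseLockfile(lockPath: string): Promise<void> {
	await fs.rm(lockPath, { force: true })
}
```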