SkyLined opened 3 weeks ago
Given this is only an issue for JS Scripting, this issue might need to be moved to the add-ons repo unless core is amenable to putting a lock on the shared cache for all languages. A lock on the private cache might be advisable as well given timers. If not, locks will need to be implemented in the JS Scripting add-on similar to the locks already in place for timers, assuming such locks are feasible.
Note that there are also no locks in place when rules call other rules, so if two rules call the same third rule at the same time the exception will occur. As far as I've been told, it is not technically feasible to put locks in place in this context, at least not in the JS Scripting add-on by itself.
@florian-h05, you are most familiar with the locking stuff that was done for JS Scripting so far, wdyt?
As a work-around, I have been trying to create String Items to store values (as JSON) and Groups to simulate arrays of such values. This mostly avoids the issue: I very infrequently encounter this exception now. It doesn't completely resolve it, and it makes the Items page nearly impossible to use (as explained in https://github.com/openhab/openhab-webui/issues/2819). If I come up with a better work-around, I'll let you know.
Wrt the solutions proposed by the OP, 2 and 3 are not an option. 1 seems like the best of the proposed solutions; however, I think any solution is difficult to implement.
Though I wonder if the shared cache does need a synchronisation in general, because even if the script engine allows multi-threaded access to its context, modifying something in the shared cache by two different threads can lead to bad side-effects. But in that case the difficulty would be that the lock for JS Scripting would have to be the lock of the engine that stored the value in the shared cache.
If a solution is only to be implemented in JS Scripting (which however would not synchronise access from scripts in languages other than JS), I think we could wrap core's shared cache and associate the lock of the engine with the value in the shared cache when the value is stored. Then on value retrieval, that lock would need to be acquired, and the script engine would need to release it after execution. This way we always have an "owner" of anything in the shared cache and synchronise on the lock of that owner.
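A rough sketch of that idea (purely illustrative, not actual add-on code; `OwnedSharedCache` and the lock objects are hypothetical stand-ins for the engine locks the add-on would use):

```javascript
// Illustrative sketch only: a wrapper that records an "owner" lock for each
// value stored in the shared cache. A reader must acquire the owner's lock
// before using the value and release it when its own execution finishes.
// In the real add-on the lock would be the storing engine's Java lock.
class OwnedSharedCache {
  constructor() {
    this.entries = new Map(); // key -> { value, ownerLock }
  }

  // Called by the storing script: remember which engine owns the value.
  put(key, value, ownerLock) {
    this.entries.set(key, { value, ownerLock });
  }

  // Called by any script: acquire the owner's lock before handing out the
  // value. The caller must invoke release() once it is done executing.
  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return null;
    entry.ownerLock.lock(); // blocks while the owner engine is executing
    return { value: entry.value, release: () => entry.ownerLock.unlock() };
  }
}
```

The downside is exactly the risk discussed below: a reader can be blocked for as long as the owner engine keeps running.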
GraalJS's approach of disallowing multi-threaded access is sometimes a PITA, but I understand that decision, given the side-effects that can occur and given that NodeJS is single-threaded as well.
A lock on the private cache might be advisable as well given timers.
I don't think that is needed, as the values in the private cache can only be accessed by their "owner" engine. As timer execution is synchronised in the engine, it is not possible for a timer to execute while the script is active somewhere else.
Though I wonder if the shared cache does need a synchronisation in general, because even if the script engine allows multi-threaded access to its context, modifying something in the shared cache by two different threads can lead to bad side-effects.
I assumed calls to put would already be synchronized at least. If they're not, I agree that's a problem above and beyond GraalVM's restrictions.
it is not possible for a timer to execute while the script is active somewhere else.
:+1: I wasn't thinking that through. Of course there's no problem with the private cache given the locks already in place for timers.
I assumed calls to put would already be synchronized at least.
They are, just checked the core code :+1:
@J-N-K May I ask for your help here?
The best solution I can currently think of is that every value in the shared cache has an owner that can provide a lock, which has to be acquired if another script accesses the value. The lock has to be unlocked after the execution of that other script. However, this poses the risk that scripts block others from execution for a possibly long time.
this poses the risk that scripts block others from execution for a possibly long time.
It sounds like that solution will cause unexpected and unexplained (i.e. hard-to-debug) freezes in rules. For instance, I have a rule that handles calendar events. The rule uses timers to do things a number of minutes before/after an event starts/ends. These timers are set as soon as the event is seen (i.e. it is one of the next two events on the calendar). This script may therefore set these timers hours or even days before they fire. This rule accesses values from the shared cache before setting these timers, so it will block access to those values pretty much continuously. If anyone has a similar rule, they will find themselves with a completely broken system when this fix is released. I'm sure that's not ideal.
This rule accesses values from the shared cache before setting these timers, so it will block access to those values pretty much continuously.
Wrong. The shared cache is only blocked while the script engine is executing code. In the case of timers, you run your script and create the timer in that run. The timer is then scheduled, and once the timer expires the callback is run by the same engine. In between, this engine can do other stuff, and if it is not doing anything, the values it owns in the shared cache can be accessed by others.
I don't fully understand the issue. The shared cache uses locks on put/get/remove, so where do you see a problem here?
You have two rules, both having their own GraalJS script engine. Script A puts a JS object into the shared cache and does some heavy calculations. At the same time, script B gets that JS object from the shared cache.
The problem here is not that the cache itself does not synchronize its operations, but that script B accesses an object that is owned by script A while script A is currently running, and this is an illegal access to script A's context from script B's thread. This problem can be fixed by having script B wait with accessing the object until script A is not running anymore, but script B has to use the lock of the engine of script A for that.
IMO that's wrong usage of the cache. You should store unmodifiable objects there, and instead of modifying an existing object create a modified copy and put that to the cache.
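The copy-on-write usage described here could look roughly like this in a rule (a sketch; the `Map`-backed `cache` object is a stand-in so the snippet runs outside openHAB, where you would use the real `cache.shared` instead):

```javascript
// Stand-in for openHAB's cache.shared so this snippet is self-contained.
const cache = {
  shared: {
    _map: new Map(),
    get(key) { const v = this._map.get(key); return v === undefined ? null : v; },
    put(key, value) { this._map.set(key, value); },
  },
};

// Store an unmodifiable object; never mutate it in place.
cache.shared.put('config', Object.freeze({ volume: 30 }));

// To "change" it, create a modified copy and replace the cached value.
const current = cache.shared.get('config');
const updated = Object.freeze({ ...current, volume: 40 });
cache.shared.put('config', updated);
```

Freezing makes accidental in-place mutation a no-op (or an error in strict mode), which surfaces wrong usage early.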
Indeed: I can get a reference to a single JS object instance in two rules through the shared cache. These rules can run in separate threads and JavaScript engines are designed to run single-threaded. Using the cache to access properties of a single object shared between two rules at the same time appears to result in exceptions.
The example code I created writes to the object but the problem should exist if it reads properties as well. I will run a quick test to make sure.
I've updated the repro code to only read a property and not modify the object:
```javascript
step = "try {...}";
try {
  step = "cache.shared.get(...)";
  let object = cache.shared.get('object');
  if (object === null) {
    object = { property: 1 };
    step = "cache.shared.put(...)";
    cache.shared.put('object', object);
  }
  // Repeatedly access the shared object for half a second to create a large time window
  // in which two threads are likely to attempt to access the shared object.
  let counter = 0;
  for (const end_time = new Date().valueOf() + 500; new Date().valueOf() < end_time;) {
    step = "read object.property";
    counter += object.property;
  }
} catch (exception) {
  // Log in which step the exception happened and what event we were passed (timer or manual).
  console.log(`Error in step ${step} for event ${event}`);
  throw exception;
}
```
Here's a sample of the errors I'm getting after starting `Test1` to run every second and manually running `Test2`:
```text
2024-10-24 16:43:51.454 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(null)
2024-10-24 16:43:56.451 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(null)
2024-10-24 16:43:57.450 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(Multi threaded access requested by thread Thread[OH-rule-Test1-1,5,main] but is not allowed for language(s) js.)
2024-10-24 16:43:59.481 [INFO ] [g.openhab.automation.script.ui.Test2] - Error in step cache.shared.get(...) for event Execution triggered by manual
2024-10-24 16:43:59.488 [ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'Test2' failed: java.lang.IllegalStateException: Multi threaded access requested by thread Thread[qtp1338110219-1366,5,main] but is not allowed for language(s) js.
```
Note that sometimes I can catch an exception and sometimes I get the "Failed to execute rule..." error.
I think the issue really is that you try to use an object that itself does not support multi-thread access. It's not an issue of the cache. Try something simple: put an integer into the cache, start with 0, increase it after every read and put the new number back into the cache. This should work perfectly fine.
I think the issue really is that you try to use an object that itself does not support multi-thread access.
That's the problem with objects in Graal ... primitives should work perfectly fine. You should be able to store objects and arrays by serialising to JSON and storing the string, later parse the JSON string.
It's not an issue of the cache
:+1:
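The serialise-to-JSON suggestion could look like this (a sketch; `cache` here is a small `Map`-backed stand-in so the snippet runs outside openHAB, where you would use `cache.shared` directly):

```javascript
// Stand-in for openHAB's cache.shared, just to make the snippet runnable.
const cache = {
  shared: {
    _map: new Map(),
    get(key) { const v = this._map.get(key); return v === undefined ? null : v; },
    put(key, value) { this._map.set(key, value); },
  },
};

// Store only a primitive (a string): serialise the array before storing it.
const queue = ['doorbell.mp3', 'alarm.mp3'];
cache.shared.put('audioQueue', JSON.stringify(queue));

// Later (possibly from another rule): parse the string back into a fresh
// array. The parsed copy belongs to the reading script's own context, so
// no cross-engine object access takes place.
const restored = JSON.parse(cache.shared.get('audioQueue') ?? '[]');
```

Because `JSON.parse` builds a brand-new object in the reading engine, the frozen-thread problem cannot occur; only the string crosses between engines.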
Sure, that would work, but I'm trying to allow two rules to communicate with more than just integers. I've got a number of rules that want to play messages through audio; these store a string that refers to the mp3 file to be played in a shared array. A single rule controls the audio player and removes the top entry from the array to play, then the next, etc. I cannot implement that with a shared cache that only allows integers.
Note that my original repro is based on an example from the documentation, which uses an object with a property that serves as a counter. This suggests that this is a totally normal thing to do and should not result in exceptions. See here: https://www.openhab.org/addons/automation/jsscripting/#cache
You should be able to store objects and arrays by serialising to JSON and storing the string, later parse the JSON string.
This will result in issues. One rule will read the string, parse it to an array, add an mp3 file, stringify it and store it back. Meanwhile another rule will do the same. Both get the same initial string from the cache, add an entry, and write to the cache. The last one will overwrite the changes made by the first one.
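The lost update described here is easy to demonstrate even without threads, by interleaving two read-modify-write sequences on stale snapshots (a sketch with a `Map`-backed stand-in for `cache.shared`):

```javascript
// Stand-in for openHAB's cache.shared so the race can be shown in one script.
const cache = {
  shared: {
    _map: new Map(),
    get(key) { const v = this._map.get(key); return v === undefined ? null : v; },
    put(key, value) { this._map.set(key, value); },
  },
};
cache.shared.put('queue', JSON.stringify([]));

// Both rules read the cache before either has written back.
const snapshotA = JSON.parse(cache.shared.get('queue'));
const snapshotB = JSON.parse(cache.shared.get('queue'));

// Rule A appends its file and writes back.
snapshotA.push('doorbell.mp3');
cache.shared.put('queue', JSON.stringify(snapshotA));

// Rule B appends to its stale snapshot and overwrites rule A's update.
snapshotB.push('alarm.mp3');
cache.shared.put('queue', JSON.stringify(snapshotB));

// 'doorbell.mp3' is lost: the final queue only contains 'alarm.mp3'.
const finalQueue = JSON.parse(cache.shared.get('queue'));
```

Serialisation fixes the illegal cross-thread access, but not this classic read-modify-write race; that needs a single owner of the queue (as suggested below) or atomic updates.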
Now that I know what you want to achieve, try that out:
Have one String Item. All rules that want to queue an mp3 file should send the filename to this Item. Use this Item as the trigger for the audio player rule: when the audio player rule is triggered by that Item, add the received command to the array, which is stored internally by the audio player, either in the private cache or, if you use file-based scripting, in a global variable.
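The core of that suggestion is that only one rule ever touches the queue, so nothing needs to be shared. A sketch of the queue logic (plain JavaScript; `onCommand` stands in for the rule trigger and `play` for actual audio playback, so the snippet is runnable outside openHAB):

```javascript
// The queue is local to the single audio-player rule (private cache or a
// global variable in file-based scripting): no other rule ever touches it.
const queue = [];
let playing = false;
const played = []; // records what was "played", for illustration only

function play(file) { played.push(file); } // stand-in for real playback

function playNext() {
  const next = queue.shift();
  if (next === undefined) {
    playing = false;
    return;
  }
  playing = true;
  play(next);
  // In a real rule, a timer would call playNext() once playback has finished.
  playNext();
}

// Stand-in for the rule body triggered by a command to the String Item.
function onCommand(filename) {
  queue.push(filename);
  if (!playing) playNext();
}
```

Other rules never see the queue; they only send commands to the Item, so there is no multi-threaded access to a shared JavaScript object.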
I had that setup initially but dropped it for some reason - I don't recall what it was. I can try it again and find out why that didn't work for me.
But if the shared cache should never contain anything other than primitives, that should be documented and it should throw an error when you try to do so IMHO.
But if the shared cache should never contain anything other than primitives, that should be documented and it should throw an error when you try to do so IMHO.
Sure - the docs are only that way because no one noticed this problem before. I will take care of documentation and to log a warning when putting an object into shared cache.
Have one String Item. All rules that want to queue an mp3 file should send the filename to this Item...
This is almost exactly the use case I had when I created the Separation of Behavior and Gatekeeper design patterns and the Gatekeeper class in openhab_rules_tools.
In either case you'd only need the private cache, not the shared cache, so there won't be any multithreaded access problems. However, note that you will want to use a String Item for this and not call the rule from another rule, because the latter approach is not locked, so if two rules attempt to call the same rule at the same time the multithreaded exception will occur.
I still think having locks on the Objects in the shared cache is a good idea, because even if one used immutable Objects, the problem @SkyLined points out could still occur: two rules (regardless of the language) start at nearly the same time, each grabs the value from the cache at nearly the same time, one modifies it and puts the modifications back into the cache. The other one modifies its now stale copy of the Object and puts that back into the cache, wiping out the changes from the first.
This remains a potential problem regardless of the rules language.
However, note that you will want to use a String Item for this and not call the rule from another rule, because the latter approach is not locked, so if two rules attempt to call the same rule at the same time the multithreaded exception will occur.
I think that there won't be a multithreaded exception. I just tried the following: have a rule A with a JS script action that sleeps for 10 seconds to ensure the rule runs long enough. Have a rule B that attempts to run rule A and pass some information. Running B while A is running does not lead to a multithreaded exception but instead to:
```text
17:38:56.055 [ERROR] [re.automation.internal.RuleEngineImpl] - Failed to execute rule 'A' with status 'RUNNING'
```
I routinely see Multi-threaded exceptions in my config. Something is not being properly locked in the combination of rules running which create timers which call another rule. Assuming that rule locking and timer locking are working all that's left is two rules calling the same other rule at the same time.
But I don't think that test case actually tests what I'm talking about. My internet is down right now so I can't test this just yet, but I'd test it as follows:
Create a script or rule A with a JS Action and make sure that the script also has a variable or two. Put the sleep at the end to make sure it runs long enough, though I don't know that the sleep is necessary. Then create rules B and C with a cron trigger and a JS action that calls rule A at the same time with passed values, at least one of which is an Object (I don't know if that matters or not; primitives may also cause the error).
You need three rules to test the behavior I'm talking about, because I don't think it has to do with the fact that rule A is already running as much as with rules B and C trying to call it at the same time. IIRC you've said that `rules.runRule` bypasses the normal rule locking, so this is what I've figured is the root cause.
Or maybe some recent changes have changed this behavior and the engine now generates a new error? I'm on milestone releases. I wish I could get to my logs right now.
I'll try to remember to come back once I have internet again.
I think I remember why I stopped using a single string Item to queue files and a rule that triggers when it's set: The rule needed a way to play files one after the other when two rules set the Item around the same time. For that I created a queue in a local variable, which did not cause problems, and started a timer to play the next file in the queue when the previous one was done. I experienced a lot of exceptions with these timers, which would break this setup. I attempted to work around that by creating a rule that runs every second in place of the timer, and use the shared cache to allow both rules to access the queue.
I'll revert back to using timers and see if I can pinpoint all multi-threaded access issues I get with them. Fixing that would make life a lot easier for me, and I expect everyone that uses them.
@rkoshak
With the introduction of script pre-compilation, a regression wrt multi-thread access synchronization was introduced for UI-based scripts, which was fixed just yesterday: https://github.com/openhab/openhab-addons/pull/17510. This PR also fixes a multi-thread access synchronization issue with UI-based scripts that occurred when a timer was waiting for a script execution to finish to take over the context. The cause of this issue is unknown: the debugger proved the lock was properly acquired and unlocked and during debugging the issue was gone. I suspect that there is some timing issue in the milliseconds inside the JVM, and I have been able to reproducibly fix it on my dev and prod systems with https://github.com/openhab/openhab-addons/pull/17510. If you update to the latest milestone, you can use the script from that PR's description to verify the fix on your machine.
@SkyLined
If you have used UI-based rules for your first attempt, the above could apply to your problems. If you want to be sure that there are no multi-threading issues, use file-based rules or at least use a file-based rule for the audio player - my production system uses a huge number of timers and I have never experienced any multi-threading problem with my 100% file-based rules.
the debugger proved the lock was properly acquired and unlocked and during debugging the issue was gone.
Gotta love Heisenbugs. I'll see about updating to the snapshot this weekend and see if my periodic multithreaded exceptions go away.
The rule needed a way to play files one after the other when two rules set the Item around the same time.
Again, this is exactly why I created Gatekeeper. It does the queueing and timer management.
The way I've set this up in the past was something along these lines.
I trigger the rule based on command to a String Item. The code for the Action is along these lines:
```javascript
var { Gatekeeper } = require("openhab_rules_tools");
var gk = cache.private.get("gatekeeper", () => Gatekeeper());

// Calculate how long after this command runs to wait before the next one can run
var timeout = <some calculation, maybe based on the number of characters in event.itemCommand>;

gk.addCommand(timeout, () => {
  // whatever you want to do with this command, e.g. tts
});
```
The Gatekeeper is in the private cache, and it manages a queue of commands and a Timer to run a command (a passed-in function) and wait the timeout time, minus however long it took to run the command function, before running the next command on the queue.
I don't think I ever experienced a multithreaded exception from Gatekeeper in my setup, so this approach might work around the regression that @florian-h05 just fixed too. Though it could also just be a fluke of timing.
Expected Behavior
The `cache.shared` object is documented as allowing any script to access it and "share" information. The documentation comes with an example for `cache.private` that we can adjust to use `cache.shared`. We would expect such adjusted code to work when used by two rules: every time either of the rules fires, the code is executed, the counter is increased and the old value is logged.
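A counter example adjusted to `cache.shared` might look roughly like this (a hypothetical sketch, not the exact snippet from the docs; the `Map`-backed `cache` object is a stand-in so it runs outside openHAB, where the real `cache.shared` would be used):

```javascript
// Hypothetical sketch of a counter shared via cache.shared. The cache
// object below is a stand-in so the snippet is self-contained; inside an
// openHAB rule you would use the real cache.shared instead.
const cache = {
  shared: {
    _map: new Map(),
    // Mirrors the openHAB cache API: an optional supplier provides the
    // initial value when the key is absent.
    get(key, supplier) {
      if (!this._map.has(key) && supplier !== undefined) {
        this._map.set(key, supplier());
      }
      const v = this._map.get(key);
      return v === undefined ? null : v;
    },
  },
};

// Each run (from any rule) retrieves the shared counter object, logs the
// old value and increments it.
const counter = cache.shared.get('counter', () => ({ times: 0 }));
console.log('Count', counter.times);
counter.times += 1;
```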
Current Behavior
`cache.shared` allows sharing JavaScript objects between two rules. JavaScript code for two different rules will get executed in two different threads. The JavaScript engine used by openHAB is not designed to be run in multiple threads simultaneously. The Java code is aware of this and will throw a `java.lang.IllegalStateException` exception when two threads attempt to access the same JavaScript object. The message associated with this exception is `Multi threaded access requested by thread Thread[...] but is not allowed for language(s) js.` This exception unexpectedly stops the execution of the rule using `cache.shared` and breaks its functionality.
Possible Solution
Limit the use of `cache.shared` to share information between multiple rules. Obviously, limiting the amount of sharing seriously reduces the usefulness of `cache.shared`, so this is more of a stop-gap than a solution IMHO.
Steps to Reproduce (for Bugs)
This issue is quite easy to reproduce if we create two JavaScript rules that access a shared object over a prolonged period and execute them at nearly the same time. In this set-up, the JavaScript code is almost guaranteed to simultaneously attempt to access the object, which triggers the exception. Here's a step-by-step guide to do this:
1. Create two rules, `Test1` and `Test2`.
2. Add a trigger to `Test1` that fires every second (cron: `* * * * * ? *`).
3. `Test2` does not need a trigger: you will run it manually later.
4. Set both `Test1` and `Test2` to execute the same JavaScript code (found below).
5. Enable `Test1` (the cron rule).
6. Manually run `Test2`.
Here is the code used in both rules:
Here is an example log output showing the exception after starting the first rule and manually running the second:
The exception can happen in either `Test1` or `Test2` depending on timing.
Context
I am creating a queue for audio files to be played. The queue is stored in an `Array` in `cache.shared`. Any rule can use `Array.push()` to add a file to the queue. A dedicated rule uses a timer to repeatedly check if it is time to play the next file, and uses `Array.shift()` to get the next file from the queue and send it to the player. This system breaks when two rules try to add or remove files to/from the queue at the same time. These exceptions make the system unreliable.
Your Environment