Open jthiels opened 6 months ago
Hey,
this issue is likely caused by sharing the refresh_token of a login among multiple SIRIUS instances.
When you log in, SIRIUS stores a so-called refresh_token on your system and keeps a so-called access_token in memory. Access tokens are short-lived and are used to authorize your queries to our application server (e.g. predicting fingerprints). Refresh tokens are long-lived and are used to request a fresh access_token from our login server. Furthermore, a refresh_token is single use: when it is used to create a new access_token, it is also replaced by a new refresh_token. This is important to prevent misuse in case a long-lived refresh_token gets stolen. If a refresh_token is used a second time, the whole "token chain" becomes invalid and the user has to log in again with username and password.
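The rotation behaviour described above can be illustrated with a small toy model. This is only a sketch under assumptions: the class `LoginServer`, its methods, and the function `simulate_shared_token` are hypothetical names invented for illustration and are not part of SIRIUS or its login server. It shows why two instances that read the same stored refresh_token end up invalidating the whole chain.

```python
import secrets


class LoginServer:
    """Toy model (hypothetical) of a server issuing single-use refresh tokens."""

    def __init__(self):
        self.valid_refresh = {}      # currently valid refresh token -> chain id
        self.used_refresh = {}       # already consumed refresh token -> chain id
        self.revoked_chains = set()  # chains invalidated after a reuse
        self._next_chain = 0

    def login(self):
        """Simulate a username/password login; returns the first refresh token."""
        chain = self._next_chain
        self._next_chain += 1
        token = secrets.token_hex(8)
        self.valid_refresh[token] = chain
        return token

    def refresh(self, refresh_token):
        """Exchange a refresh token for (access_token, new_refresh_token)."""
        if refresh_token in self.used_refresh:
            # Second use of the same token: invalidate the whole chain.
            self.revoked_chains.add(self.used_refresh[refresh_token])
            raise PermissionError("refresh token reused; token chain revoked")
        chain = self.valid_refresh.pop(refresh_token, None)
        if chain is None or chain in self.revoked_chains:
            raise PermissionError("invalid refresh token; re-login required")
        # Rotate: the old token is consumed and a new one takes its place.
        self.used_refresh[refresh_token] = chain
        new_refresh = secrets.token_hex(8)
        self.valid_refresh[new_refresh] = chain
        return secrets.token_hex(8), new_refresh


def simulate_shared_token():
    """Two instances (A and B) read the same refresh token from a shared directory."""
    server = LoginServer()
    shared = server.login()
    outcomes = []

    # Instance A refreshes first and receives a rotated refresh token.
    _, a_token = server.refresh(shared)
    outcomes.append("A ok")

    # Instance B still holds the old token and tries to use it again.
    try:
        server.refresh(shared)
        outcomes.append("B ok")
    except PermissionError:
        outcomes.append("B rejected")

    # The reuse revoked the chain, so even A's newer token no longer works.
    try:
        server.refresh(a_token)
        outcomes.append("A ok")
    except PermissionError:
        outcomes.append("A rejected")
    return outcomes


print(simulate_shared_token())
```

In this model the second instance is rejected, and afterwards the first instance is rejected as well, which matches the symptom of jobs failing after an initial batch succeeded.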
I assume that your compute nodes share the same user home directory. By default the token is stored in the SIRIUS config directory in the user home directory (e.g. /home/USERNAME/.sirius-5.8/). If multiple SIRIUS instances use the same config directory, a refresh_token can end up being used twice, and the tokens then become invalid.
You can solve this by using a separate "config directory" on each node (or, more precisely, for each running SIRIUS instance). This can be achieved via the command line parameter --workspace.
If you want to automate the login per instance without risking a leak of your credentials into console logs, you can log in via environment variables. In that case you provide the names of the environment variables in which the credentials are stored instead of the actual credentials.
E.g. sirius login --password-env MY_PW_VARIABLE --user-env MY_USER_VARIABLE
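As a rough sketch of how the two suggestions could be combined in a job script, the helper below builds a distinct directory per instance and assembles the login command as an argument list. The function names and the directory naming scheme (hostname plus process id) are hypothetical, and placing --workspace before the login subcommand is an assumption; please check `sirius --help` for the exact syntax of your version. Nothing is executed here; the list could be passed to `subprocess.run` once verified.

```python
import os
import socket
from pathlib import Path


def instance_workspace(base_dir, hostname=None, pid=None):
    """Return a directory path unique to this host and process (hypothetical scheme)."""
    hostname = hostname or socket.gethostname()
    pid = os.getpid() if pid is None else pid
    return Path(base_dir) / f"sirius-{hostname}-{pid}"


def login_command(workspace, user_env="MY_USER_VARIABLE", password_env="MY_PW_VARIABLE"):
    """Build the login command; only the variable names appear, never the credentials.

    Assumption: --workspace is accepted before the 'login' subcommand.
    """
    return [
        "sirius",
        "--workspace", str(workspace),
        "login",
        "--user-env", user_env,
        "--password-env", password_env,
    ]


ws = instance_workspace("/tmp/sirius-ws", hostname="node01", pid=42)
print(" ".join(login_command(ws)))
```

Because each instance gets its own directory, no two instances would read or overwrite the same stored refresh_token.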
Regarding the login problem, I assume that your IP or account got temporarily banned due to too many failed token requests. If the problem persists, please send me an email with the affected username (email address).
Hi, we are seeing inconsistent login issues: SIRIUS stops recognizing the login session after about 40 small-mass jobs, and the remaining jobs fail. We are submitting within the same session across different nodes. We have specified the number of cores in the SIRIUS command and also added a 'sleep' line to see whether giving the server a short break prevents possible collisions. Each job runs on 36 cores (the total number of cores on a single node).
Currently the user I'm working with cannot log in at all (after logging in successfully earlier today). The login is failing repeatedly both in the GUI and on the command line.
We were using the 5.8.5 version through a conda environment and also the 5.8.6-snapshot binary.
Is the server down or having other issues? As the login worked earlier today and some jobs ran successfully, we're not sure how to troubleshoot further.