Closed tommyjcarpenter closed 8 months ago
@tommyjcarpenter so it looks like the operator doesn't have a resolver set - can you nsc describe operator
And also attach your resolver configuration in the server.
@aricart what are you looking for in the describe operator
? im trying to not post company stuff in this public thread as much as possible
here are the fields:
here is the redacted server config:
Data
====
nats.conf:
----
# NATS Clients Port
port: 4222
# PID file shared with configuration reloader.
pid_file: "/var/run/nats/nats.pid"
###############
# #
# Monitoring #
# #
###############
http: 8222
server_name: $SERVER_NAME
#####################
# #
# TLS Configuration #
# #
#####################
tls {
cert_file: /etc/nats-certs/clients/nats-client-tls/tls.crt
key_file: /etc/nats-certs/clients/nats-client-tls/tls.key
timeout: 1
}
###################################
# #
# NATS JetStream #
# #
###################################
jetstream {
max_mem: 1Gi
store_dir: /data
max_file:25Gi
}
###################################
# #
# NATS Full Mesh Clustering Setup #
# #
###################################
cluster {
port: 6222
name: [REDACTED]
tls {
cert_file: /etc/nats-certs/cluster/infra-nats-node/tls.crt
key_file: /etc/nats-certs/cluster/infra-nats-node/tls.key
ca_file: /etc/nats-certs/cluster/infra-nats-node/ca.crt
name: [REDACTED]
}
reject_unknown_cluster: true
tls {
cert_file: /etc/nats-certs/gateways/infra-nats-gateway/tls.crt
key_file: /etc/nats-certs/gateways/infra-nats-gateway/tls.key
ca_file: /etc/nats-certs/gateways/infra-nats-gateway/ca.crt
}
# Gateways array here
gateways: [
[REDACTED]
]
}
logtime: true
max_payload: 4MB
lame_duck_grace_period: 10s
lame_duck_duration: 30s
##################
# #
# Authorization #
# #
##################
operator: REDACTED
system_account: REDACTED
resolver: {
type: full
dir: "/accounts/jwt"
allow_delete: true
interval: "2m"
}
resolver_preload: REDACTED
BinaryData
====
E
I wanted your operator to:
Like this:
+----------------------------------------------------------------------------------------+
| Operator Details |
+-----------------------+----------------------------------------------------------------+
| Name | XXX |
| Operator ID | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G |
| Issuer ID | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G |
| Issued | 2024-03-05 19:05:26 UTC |
| Expires | |
| Account JWT Server | nats://localhost:4222 |
| Operator Service URLs | nats://localhost:4222 |
| System Account | ADDGU3EFAQPX3YHAXXJEHHU36FX3VCDTWRYODF5Q52GIT45BZFZW7ZF5 / SYS |
| Require Signing Keys | false |
Your server config looks OK
Just for a bit of sanity:
# Add the operator
/t/s [1]$ nsc add operator O
[ OK ] generated and stored operator key "OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5"
[ OK ] added operator "O"
[ OK ] When running your own nats-server, make sure they run at least version 2.2.0
# create a sys account
/t/s $ nsc add account SYS
[ OK ] generated and stored account key "AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO"
[ OK ] added account "SYS"
# set the sys account on the operator
/t/s $ nsc edit operator --service-url "nats://localhost:4222" --account-jwt-server-url "nats://localhost:4222" --system-account SYS
[ OK ] set account jwt server url to "nats://localhost:4222"
[ OK ] set system account "AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO"
[ OK ] added service url "nats://localhost:4222"
[ OK ] edited operator "O"
# generate the resolver config
/t/s $ nsc generate config --nats-resolver --config-file server.conf
[ OK ] wrote server configuration to `/tmp/s/server.conf`
Success!! - generated `/tmp/s/server.conf`
# start the NATS server
/t/s $ nats-server -c server.conf &
[26391] 2024/03/05 15:17:40.669196 [INF] Starting nats-server
[26391] 2024/03/05 15:17:40.669284 [INF] Version: 2.11.0-dev
[26391] 2024/03/05 15:17:40.669286 [INF] Git: [not set]
[26391] 2024/03/05 15:17:40.669288 [INF] Name: NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH
[26391] 2024/03/05 15:17:40.669291 [INF] ID: NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH
[26391] 2024/03/05 15:17:40.669298 [INF] Using configuration file: server.conf
[26391] 2024/03/05 15:17:40.669300 [INF] Trusted Operators
[26391] 2024/03/05 15:17:40.669302 [INF] System : ""
[26391] 2024/03/05 15:17:40.669304 [INF] Operator: "O"
[26391] 2024/03/05 15:17:40.669306 [INF] Issued : 2024-03-05 15:17:16 -0400 AST
[26391] 2024/03/05 15:17:40.669322 [INF] Expires : Never
[26391] 2024/03/05 15:17:40.669698 [INF] Managing all jwt in exclusive directory /tmp/s/jwt
[26391] 2024/03/05 15:17:40.669909 [INF] Listening for client connections on 0.0.0.0:4222
[26391] 2024/03/05 15:17:40.670176 [INF] Server is ready
# Add an account
/t/s $ nsc add account A
[ OK ] generated and stored account key "ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I"
[ OK ] added account "A"
# Push the account
/t/s $ nsc push
[ OK ] push to nats-server "nats://localhost:4222" using system account "SYS":
[ OK ] push A to nats-server with nats account resolver:
[ OK ] pushed "A" to nats-server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH: jwt updated
[ OK ] pushed to a total of 1 nats-server
# Delete the account
/t/s $ nsc delete account A
[ OK ] expired account "A"
[ OK ] deleted account
[ OK ] deleted account directory
# Push with --prune
/t/s $ nsc push --prune
[26391] 2024/03/05 15:17:59.618908 [ERR] delete accounts request by OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 failed - delete must be enabled in server config
[ERR ] prune nats-server with nats account resolver:
[ OK ] list 2 accounts from nats-server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH:
[ OK ] account AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO named SYS exists
[ OK ] account ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I only exists in server
[ OK ] listed accounts from a total of 1 nats-server
[ERR ] server NAIPUDMVSKC7PTW45E4WT74CC6XAOAWAVVUZKXTCXDWEX6BLP4EVHMDH responded with error: delete accounts request by OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 failed - delete must be enabled in server config
[ERR ] Fewer server responded to 'prune' (0) than to 'list' (1). Accounts may not be completely pruned.
Error: all jobs failed
# Fix the server config to have `allow_delete: true` and restart the server
/t/s [1]$ vim server.conf
/t/s $ killall nats-server
[26391] 2024/03/05 15:18:30.336226 [INF] Initiating Shutdown...
[26391] 2024/03/05 15:18:30.336315 [INF] Server Exiting..
fish: Job 1, 'nats-server -c server.conf &' has ended
/t/s $ nats-server -c server.conf &
[26422] 2024/03/05 15:18:36.367349 [INF] Starting nats-server
[26422] 2024/03/05 15:18:36.367454 [INF] Version: 2.11.0-dev
[26422] 2024/03/05 15:18:36.367457 [INF] Git: [not set]
[26422] 2024/03/05 15:18:36.367458 [INF] Name: NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH
[26422] 2024/03/05 15:18:36.367462 [INF] ID: NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH
[26422] 2024/03/05 15:18:36.367470 [INF] Using configuration file: server.conf
[26422] 2024/03/05 15:18:36.367472 [INF] Trusted Operators
[26422] 2024/03/05 15:18:36.367474 [INF] System : ""
[26422] 2024/03/05 15:18:36.367477 [INF] Operator: "O"
[26422] 2024/03/05 15:18:36.367479 [INF] Issued : 2024-03-05 15:17:16 -0400 AST
[26422] 2024/03/05 15:18:36.367496 [INF] Expires : Never
[26422] 2024/03/05 15:18:36.367885 [INF] Managing all jwt in exclusive directory /tmp/s/jwt
[26422] 2024/03/05 15:18:36.368091 [INF] Listening for client connections on 0.0.0.0:4222
[26422] 2024/03/05 15:18:36.368279 [INF] Server is ready
# Try the prune again
/t/s $ nsc push --prune
[ OK ] prune nats-server with nats account resolver:
[ OK ] list 2 accounts from nats-server NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH:
[ OK ] account AAPMM4G4CYMICDMR5KV2R35KZUPFQ5U6Q2HRLK6J5OP5KBGMYSEWXIIO named SYS exists
[ OK ] account ABGDFGOB2GCGCM6BJVEOAIVMLX4YNHHG2WMPUAV4KTOEQKTBPBRVP36I only exists in server
[ OK ] listed accounts from a total of 1 nats-server
[ OK ] pruned nats-server NDNVR2PCT7RXAWMIQIYCHBVZFQARLPIDTA2I2P2MLTAUVEBU3OXK2YYH: deleted 1 accounts
/t/s $
Not sure what your commands were doing, (the --force
flag is suspect)
So the stack traces you list refers to:
okp, opPk, err := obtainRequestKey(ctx, report)
if err != nil {
report.AddError("Could not obtain Operator key to sign the delete request (err:%v)", err)
return
}
defer okp.Wipe()
The operator key was found or you would have gotten an error. So not quite sure what that is about.
@aricart i showed the two commands that generated stacktraces
I tried (independently):
1 nsc delete account
followed by nsc push -P
-> failed
2 nsc push -R xxxxxx
all i want to do is delete an account from the nats system - but nsc delete account
only does things locally I've gathered, because doing a pull
just puts the account back.
I think nsc
should obv give an error message to the user for whatever the underlying issue is here once we figure it out :)
The --
force was from this nsc
output:
Error: managed accounts may require the account JWT or nkeys to cancel a service, specify the --force to override
As you see in my script it does do the delete.... Can you try what I did locally.
I wanted your operator to:
- have a system account
- have an operator service URL
- have an account service URL
Like this:
+----------------------------------------------------------------------------------------+ | Operator Details | +-----------------------+----------------------------------------------------------------+ | Name | XXX | | Operator ID | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G | | Issuer ID | OCFHAU5YEMSOVK7JXUIHK2QOMCHLSER4X6NNQOF45UFVAUNA7TIBID7G | | Issued | 2024-03-05 19:05:26 UTC | | Expires | | | Account JWT Server | nats://localhost:4222 | | Operator Service URLs | nats://localhost:4222 | | System Account | ADDGU3EFAQPX3YHAXXJEHHU36FX3VCDTWRYODF5Q52GIT45BZFZW7ZF5 / SYS | | Require Signing Keys | false |
We do have the System account (as I showed) but seem to be missing the other two fields:
How would I know what to set these to?
@aricart the plot thickens
so my boss, using the exact same operator, and without the account service urls
, was able to remove an account both ways; with push -R
and delete account ..; push --prune
We are on the same binary version 2.8.5
However, he is on an Apple M1 and I am on an Apple M3... I hope it's not that, but I have had issues with architecture versions like this before, so just wanted to put that out there.
A stacktrace on a different architecture vs another user in our same company working..
Perhaps you tried this already - but I want to make sure we try a new environment.
Before anything, please backup your operator and keys:
nsc env
to show where things are stored and backup Current Store Dir
and $NKEYS_PATH
.
Next repeat my script above, but this time, I would like you to prefix all nsc invocations as:
nsc --all-dirs /tmp/test
This will create the entire tree for keys and jwts in the specified directory (outside of where you keep your actual configs and keys) ie nsc --all-dirs /tmp/test add operator O
...
After you verify the repeatability, you can set your old operator to be current by doing:
nsc env -o <your real operator name>
@aricart can you please paste the server.conf?
im following this in depth guide, which talks about copying a server.conf, but I dont see the file in the docs? https://docs.nats.io/running-a-nats-service/nats_admin/security/jwt#concepts
(I'm trying to run stuff locally now; the above thread was from our prod kubernetes deployment)
Ah - for your local test all you need to do after you create your operator/accounts etc - is nsc generate config --nats-resolver --config-file server.conf
and then edit that to enable the full resolver. Set the allow_delete
to true, the default template sets it to false and that will prevent deletes from happening - I believe that is my above script that I put above yesterday.
you can then start the server with nats-server -c server.conf
@aricart I was able to run all of your commands using the local nats-server
, so it's something to do with the remote.
@aricart alright - two more pieces of info!
I was able to do all of these steps in STAGE
. We have a staging nats supercluster. The delete worked there.
However, I can always stacktrace on PROD, here is a fresh test:
looks like its dying here: https://github.com/nats-io/nsc/blob/f85538f50bf28836af8b1790ec796957415dc197/cmd/push.go#L449
that's a pretty dense function with pointers and such, hard to tell exactly what's happening, I'm new to this codebase
OK I see some doggy code there - there's an error that is added to the report, but the process doesn't stop. Anyway I can get you to build nsc from a branch so that we can try to narrow this?
Two more questions:
Also - were you able to run two local tests with the different M3 binary that was failing - I am assuming that you did it on the stage and that worked, but on prod it failed - Also make sure that the operator keys in the production environment - the request to prune requires them.
You can verify that the keys are available by running nsc list keys
:
nsc list keys
+------------------------------------------------------------------------------------------+
| Keys |
+--------+----------------------------------------------------------+-------------+--------+
| Entity | Key | Signing Key | Stored |
+--------+----------------------------------------------------------+-------------+--------+
| O | OBPLSTGWUQH7F54UPQU7QE53F3YDWGY6OVQJ3IDBQTO6D2USHVBP6TE5 | | * |
| A | AAYIW3IRJQCEZEKHAW3SU6CZGOJGUJELDJYIJEDLQYCMU67PJ377NO7K | | * |
+--------+----------------------------------------------------------+-------------+--------+
[ T ] auth callout encryption target
The stored column should be checked.
@aricart yes if you provide a branch I can build and do like a go run nsc/
?
I was able to use the exact same M3 binary for your local tests, and against our env -O stage
operator. It's clearly something to do with the env setup for prod on my machine (my boss was able to run it), some uncaught error message.
I am betting one dollar that the operator keys are not all there, and it is failing to find one.
I found the issue - let me do the fix
@tommyjcarpenter If you want to try the branch for the fix, that would be awesome. - Note that you'll have to specify the -K
with an operator key - in your case I believe the key was not found (possibly it was ignoring other keys) - didn't see a confirmation on the list keys
I released a new nsc, you should be able to pull it with nsc update
@tommyjcarpenter thank you for reporting this - if the release doesn't fix it for you, please reopen this issue
Thanks I’ll try it out!
Is it available through homebrew? That’s how I install/update nsc
yeap - the update command will also do the right thing even if you installed with homebrew...
@aricart it actually works with the latest... I didn't get an error message about missing an operator key..! It worked.
What was the change (and should it have worked?). Thanks much for your attention
The code had a couple of edge cases. But the one that affected you was that your operator had more than one signing key but the first listed were not there. Go has a feature where if something returns an interface the value can be nil, but the interface isn't. So checks for nil fail since the interface is not nil and then usage of that will panic because the value is.
What version were you using?
nsc version 2.8.5
i dont have "nats-server" installed
What environment was the server running in?
Mac OSX 14.2.1
Is this defect reproducible?
I see an account after an
nsc pull -A
its there when i list them:
however, trying to do an
nsc delete
followed by apush -P
, OR apush -R
both stacktrace:Given the capability you are leveraging, describe your expectation?
im trying to delete a NATs account
Given the expectation, what is the defect you are observing?
stacktraces with no error messages