Spawning fails for user names starting with non-alphanumeric chars

yuvipanda commented 3 years ago

Bug description

When a user with a username that starts with a non alphanumeric character (like '-' or '_') tries to spawn, it fails with:

{
    "kind":"Status",
    "apiVersion":"v1",
    "metadata":{

    },
    "status":"Failure",
    "message":"Pod \"jupyter--5fusername\" is invalid: metadata.labels: Invalid value: \"-5fusername\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')",
    "reason":"Invalid",
    "details":{
        "name":"jupyter--5fusername",
        "kind":"Pod",
        "causes":[
            {
                "reason":"FieldValueInvalid",
                "message":"Invalid value: \"-5fusername\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')",
                "field":"metadata.labels"
            }
        ]
    },
    "code":422
}

Expected behaviour

A pod starts for the user

Actual behaviour

A user server spawns a pod with a label for the user's name. Label values can't start with non-alphanumeric chars, so this fails.

How to reproduce

Setup simple kubespawner instance and try logging in with a username starting with '_'

yuvipanda commented 3 years ago

I've dealt with this temporarily by removing the username label. I see the three options as:

Remove label completely
Add a prefix (like we do for pod name)
Make the label opt-in

We add a username annotation, which doesn't have this restriction.

Most users / admins won't have a strong need to target by username label, so my preference is to do (3)

meeseeksmachine commented 3 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/advanced-z2jh-deeply-customizing-the-spawner/8432/1

consideRatio commented 3 years ago

I lean in favor of option 3 as well.

djangoliv commented 2 years ago

The same problem happen with @ in username. example with a username like: user@factory.org

yuvipanda commented 1 year ago

@djangoliv this shouldn't be happening for labels starting with @ as they should be escaped by escapism. Are you still seeing this issue?

minrk commented 1 year ago

I agree that annotations make more sense. What's the case for usernames in labels? Metrics, mostly (e.g. aggregating cpu usage by user)?

yuvipanda commented 1 year ago

@minrk yep, metrics.

minrk commented 1 year ago

FWIW, I was just today working on solving the ~same problem for user subdomains in JupyterHub in https://github.com/jupyterhub/jupyterhub/pull/4471 and have a scheme that I think always produces a unique safe string (DNS label in my case, similar rules here) from any arbitrary string:

implement trim-and-hash scheme that always produces a unique, safe string, according to rules (I chose u-{filter(lowercase_chars+digits)}--{hash})
check for escape signifier (-- in my case), fallback on trim-and-hash if found to avoid collisions with escaped results
if string is valid, use it as-is
(optional) try a less-disruptive escape (IDNA in my case); if valid, use it
fallback on trim-and-hash

This satisfies my goals:

always produces a valid result
uses the 'nice' value where it works
guarantees no collisions between 'already-escaped' nice values and escape results (bonus: without introducing the -2d problem, necessary for escapism itself to avoid collisions)

Of course, you can just always use the trim-and-hash scheme, if you don't mind the hashes on the end of the labels (k8s users are used to meaningless junk on the ends of most strings). Then you don't need the multi-stage escape. But if you like having the simple strings in the 90% of cases where they are valid, then this approach ought to work, and be more robust than using escapism which doesn't address things like start/end character or length requirements.

minrk commented 1 year ago

btw, to be explicit - I support option 3 as well. The only issue is that users are still affected by this bug, it's just a subset of users who opt-in to the known-to-work-only-sometimes feature. So the bug doesn't go away, it just comes up less often.

consideRatio commented 1 year ago

Note that this is the challenge, our escaping isn't enough because it doesn't give special treatment to the criteria of not starting/ending with a alphanum.

apiVersion: v1
kind: ConfigMap
metadata:
  name: label-demo
  labels:
    environment: -production
data:
  test: hi

kubectl apply -f test.yaml

The ConfigMap "label-demo" is invalid: metadata.labels: Invalid value: "-production": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')

minrk commented 1 year ago

Right. All the rules must be taken into account, including start, end, and length, not just body characters. The trouble with escaping alone is that it doesn't handle length or boundary conditions. And the challenge with multiple schemes is avoiding collisions.

We can do roughly:

def best_safe_slug(name):
    if could_collide(name):
        # if the input matches a pattern that could be produced by either scheme, make sure it gets escaped again
        return always_safe(name)
    nice_name = nice_scheme(name) # guaranteed unique, not guaranteed to be valid, e.g. the name itself
    if is_valid(nice_name):
        return nice_name
    return always_safe(name)

This scheme can be used to produce all of our restricted strings (resource names, labels, etc.) with different functions for is_valid, even after templating, and guarantee we always produce a valid, unique value. It's mainly a question of when things like truncate & hash are used (i.e. always or only when needed), because we can't avoid it if we want to accept every valid input.

jupyterhub / kubespawner