Closed twisted-trac closed 20 years ago
@glyph commented |
---|
#!html
<pre>
+0. UTF-8 is an 8-bit encoding, which means it's not just ASCII. You
can encode NULs and so forth. Some backends expect a different
encoding and a different translation to unicode (pre-existing
authentication databases going against Oracle with JIS Japanese
encoding, off the top of my head).
It's doable, and for 99% of the cases out there it won't make a
difference to client code, but there are still other considerations.
What if the Avatar ID is some encoding of an integer, and not a string?
I'd like to avoid doing this until somebody who really knows unicode
can tell us how.
</pre>
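A small illustration of the point that UTF-8 is not just ASCII. This is only a sketch in Python 3, where `bytes` plays the role of the 8-bit `str` discussed in this ticket; the sample usernames are made up.

```python
# Python 3 sketch: bytes here corresponds to the 8-bit str of the discussion.
ascii_name = "alice".encode("utf-8")   # pure ASCII survives unchanged
accented = "zoë".encode("utf-8")       # non-ASCII becomes multi-byte
with_nul = "a\x00b".encode("utf-8")    # NUL is encoded as a literal 0x00 byte

print(ascii_name)   # b'alice'
print(accented)     # b'zo\xc3\xab'
print(with_nul)     # b'a\x00b'

# True: at least one byte falls outside the 7-bit ASCII range.
print(any(byte > 0x7F for byte in accented))
```

Any backend that assumes 7-bit clean, NUL-free usernames would need to check for this explicitly.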
@glyph commented |
---|
#!html
<pre>
> My attempt is to firm up the interface *now* and *perhaps* loosen
> it *later* if and when we have a use case.
My goal is the same. My rationale for suggesting the encoding strategy
is to say to potential users, "This is the interface: str=>str. We are
not supporting unicode, and we're doing that on purpose. Encode it
UTF-8 if you must, because that at least looks like ASCII some of the
time. If you have a better idea for how this should work, let us know,
but in the meanwhile DON'T decide to return random junk like
EncodedUsername("HELLO", "latin-1") from your requestAvatarId in order
to support internationalization: return a string or your code cannot
possibly work with other people's realms."
> For example, do you think
> we should support unicode avatar ids in files? [in which case a utf-8
> thingy might be sane]
Yes.
> Do you think we should support unicode avatar ids
> from databases? [in which case it's better to work with opaque objects
> and do no encoding/decoding at all]
Yes, but the *way* we should support unicode from databases with our
current interface would be to encode to utf-8 on one side of the
interface and decode on the other. We don't have a clear idea of what a
good opaque object would be.
> What happens when a unicode conversion
> error happens when trying to see if an avatar id belongs to a checker?
> Do we treat it as user-not-found or as
> catastrophical-bug-argh-shut-down-connection?
Well, that's up to the checker, to some extent. If implemented
properly, "user not found". If not implemented properly,
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)" or some variation on that: this will still be
handled just fine by the checker and will not yield any particularly
sensitive information to the client.
> I have no answers for any of those. In my sole experience with a
> Hebrew/Arabic/English site, all the usernames [avatar IDs in cred-speak]
> are in plain ASCII [and I'm *pretty sure* it's not a database problem
> but a decision non-technical users made] So I'm pretty sure
> unicode-in-usernames is a) not an important issue b) a world of hurt.
> Hence, I would tend to discourage writing support for it until someone
> comes with a clear use-case ["I need Japanese username support. My
> users hate UTF-8 because it was invented by white people. I have a
> colon-separated file with usernames in SHIFT-JIS and passwords in
> ASCII. How do I use cred?" is a somewhat tongue-in-cheek but not
> *entirely* unrealistic, and I'd hate to tell this guy "well, we made
> some decisions about unicode incompatible with your needs. Nobody really
> uses unicode though, so let's try breaking unicode compatibility and
> see what happens."]
SHIFT-JIS encoded text by itself can easily be brought back and forth to
unicode, no? Doesn't the problem with "white people invented it" only
arise when you *mix* encodings? e.g. some BIG5 and some JIS on the same
webpage?
In other words, just have him have his checker know that the data
storage format is in JIS, but pass the username around encoded UTF-8
when it goes to the realm. Any display software he's writing along with
this will have to store the region-encoding hint along with their avatar
so that when he adds Korean support, it will know which usernames came
from Korean-encoded Chinese vs. Japanese-encoded kanji, but the extra
encode/decode/encode step will still go through unicode on its way
through the avatar and not cause problems or lose information.
Also, if I'm wrong (having never been *directly* involved with this kind
of Asian-language insanity, I'm sure my understanding is at least
partially flawed) let's say he has to have special knowledge in his
realm of his checker. It's not the end of the world. With such an
unusual use-case, it is unlikely he will require integration with other
people's cred software, but *even if he does*, if he just has a wacky
encoding scheme, the sysadmin can just do a little work in the realm's
storage layer to make sure that it matches up with the peculiarities of
his encoding, manually running some scripts to go from SHIFT-JIS to
UTF-8 if necessary. This is, after all, what sysadmins do :).
*BUT*, this only works if we encourage some sanity, and do not say
"we don't know how unicode should work at all, just do whatever you
want". Saying that would encourage anyone with a unicode-ish use case,
even someone who knows considerably *less* about the potential problems
that entails (think newbie ex-Java programmer here), to come up with a
whole secondary framework for username encodings, along with
self-hashing subclasses of string or other insanity, rather than just
adhering to this simple convention, because it is "cleaner" not to have
to call .encode or .decode in their application logic.
Hopefully this is clear. This suggestion is intended to preserve the
existing interface in all instances where it can be preserved, and to
give some boundaries for people who *THINK* their use-case is not supported.
Another thing that is probably going to make this discussion moot is the
emergence of UID/GIDs in just about every system I've been writing that
uses cred. It seems that a very likely pattern is that every user has a
numerical ID (RDBMS primary key, UNIX uid, storq storage ID, ZODB/cog
oid) and you should not even use usernames as avatar IDs at all if you
can avoid it.
</pre>
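The convention described above (checker knows the storage encoding, the avatar ID crosses the interface as UTF-8, the realm decodes it) might look roughly like the following. This is a sketch in Python 3, not Twisted code: `STORAGE_ENCODING`, `UserNotFound`, and both function names are hypothetical, and it only shows that the sample characters round-trip, not that every Shift-JIS string does.

```python
STORAGE_ENCODING = "shift_jis"  # assumption: the checker's backing file uses Shift-JIS


class UserNotFound(Exception):
    """Hypothetical stand-in for a checker's 'no such user' failure."""


def avatar_id_from_storage(raw_username: bytes) -> bytes:
    """Checker side: turn a stored username into a UTF-8 avatar ID."""
    try:
        text = raw_username.decode(STORAGE_ENCODING)
    except UnicodeDecodeError:
        # Treat undecodable input as "user not found" rather than
        # letting the codec error escape to the caller.
        raise UserNotFound(repr(raw_username)) from None
    return text.encode("utf-8")


def username_from_avatar_id(avatar_id: bytes) -> str:
    """Realm side: decode the UTF-8 avatar ID back to text."""
    return avatar_id.decode("utf-8")


stored = "山田".encode("shift_jis")      # what the colon-separated file holds
avatar_id = avatar_id_from_storage(stored)
assert username_from_avatar_id(avatar_id) == "山田"
```

The realm never needs to know the storage encoding; only the checker does. A truncated multi-byte sequence such as `b"\x81"` ends up as `UserNotFound` instead of a `UnicodeDecodeError`.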
@glyph commented |
---|
#!html
<pre>
Okay, it's clear nobody is totally sure how to do this correctly, so I'm
removing the release1.1 tag. It's still a doc bug, and we need to find someone
who has real unicode-login-name use cases and be sure that the solution I've
outlined below works. But I can't see why it wouldn't.
</pre>
@itamarst commented |
---|
#!html
<pre>
To clarify - credential checkers generate an avatar id, which
is passed to the realm. We need to make sure realms work with
*all* credential checkers: if most checkers generate
strings and the realm assumes this, and then an admin changes
to a checker that produces unicode, the realm will *break*.
So, at the minimum we need to require realms to support
unicode, which we do not do at the moment.
</pre>
@itamarst commented |
---|
#!html
<pre>
If we document that avatar ids can be both unicode and
8-bit, we should be ok, since all realms will be able to
deal with both by downgrading or upgrading, depending on how
their storage works. So if we ever decide to restrict it to
unicode only it will still work.
</pre>
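One way a realm could "upgrade" or "downgrade" whatever avatar ID it receives, depending on its storage. This is a hedged Python 3 sketch (`bytes` standing in for the 8-bit string, `str` for unicode); the helper names and the UTF-8 default are assumptions, not part of cred.

```python
from typing import Union

AvatarId = Union[bytes, str]


def upgrade(avatar_id: AvatarId, encoding: str = "utf-8") -> str:
    """Return the avatar ID as text, decoding 8-bit input if needed."""
    if isinstance(avatar_id, bytes):
        return avatar_id.decode(encoding)
    return avatar_id


def downgrade(avatar_id: AvatarId, encoding: str = "utf-8") -> bytes:
    """Return the avatar ID as 8-bit data, encoding text input if needed."""
    if isinstance(avatar_id, str):
        return avatar_id.encode(encoding)
    return avatar_id


# A realm backed by a text-oriented store would upgrade on the way in;
# one backed by a byte-oriented store would downgrade instead.
assert upgrade(b"zo\xc3\xab") == upgrade("zoë") == "zoë"
assert downgrade("zoë") == downgrade(b"zo\xc3\xab") == b"zo\xc3\xab"
```

Both helpers raise a codec error on input that does not fit the chosen encoding, so a real realm would still have to decide how to report that.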
@glyph commented |
---|
#!html
<pre>
Ahem.
"It is not intended to be used for to prepare identities
which are not simple user names (e.g., distinguished names
and domain names). Nor is the profile intended to be used
for simple user names which require different handling.
Protocols (or applications of those protocols) which have
application-specific identity forms and/or comparison
algorithms should use mechanisms specifically designed for
these forms and algorithms."
I don't understand what that spec is trying to say.
How about this - for comparison and such, we will always
call 'credStringValue.decode("utf-8")'. This will disallow
non-ASCII characters in non-unicode strings, but will still
allow unicode strings.
</pre>
@moshez commented |
---|
#!html
<pre>
Should we just document that "currently, unicode IDs are not supported -- if
you have a use case, please explain it in a bug report" and close this? I'm
loath to add any more code, or *EVEN DOCUMENT AN APPROACH* if we have nfi
what we are talking about. I'd feel much safer supporting stuff with a use
case in mind [even if the support comes to documenting stuff].
</pre>
@moshez commented |
---|
#!html
<pre>
Here's a patch to document that unicode strings are not supported:
Index: doc/howto/cred.xhtml
===================================================================
RCS file: /cvs/Twisted/doc/howto/cred.xhtml,v
retrieving revision 1.5
diff -u -r1.5 cred.xhtml
--- doc/howto/cred.xhtml 17 Oct 2003 04:46:19 -0000 1.5
+++ doc/howto/cred.xhtml 19 Oct 2003 12:21:43 -0000
@@ -128,6 +128,12 @@
<p>This method will typically be called from 'Portal.login'. The avatarId
is the one returned by a CredentialChecker.</p>
+<div class="note">
+Avatars, currently, can only be strings. Passing unicode strings around,
+in particular, is <em>not</em> supported by the infrastructure. If you
+find a need for unicode usernames, please file a bug with your specific
+use-case.</div>
+
<p>The important thing to realize about this method is that if it is being
called, <em>the user has already authenticated</em>. Therefore, if possible,
the Realm should create a new user if one does not already exist
</pre>
@glyph commented |
---|
#!html
<pre>
It's not that I have no idea whatsoever; I just don't know that this is a
panacea. Python's encoding support is very well done, so it's not like
we're designing from scratch either.
Considering that this approach will continue to work even if we firm up
the spec so that it's no longer really necessary, I'd still like to
suggest it, rather than having folks who *really* have NFI what they're
talking about come up with some cockeyed idea where they just have
magical checkers that emit some other random instance object from
requestAvatarId rather than actually using this "workaround" for
conforming to the interface.
</pre>
@moshez commented |
---|
#!html
<pre>
I didn't understand a word you said.
Please attempt to be clearer.
My attempt is to firm up the interface *now* and *perhaps* loosen
it *later* if and when we have a use case. For example, do you think
we should support unicode avatar ids in files? [in which case a utf-8
thingy might be sane] Do you think we should support unicode avatar ids
from databases? [in which case it's better to work with opaque objects
and do no encoding/decoding at all] What happens when a unicode conversion
error happens when trying to see if an avatar id belongs to a checker?
Do we treat it as user-not-found or as
catastrophical-bug-argh-shut-down-connection?
I have no answers for any of those. In my sole experience with a
Hebrew/Arabic/English site, all the usernames [avatar IDs in cred-speak]
are in plain ASCII [and I'm *pretty sure* it's not a database problem
but a decision non-technical users made] So I'm pretty sure
unicode-in-usernames is a) not an important issue b) a world of hurt.
Hence, I would tend to discourage writing support for it until someone
comes with a clear use-case ["I need Japanese username support. My
users hate UTF-8 because it was invented by white people. I have a
colon-separated file with usernames in SHIFT-JIS and passwords in
ASCII. How do I use cred?" is a somewhat tongue-in-cheek but not
*entirely* unrealistic, and I'd hate to tell this guy "well, we made
some decisions about unicode incompatible with your needs. Nobody really
uses unicode though, so let's try breaking unicode compatibility and
see what happens."]
</pre>
@moshez commented |
---|
#!html
<pre>
<moshez> glyph: transporting SHIFT_JIS correctly across unicode is a fairly
non-trivial task
<glyph> moshez: craptastic
<glyph> moshez: python's JIS encodings won't do it for you?
<moshez> that's why I chose SHIFT-JIS
<moshez> glyph: my understanding is that JIS->Unicode is a political issue
rife with difficulties centering around the difference between lots
of subtle concepts I've no idea about like the difference between a
character and a code point
<glyph> moshez: well, my point is, there is *SOME* way to encode what you want
as a string
<glyph> moshez: so the *convention* should be UTF-8
<glyph> if you can't do UTF-8, well, that sucks, but it's just a convention
anyway
<glyph> moshez: okay, but are we in agreement?
<moshez> glyph: well, I still dislike recommending a work-around [use utf-8]
without a clear view of the implication
<glyph> moshez: I think we've demonstrated that we have a clear view of 90%
of the implications
<moshez> glyph: I prefer "bug us with a use case"
<glyph> moshez: they won't
<moshez> glyph: so do you want to formulate a new note, and check it in?
<glyph> moshez: OK. I'll add something to the documentation tonight.
</pre>
@moshez commented |
---|
#!html
<pre>
Adding proposed formulation and marking patch.
Index: doc/howto/cred.xhtml
===================================================================
RCS file: /cvs/Twisted/doc/howto/cred.xhtml,v
retrieving revision 1.5
diff -u -r1.5 cred.xhtml
--- doc/howto/cred.xhtml 17 Oct 2003 04:46:19 -0000 1.5
+++ doc/howto/cred.xhtml 20 Oct 2003 16:42:39 -0000
@@ -128,6 +128,12 @@
<p>This method will typically be called from 'Portal.login'. The avatarId
is the one returned by a CredentialChecker.</p>
+<div class="note">
+Note that <code>avatarId</code> must always be a string. In particular,
+do not use unicode strings. If internationalized support is needed,
+it is recommended to use UTF-8, and take care of decoding in the realm.
+</div>
+
<p>The important thing to realize about this method is that if it is being
called, <em>the user has already authenticated</em>. Therefore, if possible,
the Realm should create a new user if one does not already exist
</pre>
Searchable metadata
``` trac-id__63 63 type__defect defect reporter__itamarst itamarst priority__high high milestone__ branch__ branch_author__ status__closed closed resolution__fixed fixed component__conch conch keywords__ time__1058173263000000 1058173263000000 changetime__1067083024000000 1067083024000000 version__ owner__ cc__glyph cc__radix cc__itamarst ```