purescript / registry-dev

Development work related to the PureScript Registry
https://github.com/purescript/registry
97 stars 80 forks source link

Clarify “letters” in spec #682

Open toastal opened 10 months ago

toastal commented 10 months ago

Must contain only letters and digits, optionally separated by hyphens Must begin with a letter or digit

Are uppercase letters allowed? Is lower é a letter? Is lower α? Is ? Is ? Is 🍉?

acceptedChars = Parsing.Combinators.choice [ Parsing.String.Basic.lower, Parsing.String.Basic.digit ] <|> Parsing.fail charErr

https://pursuit.purescript.org/packages/purescript-parsing/10.2.0/docs/Parsing.String.Basic#v:lower

I’m assuming using lower allows anything from a bicameral script (just like how PureScript identifiers work). Asking because I got really frustrated by OCaml+OPAM’s limitations to ASCII only for packages/identifiers/modules, & this spec isn’t clear on what ’letter’ means.

thomashoneyman commented 10 months ago

cc: @f-f on this, as I don't personally have an opinion. As far as I know the intention is to accept lower-cased Unicode characters in UTF-8, minus some special characters, but perhaps it was meant to be ASCII all along.

thomashoneyman commented 10 months ago

There are some explicit test cases here:

https://github.com/purescript/registry-dev/blob/4cd57a13430c766a94c6642da9b557a431b81724/lib/test/Registry/PackageName.purs#L66-L84

toastal commented 10 months ago

I hope there isn’t like a filesystem limitation. It would be awesome if the support was there so you could have résumé-builder or β-reduction.

toastal commented 10 months ago

Regardless, I think there is room to spell out the English in the spec a bit more clearly. …And add some clarifying test cases like hammer-of-þórr.

toastal commented 10 months ago

Actually I think there would be something important to bring up: Unicode confusion. This is Latin a, this is Cyrillic а. These can be used in certain attack vectors that haven’t yet been looked at.

If I were to suggest a solution I would look at the (now defunct sad) ocaml-m17n.

Can't look-alike characters like a and а be confusing?

They can.

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色.

Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

Which in this case would mean each kebab-separated ‘word’ can only be of one script type--which is now adding quite a bit of complexity with the trade-off being safety.

JordanMartinez commented 10 months ago

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色. Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

So, the proposed goal here would be to

  1. allow users to use non-ascii characters in a package name
  2. prevent users from installing a wrong/malicious package due to using lookalike characters

I'd also guess that another goal is to make it possible to type such a package name for all users. For example, a user with a Latin keyboard may not know how to type a CJK character whereas the reverse is often true

There also seems to be a spectrum for how to deal with this issue:

  1. make this impossible via a whitelist: prevent users from using non-ascii characters
  2. make this impossible via a blacklist: ban lookalike characters from being used, such that only one of such character sets is ever permitted (e.g. Latin a is used and all other as like Cyrillic a is banned)
  3. make this unlikely by warning the user: if multiple scripts are detected, notify the user and force the user to accept accept it before continuing
  4. make this impossible by forcing the user to opt-in to some additional non-textual identifier. For example, multiple script usage is allowed but such usages must be prefixed by ascii digits that correspond to a hash of the name. Thus, you could have resume-builder and 13452#résumé-builder. The usage of 13452 would clearly indicate whether or not the package is the intended one or not.
f-f commented 10 months ago

There are several barrels of worms that would be cracked open by allowing the whole of unicode in these names - no one can type them directly, a series of attack vectors, etc.

The original intent was to only allow ASCII, and I strongly feel we should stick to that since I don't believe the upside of more varied package names is worth the hassle here.

toastal commented 10 months ago

@f-f @JordanMartinez I’m happy y’all got back since it does seem this is a can of worms indeed & should be clearer in the spec docs.

Personally, I like the idea of losing American-centrism (the ‘A’ in ASCII)--even if it still limited to bicameral scripts. Deburring résumé-builder to resume-bulder kinda sucks & leaves non-English speakers having to make compromises on package names, which greatly favors English package names since English, the language, likes to pretend all the accents don’t matter since our vowels are an absolute mess with or without accents. Has anyone tested if you can currently push multiple languages? If so, does there need to be a bigger issue raised?

That said, I can understand favoring the simplicity (at least for now). In the case of OPAM, apparently can only upload packages to the registry named in ASCII, however, this shouldn’t be a limitation on OCaml package names that don’t plan to be added to the main public registry (tho in practice dune needs it to be a valid module name, & OCaml doesn’t non-ASCII module names). I would like clarification that I shouldn’t have an issue naming a project pokémon-center or déjà-vu-café or prelüde even if it were not to be uploaded? Also if a user/org ran a private registry could they hypothetically relax those rules (say they are based in Ukraine & prefer Cyrillic package names as their PureScript identifiers are currently supported in Cyrillic)?

Which goes out even further on a tangent: is PureScript Registry project even capable of allowing private registries? Are there mirrors for the public registry for resilience like in the Perl community? (I think about this a lot after seeing various GitHub & SourceHut outages).

f-f commented 10 months ago

leaves non-English speakers having to make compromises on package names

I'm not a native English speaker - Italian living in Finland, so I get to use all the various äåöèéàòùì on a regular basis - and no, not having characters outside of ASCII available for package names does not really bother me, which makes me think this is not such a universal issue. FWIW the PureScript compiler does not accept non-ASCII characters in module names, so that'd be the place to look at first.

I would like clarification that I shouldn’t have an issue naming a project pokémon-center or déjà-vu-café or prelüde even if it were not to be uploaded?

Spago uses the same package-name parsing code as the Registry, so you wouldn't be able to build your project with it. Other things might work I guess?

is PureScript Registry project even capable of allowing private registries? Are there mirrors for the public registry for resilience like in the Perl community? (I think about this a lot after seeing various GitHub & SourceHut outages).

Private registries and mirrors are not the same thing - Spago allows for private package sets because everyone needs custom packages at some point, but we purposefully did not choose to allow for private registries to avoid incentivising fragmentation. E.g. I'm not a fan of the OCaml situation, where every big company has their own compiler and package ecosystem.

Mirrors of the official registry instead have been accounted for, and are on the roadmap. I do not recall if the spec contains any mention of this yet, but the original RFC calls this "storage backends"

toastal commented 10 months ago

Sounds like we at least have some new test cases + descriptions to add regardless 😅