Reduce the number of characters needed for specifying which package is being selected

JordanMartinez commented 1 year ago

The new spago's CLI allows one to select a package by referring to it via package name. However, this choice is somewhat problematic in a few situations:

when the package name is long, whether due to multiple parts within the name (e.g. node-some-special-package-name) or due to long words within the name (e.g. uncopyrightable-sesquipedalian-package courtesy of 13 of the Longest Words in the English Language
when the letters used within the package name happen to be a particular sequence of characters that the current user is prone to misspell (e.g. switching i and o in -tion)

I propose we make it possible to reduce the number of character used. There are a few ways to do this:

In spago.yaml, add a new field cli-alias in the package section that allows one to use that alias string in addition to the package name. Given a package name uncopyrightable-sesquipedalian-package and an alias lib, both spago install -p uncopyrightable-sesquipedalian-package and spago install -p lib modify the same package's dependencies).
In the CLI, add some sort of fuzzy matching on the package names to guess which package is the selected one.

Of these two, I prefer the first one.

f-f commented 1 year ago

I wonder if this is something that needs to be implemented in spago itself?

Terminals offer facilities to alias things to shorten them (after all we are not the first to have this issue), e.g. on linux/macos one can do

alias spago-foo="spago -p node-some-special-package-name"

..so then one can always refer to that package with spago-foo install somepackage

JordanMartinez commented 1 year ago

Terminals offer facilities to alias things to shorten them....

I think that would be a terrible UI. If I have 5+ projects (poly repo or not), you're saying I should define a single alias for each one of those? And I should update them if I ever rename those projects?

Are there any stronger reasons against putting an alias in a spago.yaml file?

If that were possible and I aliased lib to the corresponding project's main library package, then I could see a .bashrc alias for slb which expands to spago -p lib build. slb would be IMO a proper alias because it's something that works in multiple contexts.

f-f commented 1 year ago

Are there any stronger reasons against putting an alias in a spago.yaml file?

A principle I apply often when dealing with things is "love the problem, not the solution". Before committing our minds to implementing a particular solution it is often worth exploring the problem space, as there might not even be a need implement anything in the first place. Any feature we add needs to be maintained and documented, adds to the confusion of browsing the docs when looking for solutions, and in this specific case adds to the noise of other configuration values in the file. Even if I don't have anything against a particular solution (and I don't here), it's just worth thinking about what alternative solutions could look like, since there might be a more lightweight one.

f-f commented 1 year ago

For example, it just occurred to me that - riffing on my previous idea of fuzzy matching package names - we could come up with matching abbreviations for hyphened names. So e.g. node-some-special-package-name would become nsspn and uncopyrightable-sesquipedalian-package would become usp. We could deal with collisions by adding the first different character to the shortening, so e.g. if you had both foo-bar-baz and foo-bar-bat then the first one would be fbbz and the second one fbbt. Would something like this work for you?

I prefer some other approach (like this fuzzy matching) over adding a configuration value because:

it's one less config value that everyone needs to deal with
users don't have to configure anything and it's already enabled for everyone

JordanMartinez commented 1 year ago

Ah, sorry. I didn't realize you were just brainstorming ideas in response to what I had proposed. Instead, I read it as you implicitly saying no and instead suggesting a different way to solve the problem.

So far, we've identified the following design questions:

how explicit does the user need to be? / how much "magic" is happening behind the scenes?
does it require supporting another field within spago.yaml?

Where the current ideas breakdown as:

Idea	Description	`spago.yaml`	Explicitness	Magic-ness
1	Don't implement feature	Unchanged	Explicit	None
2	Allow alias defined in config	New optional field	Explicit	None
3	Unique-enough-fuzzy-matching	Unchanged	Implicit	Mostly predictable

Within Idea 1 are shell aliases or some custom scripts someone could write (e.g. foo.sh that contains spago build -p long-package-name). This isn't ideal because of the issues I've already mentioned about shell aliases needing to be updated if the package name changes. As for custom scripts, I'd argue that they shouldn't be needed for a simple common task like this.

Idea 2 has already been described but there's another thing that would need to be done. In the event that two packages have the same unique alias (e.g. due to a copy-pasting a spago.yaml file in a multi-package workspace), spago would need to report an error. Ideally, this would happen immediately whether such a command that involves choosing a package was called or not, so that the user fixes it before it hits them later (e.g. in CI). Potentially, the user could be prompted in the terminal when the problem is discovered and force them to fix it there. However, I think I'd prefer the error over the prompt.

Idea 3 works likely 90% of the time but can be unpredictable. Take foreign-node-package. Originally, a command like spago build -p fnp would select that package. But now, we add a new package used as a script called find-new-packages. That script's fuzzy-matched name is also fnp. However, that's no longer unique enough. So to select foreign-node-package, the new unique-enough-fuzzy-matched name is spago build -p fonp, which breaks all the old scripts that originally relied upon spago build -p fnp. If you say, "Let's just fuzzy-match on the first two characters of each part of the name (e.g. fo-no-pa)", I would respond with, "But that doesn't solve the problem if the same thing happens even with that 2-character threshold. The obvious solution at that point would be to increase it to a 3-character threshold. But regardless of how this particular problem is addressed, lib is still shorter than any non-1-character threshold and scripts don't break unless the alias itself is explicitly renamed".

Unless there's another way this could be done or another issue I'm not considering, I still think Idea 2 has the current best tradeoffs.

JordanMartinez commented 1 year ago

@f-f Any new thoughts on this and whether to support this?

f-f commented 1 year ago

Yeah, I would like to support this. I am still of the opinion that idea3 has the better tradeoffs as it would work out of the box for everyone without need for configuration ("no need for configuration" is one of the reasons why people prefer esbuild over webpack, and it's one of Spago's north stars. The "automatic package discovery" for monorepos is on this same vibe).

I don't think "adding a new package with a clashing name" is a big concern - we'd recommend people to not use the short aliases in scripts, in the same way that we recommend people to not use branches as git refs.

JordanMartinez commented 1 year ago

I agree about the lack of configuration being a selling point. If my assumption about the probability of 2 or more packages having the same fuzzy match value is correct, then this might be a moot point. Even coming up with my example was hard. That being said, there is a better example worth considering. If we are going to migrate all of the core libraries into a single repo (and the same for Contrib, Node, and Web libraries), where would the current idea of a fuzzy matcher fail?

I also wonder if we could further specify the fuzzy matching idea to decrease the probability of duplicates like that happening. I did some searching and via a StackOverflow answer found the post, Comparison of String Distance Algorithms.

f-f commented 1 year ago

I also wonder if we could further specify the fuzzy matching idea to decrease the probability of duplicates like that happening

These are good finds. I'm not sure though that this is a typical example of fuzzy matching: the interface for the examples you linked is interactive, so the user can see how the search is performing while they are typing something. In our situation instead, we'd want to come up with some alias so that the user knows what they will have to type to get it. So in this sense this is not about trying to match any string, but more about how to best shorten something as much as possible while also trying to avoid conflicts.

UI-wise, I was thinking we'd show the aliases to the user when they type in any command that lists the packages, e.g.:

$ spago build
Refreshing the Registry Index...
Reading Spago workspace configuration...
Read the package set from the registry

✅ Selecting 4 packages to build:
    docs-search    (alias: ds)
    spago          (alias: s)
    spago-bin      (alias: sb)
    spago-core     (alias: sc)

Edit: I'm cursed with a terrible memory so I am not sure where I've seen this "shorten words to their first letters" kind of algorithm - though I'm sure I've seen it used before. It reminds me of Huffman coding and prefix trees, but it's far from either. As a sidenote, a key point is that whatever encoding we pick needs to be memorable for users, which "first letter" usually is.

f-f commented 1 year ago

Exploring your idea of how this would work for core: I got the list of all the repos that might be purescript libraries (there are some spurious results, but bare with me):

$ gh repo list purescript -L 100 | grep "purescript-" | cut -d' ' -f1 | cut -f1 | sed "s|purescript/purescript-||g" | sort
arrays
assert
bifunctors
catenable-lists
console
const
contravariant
control
datetime
distributive
docs-search
effect
either
enums
exceptions
exists
filterable
foldable-traversable
foreign
foreign-object
free
functions
functors
gen
graphs
identity
in-purescript
integers
invariant
json
lazy
lcg
lists
maybe
metadata
minibench
newtype
nonempty
numbers
ordered-collections
orders
parallel
partial
prelude
profunctor
psci-support
quickcheck
random
record
refs
safe-coerce
semirings
st
strings
tailrec
transformers
tuples
type-equality
typelevel-prelude
unfoldable
unsafe-coerce
validation

Shortening these with the algorithm I proposed would give us:

arrays               => ar
assert               => as
bifunctors           => b
catenable-lists      => cl
console              => conso
const                => const
contravariant        => contra
control              => contro
datetime             => da
distributive         => di
docs-search          => ds
effect               => ef
either               => ei
enums                => en
exceptions           => exc
exists               => exi
filterable           => fi
foldable-traversable => ft
foreign              => for
foreign-object       => fo
free                 => fr
functions            => functi
functors             => functo
gen                  => ge
graphs               => gr
identity             => id
in-purescript        => ip
integers             => int
invariant            => inv
json                 => j
lazy                 => la
lcg                  => lc
lists                => li
maybe                => ma
metadata             => me
minibench            => mi
newtype              => ne
nonempty             => no
numbers              => nu
ordered-collections  => oc
orders               => or
parallel             => para
partial              => part
prelude              => pre
profunctor           => pro
psci-support         => ps
quickcheck           => q
random               => ra
record               => rec
refs                 => ref
safe-coerce          => sc
semirings            => se
st                   => st
strings              => str
tailrec              => ta
transformers         => tr
tuples               => tu
type-equality        => te
typelevel-prelude    => tp
unfoldable           => un
unsafe-coerce        => uc
validation           => v

This list has some pretty juicy corner cases, so it's a very good candidate for writing a test once we implement this 🙂

f-f commented 1 year ago

There would be a fuzzy-matching part too - if the user would type spago build -p semi, we'd still build semiring, even if just typing se would have been enough. I guess this is actually prefix matching though, not very fuzzy 😄

JordanMartinez commented 1 year ago

I was typing this as you responded with the two comments above. Given the above list, I'm not sure how useful it would be now that we have some hard data. But I'll still post it.

How about this implementation? It's predictable, reduces the character count, and is flexible in how explicit one wants to be.

A user-inputted text matches a package name if, after splitting the user-input text and package name text into segments whenever a - character is found, the following are both true:

(segment check) the user-input segment at index n satisfies the prefix/suffix check for the package name segment n (if the user-input segment is empty, then this check is always satisfied
(segment length check) the number of user-input segments equals the number of package name segments
(first letter requirement) to prevent clashes with CLI flag/argument syntax, the first character must always be specified

Prefix/Suffix check. A user-inputted segment is further split via a * character. We use left-of-* text to do a prefix test on package name segment and right-of-* text to do a suffix test. When the * is absent, any text provided is a prefix-based test

Here are a few examples of the idea:

User input	Package	Match?	Why
`foo`	`food`	Yes	Absence of `*` means we do a prefix test, which passes; same number of segments: 1
`*ood`	`food`	Yes	"ood" is a suffix of "food" and the segment lengths are the same
`foo`	`food-bar`	No	The segment length differs
`foo`	`fo`	No	"foo" is not prefix of "fo"
`f-b`	`foo-bar`	Yes	both user-inputted segments are prefixes of package name segments
`o-r`	`foo-bar`	Yes	both user-inputted segments are suffixes of package name segments
`foo-b`	`foo-bar`	Yes	while the first segment is slightly more explicit than the above, it is still a prefix of the package's corresponding segment
`fo-ba`	`foo-bar`	Yes	while the segments are still slightly different, they still are prefixes of the package segments
`f-bar`	`foo-bar-baz`	No	user-segment length is one less than package name segment length
`f--b`	`foo-bar-baz`	Yes	the segment lengths are the same and the second segment is empty, so it passes the prefix/suffix check of the `bar` segment
`--b`	`foo-bar-baz`	No	because this would clash with CLI flag/arg syntax due to the usage of `-` as the first character
`*--b`	`foo-bar-baz`	Yes	because the first and second segments are empty while the third is a prefix of `baz`

This is simple to implement now without the need for additional libraries. Implementing things this way encourages the following:

by convention, packages are sequences of words separated by hyphens (convention used in most packages today)
segment-length check prevents confusion around whether f-bar will match both foo-bar and foo-bar-baz
indicating empty segments via - allows one to skip parts that are irrelevant
prefix-by-default approach allows one to be as explicit as they want (e.g. package foo-bar is selected when user types f-, f-b, foo-b, or the actual package name, foo-bar).
* character allows prefix-only, suffix-only, or both kinds of tests on segments

The combination of the segment length check and support for both prefix/suffix checks can help address my concern about how adding a new package could invalidate a command used elsewhere because the aliased package name there is no longer unique. In other words spago build -p f*o may have matched the package foo. If I later add a new package foo-bar, foo doesn't now also match foo-bar as the segment lengths differs. If I alter add a package named food, there's still no ambiguity because of the suffix match.

Using BNF grammar:

fuzzy-match: <segment> ('-' <segment>?)*

segment: <text>
  | <text> ('*' <text>)?
  | '*' <text>?

text: [a-zA-Z0-9]+

f-f commented 1 year ago

I think having to deal with characters like - and * adds a bunch of complications, as the shell will want to expand the *, and we'd have to be careful to not have the hyphens be accidental flags. Suffix matching is also going to be ambiguous and confusing, and likely unnecessary: in the example I posted above most of the packages can be addressed by typing 2 letters of their prefix

JordanMartinez commented 1 year ago

Ah, I forgot about * also being an issue in the CLI. Fair points.

Another way we could address my concern about commands changing is adding a command for creating a new subpackage. When this is done, we could also check to see which aliases will need to be changed once the new subpackage is added.

f-f commented 1 year ago

Another way we could address my concern about commands changing is adding a command for creating a new subpackage. When this is done, we could also check to see which aliases will need to be changed once the new subpackage is added.

Ah yes, would be nice to have a spago init --subpackage foo. We could then recompute all the aliases, tell the user the new one, and optionally issue a warning noting that the aliases have changed. I like this!

CharlesTaylor7 commented 1 year ago

Have shell completions been considered as a solution?

Git ships with shell completions which helps with switching between branches that may be long. I see this issue as being fundamentally similar.

I looked at the readme for spago, and noticed it already has a section on generating completions. This lead me to discovering that optparse-applicative makes it possible to customize completions of arguments.

https://github.com/f-o-a-m/purescript-optparse#actions-and-completers

So theoretically we could make the -p option autocomplete based on the defined packages in the workspace.

The user could type

spago build -p uncopy<tab>

and have that expand to

spago build -p uncopyrightable-sesquipedalian-package

JordanMartinez commented 1 year ago

That works so long as the shell supports that kind of completion, right? Like does this work on bash, sh, fish, zsh, and whatever other shell will come along? Or does it need special support for each of those shells? If the former, then this sounds like the better solution (though I'm not sure how much work that would be). If the latter, than I think Fabrizio's idea is still better.

f-f commented 1 year ago

This works on bash, zsh and fish, but it requires manual installation. I don't know if we can do it with npm or nix, but I'm afraid it will have to be manual. It's also not always clear how to install these actually - the readme documents the way I managed to install shell completions myself, but there might be a more reliable one.

In any case, this fits under the box "it's not enabled by default but a user has to", which hurts the convenience.

I'm fine with implementing both avenues at this point.

JordanMartinez commented 8 months ago

Coming back to this due to #1184...

Ah yes, would be nice to have a spago init --subpackage foo.

Looks like this idea is blocked on #1156

Have shell completions been considered as a solution?

But this idea is not blocked AFAIK.

purescript / spago

Reduce the number of characters needed for specifying which package is being selected #985