Sqids adds too many junk characters when encoding multiple numbers with a min length specified

aradalvand commented 1 year ago

Hi there, I just noticed something: When no min length is specified:

const sqids = new Sqids({
    alphabet: '3SO4VRHEP8K0W1NJMCBT6DYA79Q2ZIG5UXFL',
});

console.log(sqids.encode([0, 1000]));
console.log(sqids.encode([0, 1001]));
console.log(sqids.encode([1, 1]));
console.log(sqids.encode([1, 2]));

The result:

P4X75
VIYB0
FHQ5
ZDT5

Now, if you set minLength to 5, this will be the result:

P4X75
VIYB0
L428TX
FS856C

Why did the last two suddenly jump to 6 characters?! That seems unnecessary given that they were just encoded with 4 characters, why does specifying the minLength cause Sqids to add 2 more characters, rather than 1, which would've been sufficient to meet the min-length requirement?

This doesn't actually make sense, correct me if I'm wrong but it seems like some sort of a bug. It can be especially undesirable since usually the point of setting a min length is aesthetic uniformity, which this particular quirk basically defeats.

4kimov commented 1 year ago

In the first 2 examples, the min length requirement is met, so there's nothing extra to do.

In the last 2 examples, the min length requirement is not met, so the code enters the additional padding section. This is how it works:

It checks if the "partition" character is set (junk data used for incrementing in case blocklist match is found in the future). Because it's not set initially, it's prepended to the array, so for [1, 1] the new array becomes [0, 1, 1] (first element is partition, the last two are actual data)
Because a new number has been added, a new separator is needed to separate partition number from the actual data.

So for example: encoding [1, 1], the id breakdown for L428TX becomes:

L is the randomizer offset that every id has
4 is partition number (0)
2 is the separator
8 is first real data element (1)
T is the separator
X is second real data element (1)

aradalvand commented 1 year ago

@4kimov I see, but wouldn't it make sense to rethink the logic for adding junk characters so as to avoid this? I also tested with Hashids and it doesn't have this problem (adds only as many characters as needed). This is what it returns for [1,1] and [1,2] with a min length of 5:

V86FK
4KGFG

Sqids's current behavior here is unexpected and also potentially undesirable from the user's PoV, even though it's explicable in the context of the algorithm's details.

4kimov commented 1 year ago

Hashids is designed differently.

The previous algorithm doesn't have the same problem because a) it didn't have to handle a blocklist and b) it reserved too many characters initially - some as separators, some as guards. This resulted in more patterns (for example I can see that "F" is a separator in both ids above). Additionally, because it isolated so many characters for internal use, there's less alphabet used for encoding actual data (smaller alphabet = larger IDs), so there's that downside also.

Sqids's current behavior here is unexpected and also potentially undesirable from the user's PoV

I can't speak for desirable or undesirable, but it should certainly be expected. The parameter is called minLength for a reason - the only thing it promises is that the generated ID will be at least that length.

aradalvand commented 1 year ago

I can't speak for desirable or undesirable, but it should certainly be expected. The parameter is called minLength for a reason - the only thing it promises is that the generated ID will be at least that length.

@4kimov Technically true, but the purpose for which this parameter is often used in practice is aesthetic; you typically set a relatively high min length so that your IDs look uniform (e.g. that an entity with the ID of 2, and another with 394 have public IDs that are the same length) — in fact, minLength doesn't have virtually any other applications, as far as I can tell, does it?

And this is an "expectation" that Sqids currently violates in the aforementioned scenarios.

This resulted in more patterns (for example I can see that "F" is a separator in both ids above).

Regarding this point in particular and as a side-note, Sqids seems to have the recognizable pattern problem as well, just in different ways. For example, another thing I stumbled into is that e.g. [3, 1] and [1, 3] are encoded as F85NHC and F853H0 respectively (the first part F85 is common between the two), hinting at a commonality between the values behind the two IDs. I've seen many instances like this, where a common number in a set of numbers encoded by Sqids yields identical bits. I didn't observe the same thing in Hashids.

I'm not trying to suggest that Sqids should just do whatever Hashids is doing, Sqids is obviously an improvement over Hashids in a lot of ways and was meant as such, I'm just saying that there are a few things that Hashids seems to do better, which Sqids should probably take inspiration from. And this is a good time to talk about that stuff, before Sqids acquires wide adoption, at which point breaking changes would no longer be feasible.

The junk character insertion, I would argue, is clearly one of those things. Isn't there a way this specific part of Sqids's algorithm can be reworked to solve this issue?

aradalvand commented 1 year ago

I might be missing something but what would be wrong with simply adding a long enough slice of the alphabet to make up for the missing length? See this PR (made the PR just so you can see the code diff), seems to work, the current algorithm is already partly doing this.

aradalvand commented 1 year ago

ping @4kimov?

4kimov commented 1 year ago

I'm not sure I have much else to add to this, since I've given the reasons for the technical choices taken + the limitations. I am not sure I see how your PR would address accidentally appending a blocked substring. I also personally don't consider this issue to be that big of a problem.

Having said that, if you do find an improvement that has an elegant fix (meaning does not increase complexity of code and has clear benefits), please create a PR in the spec code (with working tests), and people can chime in on whether it's worth asking everyone to pivot to that or not.

I'll close this discussion for now.

aradalvand commented 1 year ago

I am not sure I see how your PR would address accidentally appending a blocked substring

@4kimov Don't the next few lines of code literally take care of that?! If it does "accidentally append a blocked substring" then the blocked word logic kicks in immediately afterwards. What's the problem?

Having said that, if you do find an improvement that has an elegant fix (meaning does not increase complexity of code and has clear benefits)

I've presented what I believe to be just that. Feel free to correct me if I'm mistaken.

I also personally don't consider this issue to be that big of a problem.

It's little problems like this compounding together that could make Sqids vs. Hashids even a debate. I find it surprising that you're just hand-waving these away.

sqids / sqids-spec

Sqids adds too many junk characters when encoding multiple numbers with a min length specified #6