sqids / sqids-spec

Sqids specification for encoding multiple numbers into a single ID
https://sqids.org

A few questions #5

Closed aradalvand closed 1 year ago

aradalvand commented 1 year ago

Hi @4kimov, as I was building sqids-dotnet, a number of questions popped into my mind which I figured I'd ask here (the first is the most important):

1.

It seems like in the new algorithm, any string that contains only valid characters (valid meaning characters found in the alphabet) will always be decoded into some number; I tried feeding the decode function random strings and it returned a result every time.

This looks like a crucial point of difference compared to the old Hashids, which did not behave this way. It means that in Sqids, any given number effectively has multiple valid encoded representations. I'm not sure if this was also technically true of Hashids, but in Sqids it's far more extreme and apparent, since basically any string you give it yields a number.

So, my question is: Was this somehow intentional? Or was it merely a side-effect of the simplification of the algorithm?

This can be problematic. In the context of a web app, for instance, where these IDs are typically used as part of the URL to identify a resource, this practically means that multiple different URLs can point to the same resource, which results in URL duplication and is virtually always undesirable.

For example, with the default Sqids options, if you have a product with, say, the ID 3168, more than one URL will point to it, because several distinct ID strings all decode to 3168.

This is not good. Have you thought this through?

2.

Why does the min-length have to be less than the length of the alphabet? Characters are allowed to repeat in IDs, right? I'm probably missing something, but I'm not sure I understand the logic behind this particular limitation here: https://github.com/sqids/sqids-spec/blob/d33db792d4923186e0f24b68f8c15b0d8ae11f99/src/index.ts#L39-L47

3.

In the decode function, wouldn't it make sense to add another if clause like the one below that returns an empty array if the input ID contains fewer characters than minLength? https://github.com/sqids/sqids-spec/blob/d33db792d4923186e0f24b68f8c15b0d8ae11f99/src/index.ts#L201-L204

4kimov commented 1 year ago

Hi @aradalvand, good questions.

  1. Yes, intentional. The new algorithm relies on a default blocklist that will most likely be updated more than once, which means that previously generated IDs might all of a sudden be considered blocked. If it's the end user that has changed a custom blocklist, perhaps they're aware of which IDs on their end need adjusting, but if it's us that have updated the default blocklist, then they likely won't be aware, and some of their IDs might simply stop being decodable. That's why the decision to check for this kind of thing has been offloaded to the end user: they can re-encode the numbers and check whether the IDs match (if they care about checking for this at all). Obviously, Hashids did not have a blocklist, so it did not have this problem and we could simply check validity from within the algorithm.

  2. When we pad the ID with "junk" characters to satisfy minLength, it was just easier to grab a chunk from the available alphabet. If the needed chunk is larger than the alphabet, then we have to figure out how to grab more characters. The easiest way is to append the alphabet to itself as many times as needed until you can grab enough junk characters, but then you get a noticeable repeat pattern within the IDs (imagine the end user specifies a minLength of 1000). A more clever way is to re-randomize the alphabet every time you append it to itself, but then you're making the CPU work harder. All of this is to say that it didn't seem like a big enough problem to justify the extra complexity. If anybody ever has a unique use case where support for a larger minLength is that important, we should be able to fancy it up later without breaking existing IDs.
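The "append and re-randomize" idea above can be sketched roughly as follows. This is a hypothetical illustration, not part of the spec: `junkCharacters` is an invented name, and the shuffle is a plain Fisher-Yates stand-in rather than the spec's consistent shuffle.

```typescript
// Sketch: grow a pool of padding characters past the alphabet's length
// without a visible repeat pattern, by re-shuffling each appended copy.
function junkCharacters(alphabet: string, needed: number): string {
  let pool = alphabet;
  while (pool.length < needed) {
    // Shuffle a fresh copy of the alphabet before appending it, so long
    // padded IDs (e.g. minLength of 1000) don't repeat the same sequence.
    const chars = alphabet.split("");
    for (let i = chars.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [chars[i], chars[j]] = [chars[j], chars[i]];
    }
    pool += chars.join("");
  }
  return pool.slice(0, needed);
}
```

The trade-off mentioned above is visible here: each extra alphabet-length of padding costs one more shuffle pass.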

  3. I can imagine a scenario where the end user has been generating IDs and then midway decides to generate new IDs with minLength. In that case, the decode function might not work for the initially generated IDs, because the algorithm would consider them invalid for being too short. Again, probably better to let the end user decide.
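Since the check is left to the end user, a caller-side guard could look something like this. This is a sketch: `decodeStrict` is a hypothetical helper, and only the `decode(id: string): number[]` signature from the spec's src/index.ts is assumed.

```typescript
// Caller-side minLength enforcement, kept outside the library so that
// older, shorter IDs can still be decoded by calling decode() directly.
function decodeStrict(
  sqids: { decode(id: string): number[] },
  id: string,
  minLength: number
): number[] {
  // An ID shorter than the configured minLength could not have been
  // produced with that configuration, so treat it as invalid.
  if (id.length < minLength) {
    return [];
  }
  return sqids.decode(id);
}
```

Keeping this out of the spec means users who later raise minLength can still opt in to decoding their legacy IDs.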

aradalvand commented 1 year ago

@4kimov Thanks for the elaborate answer.

> Yes, intentional. The new algorithm relies on a default blocklist that will most likely be updated more than once, which means that previously generated IDs might all of a sudden be considered blocked. If it's the end user that has changed a custom blocklist, perhaps they're aware of which IDs on their end need adjusting, but if it's us that have updated the default blocklist, then they likely won't be aware, and some of their IDs might simply stop being decodable. That's why the decision to check for this kind of thing has been offloaded to the end user: they can re-encode the numbers and check whether the IDs match (if they care about checking for this at all). Obviously, Hashids did not have a blocklist, so it did not have this problem and we could simply check validity from within the algorithm.

Ah okay, that makes sense.

The idea of re-encoding the decoded number and then checking whether the result matches the original input ID, as a way of determining whether the original ID was the "canonical" encoding of the number, did actually occur to me.

I'm pretty sure I'll personally do this if I use Sqids in a web project, to prevent the URL duplication problem. So I think it would be helpful to hint at this in the README files of the Sqids projects so that people are aware of it, and possibly also to describe this re-encoding technique for those who might want to use it. This is a pretty notable characteristic of Sqids, but it's easy to miss, and it can have real practical implications where the IDs are used in website URLs and such, which is presumably the most common use case.

Wouldn't you agree?
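For reference, the re-encode check I have in mind can be sketched like this. It's an illustration against an assumed minimal encode/decode interface; `SqidsLike` and `isCanonical` are invented names, not part of any Sqids library.

```typescript
// Assumed minimal shape of a Sqids instance (as in the spec's src/index.ts).
interface SqidsLike {
  encode(numbers: number[]): string;
  decode(id: string): number[];
}

// An ID is canonical iff re-encoding its decoded numbers reproduces it
// exactly; any other string that happens to decode is an alias.
function isCanonical(sqids: SqidsLike, id: string): boolean {
  const numbers = sqids.decode(id);
  return numbers.length > 0 && sqids.encode(numbers) === id;
}
```

A web app could run this on incoming URL slugs and respond with a redirect to the canonical ID (or a 404) for non-canonical aliases.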

> When we pad the ID with "junk" characters to satisfy minLength, it was just easier to grab a chunk from the available alphabet. If the needed chunk is larger than the alphabet, then we have to figure out how to grab more characters. The easiest way is to append the alphabet to itself as many times as needed until you can grab enough junk characters, but then you get a noticeable repeat pattern within the IDs (imagine the end user specifies a minLength of 1000). A more clever way is to re-randomize the alphabet every time you append it to itself, but then you're making the CPU work harder. All of this is to say that it didn't seem like a big enough problem to justify the extra complexity. If anybody ever has a unique use case where support for a larger minLength is that important, we should be able to fancy it up later without breaking existing IDs.

Makes sense.

> I can imagine a scenario where the end user has been generating IDs and then midway decides to generate new IDs with minLength. In that case, the decode function might not work for the initially generated IDs, because the algorithm would consider them invalid for being too short. Again, probably better to let the end user decide.

Understood.

4kimov commented 1 year ago

> Wouldn't you agree?

Yes, I would. Feel free to add whatever you think is appropriate to the README. I've pushed an update to the site that includes a high-level question regarding encoding/decoding (number 4 under FAQs), and it looks like the spec repo already mentions something about this (last sentence under notes).

aradalvand commented 1 year ago

Great, thanks.