sqids / sqids-spec

Sqids specification for encoding multiple numbers into a single ID
https://sqids.org

Simpler #11

Closed by 4kimov 1 year ago

4kimov commented 1 year ago

Since the initial library has been out for a few weeks, I got to mess around with it a bit more and realized there are a few rough edges we could iron out:

  1. Lower the reserved character count from 3 to 1 (more data packed into generated IDs)
  2. As a side-effect of number one, min alphabet length can decrease from 5 to 3
  3. Decrease blocklist-related re-generation attempts from almost-unlimited to the length of the alphabet (cc @tzvetkoff)
  4. As a result of number three, re-generated IDs don't have to grow in length (try blocking "SrIu" in the playground to see what I'm talking about; a rough sketch of the bounded retry loop follows this list)
  5. Also as a result of number three, min length padding is smoother now (cc @joviczarko)
  6. Sounds like minValue and maxValue are not that useful, let's remove? (cc @peterhellberg)
  7. Increase min length limit from alphabet-length to an arbitrary value (1_000 for example)
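
For illustration, here's a rough TypeScript sketch of the bounded re-generation idea behind items 3 and 4 (not the spec's actual code; encodeNumbers and isBlocked stand in for the real per-attempt encoder and blocklist check):

// Hypothetical sketch only - the names are illustrative, not the spec's API.
declare function encodeNumbers(numbers: number[], increment: number): string;
declare function isBlocked(id: string): boolean;

function encodeWithRetries(numbers: number[], alphabet: string): string {
  // At most alphabet.length attempts; each retry shifts the starting offset
  // instead of appending characters, so the ID never grows in length.
  for (let increment = 0; increment < alphabet.length; increment++) {
    const id = encodeNumbers(numbers, increment);
    if (!isBlocked(id)) {
      return id;
    }
  }
  throw new Error('Reached max attempts to re-generate the ID');
}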

Pros:

Cons:

Unknowns:


Current: https://sqids.org/playground (algorithm code / explanation)
Proposed: https://sqids.org/playground/simpler (algorithm code / explanation)


P.S: Also worth noting that I have no intention of changing the algorithm over and over; I figured this might be cleaner + address some current issues/questions, while we're still technically mostly pre-prod.

Thoughts/feedback?

cc @laserpants @vinkla @niieani @antimon2

peterhellberg commented 1 year ago
  1. 👍🏻 I'm fully on board with removing minValue and maxValue
miquelfire commented 1 year ago

I think maxValue should be an optional function. If a library uses, for example, 32-bit integers, it would be nice for maxValue to return a value that won't break the library, so you know how big a number can get before an overflow error occurs. Libraries that use a number system that is a bit insane (the PHP library has BCMath and GMP, and JavaScript could use BigInt, whose max value would require over 2KB of memory to store!) don't need maxValue.
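
For illustration, a hypothetical TypeScript sketch of what an optional maxValue could look like from the caller's side (the names are made up for this example, not the actual library API):

// Hypothetical shape - illustrative only.
interface SqidsLimits {
  // Implementations backed by fixed-width integers can report their ceiling;
  // arbitrary-precision ones (BCMath/GMP, BigInt) can simply omit it.
  maxValue?: () => number;
}

function fitsInEncoder(limits: SqidsLimits, n: number): boolean {
  return limits.maxValue === undefined || n <= limits.maxValue();
}

// e.g. a 32-bit build would report 2^31 - 1:
const thirtyTwoBit: SqidsLimits = { maxValue: () => 2 ** 31 - 1 };
console.log(fitsInEncoder(thirtyTwoBit, 5_000_000_000)); // false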

laserpants commented 1 year ago

This sounds reasonable. I am :ok: with those changes.

I am wondering about something, though: since each language implementation defines its own maxValue (even if it is implicit), you can currently encode a sequence of numbers in one language into an ID that subsequently fails to decode in another language implementation (using the same alphabet and configuration). Could this be a problem?

For example, using the default config in the Haskell version I can generate an ID:

sqids (encode [9223372036854775807])
Right "Ack7vOaIrINC"

But Ack7vOaIrINC is an invalid ID in the JS version.

laserpants commented 1 year ago

Update: The below idea probably doesn't solve anything, since the fundamental problem here is more about integer precision: unless the number is encoded as a string, there would need to be a suitable data type (BigInt) to hold the number in the first place, in which case that type can probably be used directly for encoding/decoding (perhaps via separate entry points like the following). Not sure how this would work otherwise.

encode(int)
encodeBig(BigInt)

So maybe not. :thinking: But nevertheless, it could be nice to distinguish, when decoding, between invalid IDs, and IDs that are (technically) valid, but too large for the language to handle.


Would it be possible, and useful, to support arbitrarily large numbers by introducing a preliminary step such as the following:

  1. Remove one character from the alphabet and use this as a separator character.
  2. Split the number into groups of, say, 9 digits each. (Based on an assumption that all languages can handle integers <= 999999999.)
  3. Encode each of these groups (as before) using the remaining characters of the alphabet.
  4. Join the encoded IDs into one ID, separated by the character from step 1.

The number in my previous example, 9223372036854775807, would then be partitioned as:

922337203 685477580 7

With the default alphabet, a becomes the separator and the remaining characters bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 are then used to encode these numbers. With the old algorithm, we get:

9DVndvc OSFDIQz 6B

These IDs are then combined into the final ID, intercalating the a:

9DVndvcaOSFDIQza6B
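
For what it's worth, a minimal TypeScript sketch of the four steps above (encodeChunk is a stand-in for encoding a single number with the reduced alphabet; this only illustrates the idea, it's not a proposal for the spec):

// Illustrative sketch of the chunk-and-join idea.
declare function encodeChunk(n: number, alphabet: string): string;

function encodeBigDecimal(digits: string, alphabet: string): string {
  const separator = alphabet[0];                // step 1: reserve one character
  const rest = alphabet.slice(1);               // the rest of the alphabet does the encoding
  const groups = digits.match(/.{1,9}/g) ?? []; // step 2: split into groups of 9 digits
  return groups
    .map((group) => encodeChunk(Number(group), rest)) // step 3: encode each group
    .join(separator);                                 // step 4: join with the separator
}

// "9223372036854775807" splits into ["922337203", "685477580", "7"], which are
// encoded separately and then joined with "a".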

4kimov commented 1 year ago

@peterhellberg Cool


@miquelfire Question for you. For the PHP version, one of the two extensions is required, but we still use PHP_INT_MAX for the max value. Is my understanding correct that this value will change dynamically based on the system/extensions? If yes, it sounds like maxValue() would be useful. And maybe you're right - perhaps it's worth making that function optional.


@laserpants Yeah, good thought process. This was actually considered in the very first version of Hashids - what if we split up large numbers and use a dedicated separator. But then the cost slowly started outweighing the benefit:

That's why we kinda rolled back into keeping the lib clean and simple, and letting the user handle upper limits of ints.

Edit: And to answer your question about the Haskell version generating an ID that JS can't decode: to me, it doesn't feel like it should be the library's issue, because if the user has two systems (one encoding via Haskell, the other decoding via JS), they should account for scenarios like these on their end. It would be our problem if we didn't communicate clearly what each language supports.

antimon2 commented 1 year ago

Good morning everyone (it's morning here in Japan).

Regarding maxValue(): I agree with "making that function optional". In fact, in my Julia implementation, by default it processes numbers within the range of the system's Int type (Int64), and maxValue() is set to return typemax(Int) (the maximum value of the Int type). However, with a custom extension, passing the option strict=false during configuration removes the upper limit (using number types like Int128 or BigInt), and the implementation then throws a NoMethodError for maxValue().

As a further extension, I had considered that it might be interesting to "specify a switch that represents the maximum value or range of values during configuration". For example, mode=:compatible would handle only integers within the range JS (TS) can handle (53 bits); mode=:default (the default setting) would handle the range of the language's native integer type (not extended precision); and mode=:unlimited would use BigInt to remove the upper limit (invalid for languages that don't support it), and so on.
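
As a rough TypeScript illustration of that switch (the names and bounds are hypothetical, not part of the spec):

// Hypothetical mode switch (names and bounds made up for illustration): each mode
// maps to an effective upper bound chosen at configuration time.
type Mode = 'compatible' | 'default' | 'unlimited';

const MODE_MAX_VALUE: Record<Mode, bigint | undefined> = {
  compatible: BigInt(Number.MAX_SAFE_INTEGER), // 53 bits: what JS (TS) can handle without BigInt
  default: 2n ** 63n - 1n,                     // e.g. the language's native Int64 range
  unlimited: undefined,                        // no upper bound; needs BigInt-style support
};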

I will check the new code and specifications when I have time.

miquelfire commented 1 year ago

PHP_INT_MAX is from before Hashids required one of the two extensions. What it's defined as depends on whether PHP was compiled as 32-bit or 64-bit. None of the BCMath or GMP code even needs to look at it. Though, if the value passed to the encode function is above PHP_INT_MAX, it might be converted to a float and lose precision if it wasn't a string value to begin with.

laserpants commented 1 year ago

> That's why we kinda rolled back into keeping the lib clean and simple, and letting the user handle upper limits of ints.

Thanks for clarifying this @4kimov. I had a feeling that this is something you'd thought about already. I agree that it should be up to the user to account for these inconsistencies. But doesn't this mean that we still need to make it possible to distinguish between invalid IDs and IDs that fail to decode because of overflow? What if the implementation of toNumber had something like the following:

private toNumber(id: string, alphabet: string): number {
  const chars = alphabet.split('')
  return id.split('').reduce((a, v) => {
    const u = a * chars.length + chars.indexOf(v);
    if (u > this.maxValue()) {
      throw 'Number out of range';
    }
    return u;
  }, 0)
}

In client code, one could then do:

function checkId(id: string) {
  try {
    const numbers = sqids.decode(id);
    if (!numbers.length) {
      console.log('Invalid ID')
    } else {
      console.log(numbers)
    }
  } catch (e) {
    // Number out of range
    console.error('Error: ', e);
  }
}

checkId("8QRLaD")          // [ 1, 2, 3 ]
checkId("adfadsfa")        // Invalid ID

// ID from my previous example, encoded via Haskell
checkId("Ack7vOaIrINC")    // Error:  Number out of range

If this creates a performance hit, then another option is to not do this check in toNumber, and instead introduce a new method canDecode(id). Users can then choose to test their IDs for overflow prior to decoding, if this is relevant to their use case. For example:

if (sqids.canDecode(id)) {
  const numbers = sqids.decode(id);
  // Do stuff with numbers
} else {
  // At least one number in the sequence is too large for this language. 
}
peterhellberg commented 1 year ago

I'm a bit concerned about this turning into an XY problem: do we even want to be able to decode IDs from an unknown source? My understanding is that you are primarily meant to decode "your own" IDs, which makes the cross-language argument a bit less critical. (Since you then fully control what numbers you are encoding into your IDs.)

Also, we are still talking about unsigned integers, right? From the point of view of Go that would be being able to encode/decode []uint64, which might not be fully representable in let's say JavaScript. (Without resorting to BigInt that is)

Another approach would be to decide (in the spec) on some lowest common denominator for the maximum value all implementations should be able to encode/decode, and then not support languages with smaller (unsigned) integer types. This is not an avenue I'd prefer to go down, but thought I should mention it at least.

laserpants commented 1 year ago

@peterhellberg: One example would be if you have a solution stack where the back-end application generates IDs from (potentially) large integers, and these IDs are then passed via the URL down to a front-end app, written in another language. Clearly, there is a bug here if the back-end is generating IDs from integers that are too large for the front-end to handle, but it is still useful to be able to distinguish between this situation and IDs that are algorithmically invalid.

aradalvand commented 1 year ago

This is absolutely genius and perfect. I'm totally on board. It solves precisely the few qualms I had with the previous algorithm (i.e. the rough min-length handling and IDs growing during blocklist re-generation), and does so brilliantly. Thank you @4kimov! Pure perfection.

Removing minValue and maxValue (or at least making them optional) also sounds like a good idea; having worked on the .NET implementation, I would say they did feel quite useless.

I'll start re-working sqids-dotnet as soon as this gets finalized.

4kimov commented 1 year ago

@antimon2 Ok, I see. Yes, it sounds like minValue/maxValue should be optional, so it looks like we're leaning in the direction of leaving them out of the spec. Introducing a compatibility mode might be tricky, because JS is not the leading implementation (it's just one of many), so you'd have to either pick one implementation "to look up to" or assume there is a max limit in the spec (and there isn't one). IMO, ideally each implementation would either a) support the highest int possible, or b) support a range of ints, if the language allows.


@miquelfire I see, makes sense. Sounds like, from the 32/64-bit argument alone, the PHP implementation might benefit from exposing maxValue to the user (again, voluntarily). Thank you for elaborating.


@laserpants Before we discuss overflow, I figured it's important to clarify: in the playground, "Error: Invalid ID" is a very poor description for a message that comes from re-encoding the decoded numbers and seeing that the result doesn't match the input ID. It's basically a simple way of saying "something went wrong, but we don't know what".

@peterhellberg Like @laserpants has mentioned, I've also seen people talk about how they might encode with one language, and decode with another. I wouldn't be surprised to find these use-cases in larger systems.

@laserpants Thanks for the code samples, they were helpful. First of all, I don't think the maxValue check would be that big of a performance hit (at least in JS). Second of all, right now in JS only encode can throw an error, and decode never does (it just returns an empty array), so I do like the fact that this would make the two APIs more uniform. Also, I'm a fan of giving more descriptive error messages. At the same time, not all languages would be able to throw that error, because in many you don't have access to "max value + 1", so that extra check is not an option.

If I were to design a split system like the above, and my decoding microservice got the "Number out of range" error message, there's nothing I'd be able to do programmatically to address the issue (other than notifying APM / DevOps). That would mean my encoding microservice went over the boundaries of what my decoding microservice is capable of handling, which theoretically is my fault for designing it this way. IMO, the proper way to design it would be to set an upper limit on the encoding microservice, outside of the library - up to the number the decoding microservice can support.

So, the above makes me think that it'd be helpful for finding out what the error is, but not much else, which kind of circles me back to proper documentation on int limits. I'd appreciate some feedback on this in case I'm missing the point.
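
To make that concrete, here's a minimal sketch of capping values on the encoding side, outside of the library (the names are hypothetical; DECODER_MAX_VALUE is whatever the decoding service supports):

// Hypothetical guard placed in the encoding microservice, outside the library.
const DECODER_MAX_VALUE = Number.MAX_SAFE_INTEGER; // e.g. a JS-based decoding service's limit

function encodeForDownstream(sqids: { encode(numbers: number[]): string }, numbers: number[]): string {
  if (numbers.some((n) => n > DECODER_MAX_VALUE)) {
    throw new Error('Value exceeds what the decoding service can handle');
  }
  return sqids.encode(numbers);
}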


@aradalvand Thank you for the kind words, and the efforts.

peterhellberg commented 1 year ago

@4kimov My point was not that you would use a single language, but rather that you would be able to control the range of numeric values encoded into ids used in your own system, even if using more than one language.

miquelfire commented 1 year ago

With the logic we're using for the PHP library (I just looked at the code), JavaScript might want the maxValue function to return Number.MAX_SAFE_INTEGER, even if it used BigInt, to avoid overflow issues.

4kimov commented 1 year ago

@peterhellberg Ah I see, sorry, I misunderstood. Good, thanks for clarifying.


@miquelfire Yeah, BigInt in JS is something I might need help with.


Just pushed a few more cleanup items:

antimon2 commented 1 year ago

I've reviewed the new code and specifications thoroughly, and I largely agree with them. I also understand and concur with the philosophy that "upper limit of value should be set outside the library."

@4kimov There's just one point I'd like to raise regarding the code (and the specifications presented at the beginning of this issue). There's a mention about increasing the min length limit (1_000 for example), and I've noticed the magic number 1_000 written in the code and tests. To be honest, this feels awkward. I believe this number doesn't have any special significance (I understand it's just an example). Wouldn't it be more appropriate to define a static constant indicating the "max limit for the minimum length" (e.g., const MIN_LENGTH_LIMIT = 1_000;) and use that instead? If, for any reason, the specifications change and the value needs to be adjusted, we'd only have to modify the constant's definition.
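
A minimal sketch of that suggestion (the constant name is the one proposed above; the value matches the current draft):

// One named constant instead of a magic number scattered across code and tests;
// a future spec change only has to touch this definition.
const MIN_LENGTH_LIMIT = 1_000;

function checkMinLength(minLength: number): void {
  if (!Number.isInteger(minLength) || minLength < 0 || minLength > MIN_LENGTH_LIMIT) {
    throw new Error(`Minimum length has to be between 0 and ${MIN_LENGTH_LIMIT}`);
  }
}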

4kimov commented 1 year ago

@antimon2 Thank you for reviewing. Do you mean that the variable should be set in one place for the sake of the spec's readability & cleanliness, or for the purpose of exporting it to the end user so they have access to it?

antimon2 commented 1 year ago

@4kimov It's the former. I think there's no need to export it (I forgot to mention that earlier).

4kimov commented 1 year ago

@antimon2 Ah ok. Yeah, that spec repo is optimized for readability, so it'll have a few awkward places, I'm sure. I cleaned it up a bit (it's set twice only because I don't want to export it from the library and give the impression it's needed externally).

antimon2 commented 1 year ago

@4kimov I misunderstood one thing. The sqids-spec code is merely a specification, so it doesn't necessarily prioritize practicality or performance, correct?

One more point to confirm: Is MIN_LENGTH_LIMIT necessary? At the beginning of this issue, an improvement is described as follows:

> 7. Increase min length limit from alphabet-length to an arbitrary value

It adds "(1_000 for example)", but as I've reiterated, this is just an example, right? Is there a need to set a specific value? My understanding is that the practical upper limit for minLength has effectively been removed in the algorithm. So, shouldn't this also be made optional, like maxValue()? In other words, can we see it as follows?:

4kimov commented 1 year ago

@antimon2 My train of thought was simple: I can't imagine anyone needing super long IDs, especially since the library's goal is to generate short IDs, not long ones. So I'm not even sure why someone would set it to 1_000 (to me personally, it becomes useless after 6 or 8).

We can certainly entertain the idea of raising it or dropping it, but what would be the benefit?

antimon2 commented 1 year ago

@4kimov I see no benefit in increasing the limit; I never said "it would be better to increase it." The benefit of removing the check is that it simplifies the specification, although that benefit might be minimal.

I also can't imagine using this tool to generate long IDs. My main points were:

Especially regarding the latter: there used to be a need for the check, since minLength didn't work unless it was below alphabet.length. Now that this constraint has been lifted, I wonder why there's still a need to check an upper limit. I take your reply to mean that "there's a reason to check."

Okay, let's assume for a moment that the goal is to generate short IDs, which I agree with. Is the number 1_000 appropriate for that? I've repeatedly asked, "This number is just an example, right?" but I haven't received a clear answer. Intuitively, I think it's too large. I believe 256 would suffice, or perhaps a more suitable value could be determined after researching general limitations, like the maximum number of characters allowed in a URL (for GET requests). Was there any such process? If not, who will decide, and how?

If the approach is "let's just start with 1_000 and check," then okay, I accept that as the specification and will proceed with my work accordingly.

aradalvand commented 1 year ago

I also think it makes sense to just drop the check, as it's technically no longer necessary, whereas it used to be in the previous algorithm. Otherwise, any number we choose would be arbitrary and subject to "why this one?". To my understanding, there's not really an actual limit as far as the new algorithm is concerned, so the check just seems like residue from the previous algorithm, which we could simply get rid of.

> We can certainly entertain the idea of raising it or dropping it, but what would be the benefit?

Well, the benefit to dropping it is that it would simplify the spec + code, no?

4kimov commented 1 year ago

@antimon2 Sorry, didn't mean to ignore your questions. Ultimately, your last sentence was what I was trying to convey: "let's just start with 1_000 (or whatever number you guys think is more appropriate) and see how it goes". I also like 255 better, because it fits into a u8, so I think it's totally reasonable to change from 1_000 to 255, for example.

> Is there any significance to the number 1_000?

Nope. Was just a nice round number.

> Is there any significance to having a limit and checking it in the first place? (cc @antimon2)
>
> Well, the benefit to dropping it is that it would simplify the spec + code, no? (cc @aradalvand)

I'll answer these two together, as these are the things I'm concerned about:

Anyway, as for anyone asking "why such a number?", I don't have a problem answering that it was chosen as a reasonable default for the limit.


For now, I'll lower the number to 255 (I like that idea, @antimon2); that way, implementations that support u8 specifically don't even have to check whether minLength overflows.

At the end of the day, if users find some legit reason to need a bigger limit, it won't be that much work for all of us to raise it to u16 or something similar.

aradalvand commented 1 year ago

Ah, good points @4kimov, hadn't considered those. Then yes, I guess it does make sense to have it. A smaller number (255 specifically) sounds good too — in .NET, for instance, we'll be able to use byte to represent that, eliminating the need for a runtime check, as you alluded to.

antimon2 commented 1 year ago

@4kimov Thank you for your clear and detailed response. I fully understand and appreciate the points made. Your reply covers exactly what I wanted to know about these language-specific issues and universally applicable challenges, so it has been extremely valuable.

For the Julia implementation I'm working on, I will proceed with const MIN_LENGTH_LIMIT = 255. Julia does have a UInt8 type, but given the nature (or perhaps the character) of the language, it is generally preferred to use the system's Int type when representing a "length" or "size". Therefore, I will implement minLength as an Int and ensure it adheres to the condition 0 ≤ minLength ≤ MIN_LENGTH_LIMIT (this is already implemented locally).

Just for reference: in Julia there is no explicit limit on string length, and it is said to handle "a string as long as memory permits". Indeed, if one tries to generate a string of an exceedingly large size (for example, 2^53 B = 8 PB), it will be rejected with an OutOfMemoryError. Even if I increase the size incrementally, it can allocate up to about 2^33 B = 8 GB, but beyond that it depends on the PC's onboard memory, and the process will eventually be killed by the OS.

4kimov commented 1 year ago

Good. Thank you all for the comments and the feedback.

I'll merge these changes to main so they become official. I'm happy that we got to fine-tune this thing a bit better, and I'm hoping this is the last breaking change we make.

I'll update the implementations that I've done over the next few days; feel free to join me with yours. Thank you for your continued work and efforts. To make it a bit easier: the majority of the changes can be seen in this commit: https://github.com/sqids/sqids-spec/commit/dbb119a3b6a5467ff61881dc145d9f81298f72d3 (the rest are minor changes that can be seen in the main branch).

P.S: @laserpants I'll close this issue for now, but let me know if you'd like to discuss max integer encoding further.

4kimov commented 1 year ago

@aradalvand @laserpants @antimon2 @peterhellberg

Minor FYI: feel free to remove the "uniques" test from the individual repos (if you'd like). I've noticed most of us have been lowering it just so development goes quicker.

I've made the spec's uniques test a bit more comprehensive and it runs for a while. Since it tests the algorithm itself, it should be safe to remove from individual implementations.