Closed · 4kimov closed this 1 year ago
`minValue` and `maxValue`

I think `maxValue` should be an optional function. If a library is using, for example, 32-bit integers, having `maxValue` return a value that will not break the library would be a nice way to find out how big a number can be before an overflow error occurs. For libraries that use a number system that is a bit insane (the PHP library has BCMath and GMP, and JavaScript could use BigInt, whose max value would require over 2KB of memory to store!), they don't need `maxValue`.
This sounds reasonable. I am :ok: with those changes.
I am wondering about something, though: since each language implementation defines its own `maxValue` (even if it is implicit), you can currently encode a sequence of numbers in one language into an ID which then subsequently fails to decode in another language implementation (using the same alphabet and configuration). Could this be a problem?
For example, using the default config in the Haskell version I can generate an ID:

```haskell
sqids (encode [9223372036854775807])
Right "Ack7vOaIrINC"
```

But `Ack7vOaIrINC` is an invalid ID in the JS version.
Update: The below idea probably doesn't solve anything, since the fundamental problem here is more about integer precision. That is, unless the number is encoded as a string, there would need to be a suitable data type (`BigInt`) to hold the number in the first place, in which case that type can probably be used directly for encoding/decoding. Not sure how this would work otherwise:

```
encode(int)
encodeBig(BigInt)
```

So maybe not. :thinking: But nevertheless, it could be nice to distinguish, when decoding, between invalid IDs and IDs that are (technically) valid but too large for the language to handle.
Would it be possible, and useful, to support arbitrarily large numbers by introducing a preliminary step such as the following: first split the number into partitions of at most nine decimal digits (i.e., each `<= 999999999`). The number in my previous example, `9223372036854775807`, would then be partitioned as:

```
922337203
685477580
7
```

With the default alphabet, `a` becomes the separator and the remaining characters `bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789` are then used to encode these numbers. With the old algorithm, we get:

```
9DVndvc
OSFDIQz
6B
```

These IDs are then combined into the final ID, intercalating the `a`:

```
9DVndvcaOSFDIQza6B
```
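The partitioning step described above can be sketched as follows. This is a hypothetical helper, not part of the actual Sqids spec; `partition` and `CHUNK_DIGITS` are illustrative names:

```typescript
// Hypothetical sketch of the partitioning idea (not part of the Sqids spec):
// split a non-negative number's decimal digits into chunks of at most 9
// digits, so each chunk fits into a 32-bit-safe integer (<= 999999999).
const CHUNK_DIGITS = 9;

function partition(n: bigint): string[] {
  const digits = n.toString();
  const chunks: string[] = [];
  for (let i = 0; i < digits.length; i += CHUNK_DIGITS) {
    chunks.push(digits.slice(i, i + CHUNK_DIGITS));
  }
  return chunks;
}

console.log(partition(9223372036854775807n));
// → [ '922337203', '685477580', '7' ]
```

Note that a real implementation would also have to preserve leading zeros in the non-leading chunks when decoding (e.g. a middle chunk of `000000001`), which is one of the ways this idea gets complicated in practice.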
@peterhellberg Cool
@miquelfire Question for you. For the PHP version, either one of the two extensions is required, but we still use `PHP_INT_MAX` for the max value. Is my understanding correct that the value will dynamically change based on the system/extensions? If yes, it sounds like `maxValue()` would be useful. And maybe you're right: perhaps it's worth making that function optional.
@laserpants Yeah, good thought process. This was actually considered in the very first version of Hashids: what if we split up large numbers and use a dedicated separator? But then the cost slowly started outweighing the benefit. That's why we kinda rolled back into keeping the lib clean and simple, and letting the user handle the upper limits of ints.
Edit: And to answer your question about the Haskell version generating an ID that JS can't decode: to me it doesn't feel like it should be the library's issue, because if the user has two systems (one running encoding via Haskell, the other decoding via JS), they should account for scenarios like these on their end. It would be our problem if we didn't communicate clearly what each language supports.
Good morning everyone (it's morning here in Japan).
Regarding `maxValue()`:

I agree with making that function optional. In fact, in my Julia implementation, by default it processes within the range of the system's `Int` type (`Int64`), and `maxValue()` is set to return `typemax(Int)` (the maximum value of the `Int` type). However, with a custom extension, passing the option `strict=false` during configuration will remove the upper limit (using number types like `Int128` or `BigInt`), and the implementation will throw a `NoMethodError` for `maxValue()`.
As a further extension, I had considered that it might be interesting to specify a switch representing the maximum value or range of values during configuration. For example, with `mode=:compatible` it would handle only integers within the range JS (TS) can handle (53 bits); with `mode=:default` (the default setting) it would handle the range of integer types that the language can handle (not extended precision); and with `mode=:unlimited` it would utilize BigInt to remove the upper limit (this would be invalid for languages that don't support it), and so on.
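As a rough sketch of what such a mode switch could look like in a TypeScript implementation (the `Mode` type, `maxValueFor` function, and the chosen limits are purely illustrative, not part of any spec):

```typescript
// Illustrative only: a configuration switch selecting the supported range.
type Mode = 'compatible' | 'default' | 'unlimited';

// Returns the largest encodable value for a given mode, or undefined
// when there is no upper limit (BigInt-backed arithmetic).
function maxValueFor(mode: Mode): bigint | undefined {
  switch (mode) {
    case 'compatible':
      return BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1: safe across JS/TS
    case 'default':
      return 2n ** 64n - 1n; // e.g. a native unsigned 64-bit ceiling
    case 'unlimited':
      return undefined; // no limit; requires BigInt or equivalent
  }
}
```

An implementation could then compare each input number against `maxValueFor(config.mode)` before encoding, skipping the check entirely in the unlimited case.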
I will check the new code and specifications when I have time.
`PHP_INT_MAX` is from before Hashids required one of the two extensions. What it's defined as depends on whether PHP was compiled as 32-bit or 64-bit. None of the `bcmath` or GMP code will even need to look at it. Though, if the value passed to the encode function is above `PHP_INT_MAX`, it might convert to float and lose precision if it wasn't a string value to begin with.
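The same precision hazard exists in JavaScript, for comparison: integers above `Number.MAX_SAFE_INTEGER` silently lose precision in plain `number` arithmetic, while `BigInt` values stay exact.

```typescript
// JS/TS analog of the PHP float-precision issue: above 2^53 - 1, plain
// numbers can no longer represent every integer exactly.
const b = 2 ** 53;     // 9007199254740992, one past Number.MAX_SAFE_INTEGER
const a = 2 ** 53 + 1; // not representable as a float; rounds back to 2 ** 53
console.log(a === b);  // true: precision silently lost

const exact = 2n ** 53n + 1n; // BigInt arithmetic stays exact
console.log(exact === 2n ** 53n); // false
```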
> That's why we kinda rolled back into keeping the lib clean and simple, and letting the user handle upper limits of ints.
Thanks for clarifying this, @4kimov. I had a feeling that this was something you'd thought about already. I agree that it should be up to the user to account for these inconsistencies. But doesn't this mean that we still need to make it possible to distinguish between invalid IDs and IDs that fail to decode because of overflow? What if the implementation of `toNumber` had something like the following:
```typescript
private toNumber(id: string, alphabet: string): number {
  const chars = alphabet.split('');
  return id.split('').reduce((a, v) => {
    const u = a * chars.length + chars.indexOf(v);
    // Bail out as soon as the accumulated value exceeds this
    // implementation's maximum encodable number.
    if (u > this.maxValue()) {
      throw new Error('Number out of range');
    }
    return u;
  }, 0);
}
```
In client code, one could then do:
```typescript
function checkId(id: string) {
  try {
    const numbers = sqids.decode(id);
    if (!numbers.length) {
      console.log('Invalid ID');
    } else {
      console.log(numbers);
    }
  } catch (e) {
    // Number out of range
    console.error('Error: ', e);
  }
}
```
```typescript
checkId("8QRLaD")       // [ 1, 2, 3 ]
checkId("adfadsfa")     // Invalid ID
// ID from my previous example, encoded via Haskell
checkId("Ack7vOaIrINC") // Error: Number out of range
```
If this creates a performance hit, then another option is to not do this check in `toNumber`, and instead introduce a new method `canDecode(id)`. Users can then choose to test their IDs for overflow prior to decoding, if this is relevant to their use case. For example:
```typescript
if (sqids.canDecode(id)) {
  const numbers = sqids.decode(id);
  // Do stuff with numbers
} else {
  // At least one number in the sequence is too large for this language.
}
```
I'm a bit concerned about this turning into an XY problem: do we even want to be able to decode from any unknown source? My understanding is that you are primarily meant to decode "your own" IDs, making the cross-language argument a bit less critical. (Since you then fully control what numbers you are encoding into your IDs.)
Also, we are still talking about unsigned integers, right? From the point of view of Go, that would be being able to encode/decode `[]uint64`, which might not be fully representable in, let's say, JavaScript (without resorting to `BigInt`, that is).
Another approach would be to decide (in the spec) on some lowest common denominator for the maximum value all implementations should be able to encode/decode, and then not support languages with smaller (unsigned) integer types. This is not an avenue I'd prefer to go down, but thought I should mention it at least.
@peterhellberg: One example would be if you have a solution stack where the back-end application generates IDs from (potentially) large integers, and these IDs are then passed via the URL down to a front-end app, written in another language. Clearly, there is a bug here if the back-end is generating IDs from integers that are too large for the front-end to handle, but it is still useful to be able to distinguish between this situation and IDs that are algorithmically invalid.
This is absolutely genius and perfect. I'm totally on board. Solves precisely the few qualms I had with the previous algorithm (i.e. rough min length handling and growing blocked ID re-generation); and does so brilliantly. Thank you @4kimov! Pure perfection.
Removing minLength
and maxLength
(or at least making them optional) also sounds like a good idea; having worked on the .NET implementation, I would say it did feel like they were quite useless.
I'll start re-working sqids-dotnet as soon as this gets finalized.
@antimon2 Ok, I see. Yes, it sounds like `minValue`/`maxValue` should be optional, so it looks like we're leaning in the direction of leaving them out of the spec. Introducing a compatibility mode might be tricky, because JS is not the leading implementation (it's just one of many), so you'd have to either pick one "to look up to" or assume there is a max limit in the spec (and there's not). IMO, ideally each implementation would either a) support the highest int possible, or b) support a range of ints, if the language allows.
@miquelfire I see, makes sense. Sounds like, from the 32/64 argument alone, the PHP implementation might benefit from exposing `maxValue` to the user (again, voluntarily). Thank you for elaborating.
@laserpants Before we discuss overflow, I figured it's important to clarify: In the playground when it says "Error: Invalid ID" - it's a very poor description of a message that comes from re-encoding decoded numbers, and seeing that it doesn't match the input ID. Basically a simple way of saying "something went wrong, but we don't know what".
@peterhellberg Like @laserpants has mentioned, I've also seen people talk about how they might encode with one language, and decode with another. I wouldn't be surprised to find these use-cases in larger systems.
@laserpants Thanks for the code samples, they were helpful. First of all, I don't think the `maxValue` check would be that big of a performance hit (at least in JS). Second of all, right now in JS only `encode` can throw an error, and `decode` never does (it just returns an empty array), so I do like the fact that this would make the two APIs more uniform. Also a fan of giving more descriptive error messages. At the same time, not all languages would be able to throw that error, because in many you don't have access to "max value + 1", so that extra check is not an option.
If I were to design a split system like above, and if my decoding microservice would get the "Number out of range" error message, there's nothing I'd be able to programmatically do to address the issue (other than notifying APM / DevOps). That means my encoding microservice went over the boundaries of what my decoding microservice is capable of handling, which theoretically should be my fault for designing it this way. IMO, the proper way to design it would be to set an upper limit on the encoding microservice outside of the library - up to the number the decoding microservice can support.
So, the above makes me think that it'd be helpful in finding out what the error is, but not much else. Which kind of circles me back to proper documentation on int limits. I'd appreciate some feedback on this in case I'm missing the point.
@aradalvand Thank you for the kind words, and the efforts.
@4kimov My point was not that you would use a single language, but rather that you would be able to control the range of numeric values encoded into ids used in your own system, even if using more than one language.
With the logic we're using for the PHP library (I just looked at the code), JavaScript might want the `maxValue` function to return `Number.MAX_SAFE_INTEGER` even if it used BigInt, to avoid overflow issues.
@peterhellberg Ah I see, sorry, I misunderstood. Good, thanks for clarifying.
@miquelfire Yeah, BigInt in JS is something I might need help with.
Just pushed a few more cleanup items related to `minLength`.

I've reviewed the new code and specifications thoroughly, and I largely agree with them. I also understand and concur with the philosophy that the upper limit of a value should be set outside the library.
@4kimov There's just one point I'd like to raise regarding the code (and the specifications presented at the beginning of this issue). There's a mention of increasing the min length limit (1_000, for example), and I've noticed the magic number `1_000` written in the code and tests. To be honest, this feels awkward. I believe this number doesn't have any special significance (I understand it's just an example). Wouldn't it be more appropriate to define a static constant indicating the max limit for the minimum length (e.g., `const MIN_LENGTH_LIMIT = 1_000;`) and use that instead? If, for any reason, the specification changes and the value needs to be adjusted, we'd only have to modify the constant's definition.
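A minimal sketch of this suggestion (the validation function and its error message are illustrative, not quoted from the spec):

```typescript
// Named constant instead of a magic number scattered through code and tests.
const MIN_LENGTH_LIMIT = 1_000; // upper bound for the minLength option

// Illustrative validation using the constant; if the spec value ever
// changes, only the definition above needs to be edited.
function validateMinLength(minLength: number): void {
  if (
    !Number.isInteger(minLength) ||
    minLength < 0 ||
    minLength > MIN_LENGTH_LIMIT
  ) {
    throw new Error(`minLength must be between 0 and ${MIN_LENGTH_LIMIT}`);
  }
}
```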
@antimon2 Thank you for reviewing. Do you mean that the variable should be set in one place for the purposes of spec's readability & cleanliness, or for the purposes of exporting it to the end-user so they have access to it?
@4kimov It's the former. I think there's no need to export it (I forgot to mention that earlier).
@antimon2 Ah ok. Yeah, that spec repo is optimized for readability, so it'll have a few awkward places I'm sure. I cleaned it up a bit (it's set twice only because I don't want to export it from the library to give the impression it's needed externally).
@4kimov I misunderstood one thing. The sqids-spec code is merely a specification, so it doesn't necessarily prioritize practicality or performance, correct?

One more point to confirm: is `MIN_LENGTH_LIMIT` necessary?

At the beginning of this issue, an improvement is described as follows:

> 7. Increase min length limit from alphabet-length to an arbitrary value

It is annotated with "(1_000 for example)", but as I've reiterated, this is just an example, right? Is there a need to set a specific value? My understanding is that the practical upper limit for `minLength` has effectively been removed in the algorithm. So shouldn't this also be made optional, like `maxValue()`? In other words, can we see it as follows?:
@antimon2 My train of thought was simple: I can't imagine anyone needing super long IDs. Especially, since the library's goal is to generate short IDs, not long ones. So, I'm not even sure why someone would set it to 1_000 (to me personally, it becomes useless after 6 or 8).
We can certainly entertain the idea of raising it or dropping it, but what would be the benefit?
@4kimov I see no benefit in increasing the limit; I never said it would be better to increase it. The benefit of removing the check is that it simplifies the specification, although that benefit might be minimal.
I also can't imagine using this tool to generate long IDs. My main points were:

- Is there any significance to the number `1_000`?
- Is there any significance to having a limit and checking it in the first place?

Especially regarding the latter: there used to be a need to check, since `minLength` didn't work if it wasn't below `alphabet.length`. Now that this constraint has been lifted, I wonder why there's still a need to check the upper limit.
I think your reply means that there is a reason to check. Okay, let's assume for a moment that the goal is to generate short IDs, which I agree with. Is the number `1_000` appropriate for that? I've repeatedly asked, "This number is just an example, right?", but I haven't received a clear answer. Intuitively, I think it's too large. I believe `256` would suffice, or perhaps, after researching general limitations (like the maximum number of characters allowed in a URL for GET requests), a more suitable value could be determined. Was there any such process? If not, who will decide, and how?
If the approach is "let's just start with `1_000` and check", then okay, I accept that as the specification and will proceed with my work accordingly.
I also think it makes sense to just drop the check as it's technically no longer necessary — whereas it used to be in the previous algorithm. Otherwise, any number we choose would be arbitrary and subject to "why this one?". To my understanding, there's not really an actual limit as far as the new algorithm is concerned, so the check just seems like some residue from the previous algorithm, which we could simply get rid of.
> We can certainly entertain the idea of raising it or dropping it, but what would be the benefit?
Well, the benefit to dropping it is that it would simplify the spec + code, no?
@antimon2 Sorry, didn't mean to ignore your questions. Ultimately, your last sentence was what I was trying to convey: "let's just start with 1_000 (or whatever number you guys think is more appropriate) and see how it goes". I also like 255 more, because it can fit into a `u8`, so I think it's totally reasonable to change from `1_000` to `255`, for example.
> Is there any significance to the number 1_000?

Nope. Was just a nice round number.
> Is there any significance to having a limit and checking it in the first place? (cc @antimon2)

> Well, the benefit to dropping it is that it would simplify the spec + code, no? (cc @aradalvand)

I'll answer these two together, as these are the things I'm concerned about:
- One library will support `u128` and another `u64` for that param, and now a user is stuck because they need the other one raised to `u128` too (and the spec doesn't have any guidance on this). And where does it end, `u128`?
- If I set `minLength` to one million, almost any number I try to encode gets the max generation attempts error, so I can already see incoming GitHub issues on that front.
- If I set `Number.MAX_SAFE_INTEGER` for `minLength`, it eventually throws an `Invalid string length` error, and I honestly don't know what that's about (probably something overflows somewhere).

Anyway, as far as anyone asking "why such a number?" goes: I don't have a problem answering that it was chosen as a reasonable default for a limit.
For now, I'll lower the number to `255` (I like that idea, @antimon2); that way, the implementations that support `u8` specifically don't even have to check whether the `minLength` overflows.

At the end of the day, if users find some legitimate reason to need a bigger limit, it won't be that much work for all of us to raise it to `u16` or something similar.
Ah, good points @4kimov, hadn't considered those. Then yes, I guess it does make sense to have it.
A smaller number (255 specifically) sounds good too; in .NET, for instance, we'll be able to use `byte` to represent that, eliminating the need for a runtime check, as you alluded to.
@4kimov Thank you for your clear and detailed response. I fully understand and appreciate the points made. Your reply covered exactly what I wanted to be informed about, both the language-specific issues and the universally applicable challenges, so it has been extremely valuable.
For the Julia implementation I'm working on, I will proceed with `const MIN_LENGTH_LIMIT = 255`. Julia does have a `UInt8` type, but given the nature (or perhaps the character) of the language, it is generally preferred to use the system's `Int` type when representing a length or size in Julia. Therefore, I will be implementing `minLength` with the type `Int` and ensuring it adheres to the condition `0 ≤ minLength ≤ MIN_LENGTH_LIMIT` (this is already implemented locally).

Just for reference: in Julia there is no explicit limit on string length, and it is said that a string can be as long as memory permits. Indeed, if one tries to generate a string of an exceedingly large size (for example, 2^53 B = 8 PB), the library will reject it and throw an `OutOfMemoryError`. Even if I incrementally increase the size, it can allocate up to about 2^33 B = 8 GB, but beyond that it depends on the PC's onboard memory and will eventually be killed by the OS.
Good. Thank you all for the comments and the feedback.
I'll merge these changes to `main` so they become official. I'm happy that we got to fine-tune this thing a bit better, and I'm hoping that this is the last breaking change we make.
I'll update the implementations that I've done over the next few days, feel free to join me with yours. Thank you for your continued work and efforts. To make it a bit easier - majority of the changes can be seen in this commit: https://github.com/sqids/sqids-spec/commit/dbb119a3b6a5467ff61881dc145d9f81298f72d3 (the rest are minor changes that can be seen in the main branch).
P.S: @laserpants I'll close this issue for now, but let me know if you'd like to discuss max integer encoding further.
@aradalvand @laserpants @antimon2 @peterhellberg
Minor FYI: feel free to remove the "uniques" test from the individual repos (if you'd like). I've noticed most of us have been lowering it just so development goes quicker.
I've made the spec's uniques test a bit more comprehensive and it runs for a while. Since it tests the algorithm itself, it should be safe to remove from individual implementations.
Since the initial library has been out for a few weeks, I got to mess around with it a bit more and realized there are a few rough edges we could iron out:

- `minValue` and `maxValue` are not that useful, let's remove? (cc @peterhellberg)

Pros:
Cons:
Unknowns:
Current: https://sqids.org/playground (algorithm code / explanation)
Proposed: https://sqids.org/playground/simpler (algorithm code / explanation)
P.S: Also worth noting that I have no intention of changing the algorithm over and over; I figured this might be cleaner + address some current issues/questions, while we're still technically mostly pre-prod.
Thoughts/feedback?
cc @laserpants @vinkla @niieani @antimon2