Support for larger version numbers

Nemo157 commented 4 years ago

While testing this against an existing database I ran into issues because of the 31-bit limit. There were two main culprits I found:

people publishing with embedded timestamps (1.0.20190709015154)
stress testing (999999999.999999999.9999999991, 12345678901234567890.12345678901234567890.12345678901234567890)

What are your thoughts on providing support for larger version numbers, maybe by having a separate semver64 data type?

theory commented 4 years ago

I have no objection if someone wanted to make something like that; I don't need it, myself (and if I ever did want to use a timestamp I'd use epoch time).

jwdonahue commented 4 years ago

All numeric fields should be treated as a string of decimal numbers, not converted to fixed width binary data types. The whole point of the language in the spec, is that the ASCII characters [0..9] sort correctly without conversion to integer values.
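
A sketch of what comparing numeric fields as digit strings could look like. One caveat: a plain byte-wise comparison on its own puts "10" before "9", so the lengths have to be compared first. That works because the spec forbids leading zeros in numeric identifiers, so more digits always means a larger number. The name `cmp_numeric_ident` is hypothetical, and the function assumes its inputs were already validated.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comparator: orders two SemVer numeric identifiers given as
 * digit strings. Assumes both are valid (digits only, no leading zeros).
 * Returns -1, 0, or 1. */
static int cmp_numeric_ident(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);

    if (la != lb)
        return la < lb ? -1 : 1;       /* more digits means a larger number */

    int c = memcmp(a, b, la);          /* equal length: byte order is numeric order */
    return (c > 0) - (c < 0);
}
```

No field width is fixed anywhere, so `12345678901234567890` compares just as easily as `7`.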

troian commented 4 years ago

It makes sense to remove the 31-bit limitation, but just curious: why not put the timestamp into the metadata? For example, 1.0.2+20190709015154 looks more convenient compared to 1.0.20190709015154. Moreover, it keeps patch increments in line with the specification: the next bump will be +1, not a timestamp value.
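
One consequence of the metadata approach is worth spelling out: SemVer 2.0.0 ignores build metadata when determining precedence, so two builds that differ only in their `+timestamp` suffix rank as equal. A minimal sketch, using a hypothetical helper name and assuming well-formed input:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when two version strings are textually identical
 * once any "+build" suffix is dropped, and therefore have equal precedence. */
static bool same_ignoring_build(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strcspn(a, "+");       /* length up to '+' or end of string */
    size_t lb = strcspn(b, "+");

    return la == lb && memcmp(a, b, la) == 0;
}
```

So `1.0.2+20190709015154` and `1.0.2+20200101000000` cannot be ordered by their timestamps, whereas `1.0.20190709015154` and `1.0.20200101000000` can.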

Regarding treating the fields as actual strings and comparing them: even though the spec implies using strings because of their natural ordering, I believe doing so will impact performance. Comparing 999999999 > 999999998 as ints versus as strings makes a huge difference. In the first case it is a simple cmp CPU instruction; with strings it has to call strcmp, which is way slower. Then multiply that by a couple million records (surely edge case, who gonna have 1m version in the database)

Nemo157 commented 4 years ago

These are examples of versions I found in our database ultimately derived from published Rust crates. You would have to ask the developers of susudb why they chose such a versioning system.

jwdonahue commented 4 years ago

You generally see time stamps in the patch field for developer and test build systems, but they also have a -a.Dev or -a.Test prerelease tag.

troian commented 4 years ago

Looks odd tbh, and unfortunately, it does not contradict the spec. But for developers and tests, there are 2 fields, prerelease and build (aka meta), that cover all those cases.

jwdonahue commented 4 years ago

In the absence of a centralized version string server, it's the best way to avoid collisions.

theory commented 4 years ago

> All numeric fields should be treated as a string of decimal numbers, not converted to fixed width binary data types. The whole point of the language in the spec, is that the ASCII characters [0..9] sort correctly without conversion to integer values.

Yeah, that would require a pretty significant revamp, as data is stored in databases in the current format:

https://github.com/theory/pg-semver/blob/e8e0b11a38d0e230cdaf7938851ded6af9d29f63/src/semver.c#L54-L63

Upgrading would be a PITA for existing implementations, I would think. Cleanest approach would be to create a new data type, as @Nemo157 originally suggested, but using text instead, as you suggest, @jwdonahue.

I think I can get most of the other open issues cleaned up for the current design, though.

jwdonahue commented 4 years ago

@troian wrote: (surely edge case, who gonna have 1m version in the database)

To date, Microsoft has built more than 20K versions (probably a lot more) of the NT Kernel that Windows is based on (I've worked there off and on since 1987 and most of the last two decades). Now that they got the continuous delivery bug, that number has begun to grow exponentially. They do not use SemVer, they have historically used build numbers, but my point is, they will easily surpass a million versions in the next decade. My current build number is 18363.778. Notice the number of digits? For 18363, there were 778 revisions built internally, distributed to dog-fooders (ring 0 flight testing), then out to the various rings of their "insider's" program.

Everyone is moving to CI/D these days. There are online APIs that get rev'd multiple times per day. For each of those revs, there were anywhere from a handful of internal builds to hundreds of them. The old ways of only bumping the product version after marketing has approved it are gone. Today, we build product, publish it to our test labs, then publish it to our early internal flight systems, then take it out to the "outer rings" where random samples of customers get different feature mixes and our AIs monitor customer feedback, crash dumps, remote telemetry, etc.

There are hundreds of companies in the world today, knocking out new versions of their services/products every few minutes. Many of them have parallel CI/D systems capable of producing dozens of versions every few minutes. Most SemVer implementations do not scale to that level, because of short-sighted thinking.

Whatever you think is enough versions today, it won't be tomorrow. I am old enough to remember when Microsoft decided that 640K of memory was all anybody would ever need (my home brew systems had between 2 and 64K at that time). The SemVer spec doesn't say the fields are int16 or even uint128, it says they are numeric strings. There's a reason for that. Using the Arabic numerals 0..9, it is possible to write down any number imaginable (given enough time or automation).

The cool thing about strings is, you can compress them. Numeric fields can easily be converted to base 64, 128 or 256. There are large number libraries for C that can do even better and because they convert the string to a packed binary format, you can compare chunks of them in 32, 64 or 128 bit registers. For those who don't need really big numeric fields, there's no perf hit.
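
A rough sketch of the "no perf hit for small fields" idea. This is not a large-number library and does not do base-256 packing; it is a simpler stand-in that parses each field once, keeps a 64-bit binary value when the field has at most 19 digits, and falls back to the digit string otherwise. The `ident` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Any decimal string of at most 19 digits fits in a uint64_t. */
#define FAST_DIGITS 19

/* Hypothetical parsed form of one numeric identifier (illustration only). */
typedef struct {
    const char *digits;  /* original digit string, not owned */
    size_t      len;     /* number of digits */
    uint64_t    small;   /* binary value, meaningful only if len <= FAST_DIGITS */
} ident;

/* Parse once; assumes `digits` holds only '0'..'9' with no leading zeros. */
static ident ident_make(const char *digits)
{
    assert(digits != NULL);
    ident id = { digits, strlen(digits), 0 };

    if (id.len <= FAST_DIGITS)
        for (size_t i = 0; i < id.len; i++)
            id.small = id.small * 10 + (uint64_t) (digits[i] - '0');
    return id;
}

/* Returns -1, 0, or 1. */
static int ident_cmp(const ident *a, const ident *b)
{
    /* Fast path: both values fit in a register, so compare as integers. */
    if (a->len <= FAST_DIGITS && b->len <= FAST_DIGITS)
        return (a->small > b->small) - (a->small < b->small);

    /* Slow path: at least one value is huge; compare as digit strings. */
    if (a->len != b->len)
        return a->len < b->len ? -1 : 1;
    int c = memcmp(a->digits, b->digits, a->len);
    return (c > 0) - (c < 0);
}
```

Typical versions never leave the integer path, and only fields such as `12345678901234567890` pay for the string comparison.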

I would add that most modern databases are very good at sorting strings. Probably better than anything you can do with a hand coded C plug-in. Unless you're really dedicated to squeezing out every last bit of perf, in which case, your code is going to be ugly and hard to maintain.

First make it correct, then profile it for performance bottlenecks, then optimize wherever it makes sense.

troian commented 4 years ago

Just to make myself clear: unfortunately, the proposal does not contradict the specification, so frankly, the implementation has to support it in full. What I'm trying to say is that quite often these days people are trying to arrange a marriage between an alligator and a bus stop.

theory commented 4 years ago

I concur with @jwdonahue here, and wish now I'd had that understanding earlier on, but in truth, between @samv, @tdavis, @maspalio and myself, I don't think any of us had an inkling. int32 seemed plenty big for all the versions one would ever need. 🤦.

At any rate, I'm going to put some time into getting the full SemVer corpus modulo larger numbers working with the current design today. Then we need to have a discussion on how to update the internal storage to support larger numbers and store them as text (or NUMERIC?). Some questions off the top of my head:

1. Is there an existing blessed C implementation we can (re)build on? Maybe /h2non/semver.c?
2. Would changing the internal storage format require a new implementation of the data type? I'm thinking maybe not, given that it's defined as a pointer to a variable length data structure:
   ```
   #define PG_GETARG_SEMVER_P(n) (semver *)PG_GETARG_POINTER(n)
   ```
3. If not, can the new format be used for new semvers and continue to recognize and properly handle existing semvers?

I think this might all be doable, but my C knowledge is minuscule (the original implementation was a domain with a PL/pgSQL validation function), and my time is rather limited these days. I'd super appreciate any and all assistance; I do want this thing to be as correct as possible.

jwdonahue commented 4 years ago

Does your current data blob have a version tag embedded in it? Designing for future back-compat generally requires some way to identify which version of code should process the data. If not, all isn't lost, as you can design your new data structure to have a signature that is unlikely to collide with any of your current data.

In your case, you could zero out vllen for all new records and follow that with a string of the form "pgSem2" that identifies the version of the record, followed by another unsigned integer field for length. Signed values for length aren't very useful.
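
The layout described above could be sketched roughly as follows, in plain C without any PostgreSQL headers. The struct, its field names, and both functions are hypothetical and are not pg-semver's actual format. One caveat to keep in mind: in PostgreSQL the leading length word of a variable-length type is the varlena header, which the server itself depends on, so a real implementation would probably have to place the marker after that header rather than zero it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEMVER2_MAGIC     "pgSem2"
#define SEMVER2_MAGIC_LEN 6

/* Hypothetical header for a new-format record (illustration only). */
typedef struct {
    int32_t  legacy_len;                 /* always 0 in new-format records */
    char     magic[SEMVER2_MAGIC_LEN];   /* "pgSem2", no terminator */
    uint32_t text_len;                   /* bytes of version text that follow */
} semver2_header;

static void semver2_header_init(semver2_header *h, uint32_t text_len)
{
    assert(h != NULL);
    memset(h, 0, sizeof *h);             /* zeroes legacy_len and any padding */
    memcpy(h->magic, SEMVER2_MAGIC, SEMVER2_MAGIC_LEN);
    h->text_len = text_len;
}

/* True when the blob starts with a zero length word followed by the magic. */
static bool is_semver2(const void *blob, size_t blob_len)
{
    semver2_header h;

    if (blob == NULL || blob_len < sizeof h)
        return false;
    memcpy(&h, blob, sizeof h);          /* copy first to avoid unaligned reads */
    return h.legacy_len == 0
        && memcmp(h.magic, SEMVER2_MAGIC, SEMVER2_MAGIC_LEN) == 0;
}
```

Code reading a stored value would call `is_semver2` first and hand anything that fails the check to the existing fixed-width path.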

jwdonahue commented 4 years ago

@troian, see VersionMeta and VersionSchema. I have been working off and on for the past couple of years or so, to provide tooling that lets us adapt our versioning schemes to our processes and still make them human and machine readable.

jwdonahue commented 4 years ago

@theory, I'll take a look at your code later today. I am currently semi-retired, but also now the primary day-care provider for my 6 month old grand-daughter. My daughter is a pediatrician on the front-lines these days. It's very difficult for me to focus on anything for more than a highly variable nap time.

theory commented 4 years ago

Oh wow @jwdonahue, your daughter is a frigging hero, and so are you! No rush, I started this thing in 2010 after all. :-)

theory commented 4 years ago

Added the test corpus and fixed the issues there in PR #50 (which also includes the changes from PR #49).

maspalio commented 4 years ago

About h2non/semver.c: it sounds like that project is also facing the legacy int vs. proposed uint32 question for storage. Please refer to h2non/semver.c/issues/26 for more.

bigsmoke commented 1 year ago

In case anybody needs bigger versions and is willing to sacrifice some speed, I've just released a similar extension with a simple text domain-based semver type (including the requisite amenities to compare versions): https://github.com/bigsmoke/pg_text_semver

theory / pg-semver

Support for larger version numbers #47