theory / pg-semver

A semantic version data type for PostgreSQL

Support for SemVer 1.0.0-beta should be dropped. #52

Closed jwdonahue closed 4 years ago

jwdonahue commented 4 years ago

Anyone who took a dependency on v1-beta should have long ago upgraded to 1.0.0 of the spec. It's been more than 10 years. Converting any existing v1-beta style records to v1 conformance should be trivial, and while it might break some things, that's highly unlikely and would be easy to work around.

With that out of the way, it leaves us with the one real defect in SemVer: the lack of any unambiguous means to differentiate between a SemVer 1.0.0 string and a SemVer 2.0.0 string. But that's trivial to handle, because v1 is a subset of v2, so going forward you only deal with v2 of the standard.

Supporting v1-beta adds a lot of extra complexity to an already complex parsing problem. Complexity == bugs and higher maintenance costs. Let's kill it.

jwdonahue commented 4 years ago

So, I am working on some C code that could eventually be rolled into this project and make it easier for me to solve your evolutionary issues around the choice of int for the version triple fields. I will not support v1-beta; it just adds too many test cases and complicates the code.

theory commented 4 years ago

Makes sense to me.

maspalio commented 4 years ago

Why not just use PCRE? I mean, Joseph++ seems to have authored a perfectly good one. Just thinking out loud...

jwdonahue commented 4 years ago

A regex doesn't go everywhere that plain old C can go. The only reason I got involved in the regex effort was that there were so many SemVer regexes out there that were just plain wrong. There are risks involved with using a regex. The biggest one is that it's possible to write a regex that just won't ever terminate on some inputs, or you have to use an implementation that has a built-in timeout. So they are inherently unreliable and virtually unverifiable.

Even for the regex we came up with for SemVer, we had some ridiculously long test cases, because those were the easiest way we could find to exploit certain non-determinisms across all the implementations we were testing against at the time. Combined with the test oracles, I am confident that they are a good point of reference, and should dispel the idea that a SemVer regex is trivial (it's definitely not). I am never confident that any regex is fit for mission-critical code, despite my use of them on dozens of occasions. I do try to limit their use to near-trivial cases as much as possible.

That said, for database code, I would simply split the string into its parts and store each of them in string fields of a database record. It just happens that with the limited ASCII/UTF-8 codes allowed in a SemVer string, numeric fields also sort correctly lexically. Basically, the ASCII/UTF-8 code for '1' is just 48 + 1 in binary, so as long as all the numeric fields are compared to each other in their original form, they sort correctly. Because '9' is less than any alpha character, digits sort lower. In other words, provided the no-leading-zeros rule for numeric fields is adhered to, you can run your sorts lexically by field (essentially discarding the dots), which just happens to be what database engines do with ease. At least, that's my current working theory. That just leaves us with the task of parsing efficiently. I think it depends on the database capabilities whether a plugin module or a stored procedure should be used.

And there's still the pure C environments where resorting to SQL statements or a regex is out of the question. All of the C semver implementations I can find parse the string into a struct of binary and string fields, so I think I will provide an alternative implementation and test my operative theories. If all works out well, I can then re-implement it in C++ and C# as well. It's just about the right level of effort and concentration for my current working environment (day care provider for my 6-month-old granddaughter).

maspalio commented 4 years ago

Thank you very much, Joseph, for your very detailed answer; I appreciate it. Makes sense, of course. Kind of in line with the "Regular Expressions: Now You Have Two Problems" mantra. Also, from a security perspective, this would add an extra dependency with its truckload of CVEs. We definitely do not want to add this extra burden and hazard for DB admins.

theory commented 4 years ago

I think it depends on the database capabilities whether a plugin module or a stored procedure should be used.

The beautiful thing about Postgres extensions is you can use, say, a C library with functions that do what you need, and map the function calls you need to Postgres functions.

so I think I will provide an alternative implementation and test my operative theories. If all works out well, I can then re-implement it in C++ and C# as well.

And we'd be able to just use it here, I think, especially if it's just a single C file. Might be worthwhile to replace the guts of the Perl Semver module, too.

It's just about the right level of effort and concentration for my current working environment (day care provider for my 6 month old grand-daughter).

Nice bit of distraction when you have a spare 30m here and there from your much more important day job! Thanks for all your effort and knowledge here, @jwdonahue.

theory commented 4 years ago

In other words, provided the no leading zeros in numeric fields rule is adhered to, you can run your sorts lexically by field (essentially discarding the dots), which just happens to be what database engines do with ease.

Perhaps I'm missing something, but given two versions such as 3.0.9 and 3.0.10, I would expect the first to be less than the second, but lexically (dots discarded) it's not:

$  perl -E 'say "309" lt "3010" ? "yes" : "no"'
no

Unless you also consider string length, which I assume would be the plan:

$  perl -E 'say length "309" < length "3010" || "309" lt "3010" ? "yes" : "no"'
yes

Apologies for the pedantry, just want to be sure I understand your thinking.

jwdonahue commented 4 years ago

Yes, I am classifying and parsing in a single function that produces an intermediate record that can then be used to compare metadata describing the string layout, or used to quickly split it into fields. The concept is brain-dead simple and the algorithm is O(n), but the code complexity of the function is off the charts. The parser/classifier is a big state machine, implemented as a giant switch statement with ten states/cases (~300 lines of code, with just enough vertical white space and plenty of comments). I haven't run a complexity analysis on it, but I am guessing it would come out at around 200..300 on the McCabe scale. I normally try to limit functions to <7, but this is a special case.

When a field matches in size, I go in and compare the field content as efficiently as possible. Just going to make it work while providing some compiler-optimizer opportunities and lots of chances for instruction pipeline reordering and speculative branch execution. Then later, if time permits, I'll try perf-testing some alternative approaches.


I would add that you have to compare each field, not the whole string. The spec is very clear about that, and if you pencil out some scenarios, you'll quickly discover why. Consider "10.0.1" and "1.0.10". Field lengths are key to avoiding character-by-character comparisons. Hence, I would have a database table with 5 columns, and prerelease would be a key to another table where the individual fields would be stored, one per row (with sequence numbers).
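
For concreteness, here is a minimal sketch of that length-then-lexical compare for a single numeric field (the function is illustrative only, not part of pg-semver):

#include <string.h>

/* Compare two numeric fields (no leading zeros) without converting them to
 * integers: a shorter field is always smaller, and equal-length fields
 * compare correctly byte for byte. */
static int
compare_numeric_field(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return alen < blen ? -1 : 1;
    return memcmp(a, b, alen);
}

Applied field by field, "1.0.2" sorts before "1.0.10", and "1.0.10" sorts before "10.0.1", which is what the spec requires.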

theory commented 4 years ago

I haven't run a complexity analysis on it, but I am guessing it would come out at around 200..300 on the McCabe scale. I normally try to limit functions to <7, but this is a special case.

Parsers, ISTM, can be quite complex.

I assume there's one function to parse a semver, and another that compares two semvers, yes?

jwdonahue commented 4 years ago

Yes.

Am I to understand that you have access to a database of representative version strings? I'd like to have some analysis on the number of major, minor, and patch versions that have 1, 2, 3, etc., characters. I am guessing there are a lot of one- or two-digit majors, plenty of two- or three-digit minors, and lots of three- or four-digit patch numbers? What's the average number of prerelease and buildmeta dot-separated identifiers? One or two for prerelease perhaps? I am basically interested in a frequency distribution for tuning purposes.

theory commented 4 years ago

I do, from PGXN, but not that many:

$ select count(version), version from (select version from distributions union all select ext_version from distribution_extensions) f(version) group by version;
 count │    version     
───────┼────────────────
     4 │ 2.3.10
     3 │ 3.0.10
     2 │ 0.9.10
     2 │ 0.0.10
     2 │ 1.2.13
     2 │ 1.4.13
     2 │ 3.0.13
     1 │ 1.6.2
     1 │ 0.20.2
     5 │ 0.7.2
     6 │ 1.5.2
     2 │ 4.2.2
     2 │ 2.6.2
     6 │ 2.3.2
     4 │ 1.8.2
     2 │ 3.7.2
     4 │ 3.1.2
     2 │ 0.8.2
     7 │ 0.4.2
     2 │ 2.1.2
    14 │ 1.4.2
     2 │ 0.13.2
     4 │ 1.7.2
     8 │ 1.1.2
    28 │ 1.2.2
     2 │ 3.6.2
     4 │ 2.2.2
    11 │ 0.9.2
     2 │ 9.2.2
     2 │ 8.3.2
     2 │ 8.2.2
     2 │ 7.1.2
     2 │ 7.0.2
     2 │ 6.2.2
     2 │ 5.2.2
     5 │ 2.0.2
    11 │ 0.3.2
     9 │ 0.2.2
    44 │ 0.1.2
    74 │ 0.0.2
    26 │ 1.3.2
    50 │ 1.0.2
     2 │ 2.0.99
     2 │ 1.2.12
     2 │ 1.4.12
     2 │ 3.0.12
     2 │ 0.0.12
     2 │ 1.1.5
     4 │ 2.3.5
     2 │ 1.7.5
     2 │ 1.8.5
     6 │ 1.4.5
     2 │ 0.4.5
     2 │ 0.3.5
     2 │ 0.10.5
    19 │ 0.0.5
     4 │ 1.2.5
     5 │ 0.2.5
    14 │ 1.0.5
     6 │ 0.9.5
     9 │ 0.1.5
     6 │ 1.3.5
     2 │ 1.1.4
     4 │ 1.5.4
     2 │ 2.6.4
     2 │ 2.3.4
     2 │ 0.4.4
     2 │ 0.3.4
     2 │ 3.0.4
     8 │ 1.4.4
     2 │ 0.10.4
     2 │ 0.8.4
     4 │ 1.7.4
    24 │ 0.0.4
     8 │ 1.2.4
     8 │ 0.2.4
    17 │ 1.0.4
     5 │ 0.9.4
     2 │ 9.2.4
    27 │ 0.1.4
    10 │ 1.3.4
     2 │ 1.7.6
     2 │ 1.8.6
     2 │ 1.3.6
     2 │ 0.10.6
     9 │ 0.0.6
     4 │ 1.2.6
    10 │ 1.0.6
     5 │ 0.9.6
    13 │ 0.1.6
     2 │ 1.2.11
     4 │ 2.3.11
     2 │ 1.4.11
     1 │ 3.0.11
     2 │ 0.0.11
     2 │ 1.2.9
     2 │ 1.4.9
     2 │ 3.1.9
     2 │ 3.0.9
     2 │ 0.9.9
     2 │ 1.0.9
     2 │ 1.3.9
     7 │ 0.0.9
     2 │ 1.2.15
     2 │ 3.1.15
     1 │ 2.5.3
     1 │ 0.20.3
     3 │ 0.7.3
     4 │ 1.5.3
     3 │ 2.0.3
     2 │ 3.1.3
     6 │ 2.3.3
     2 │ 0.4.3
    12 │ 1.4.3
     2 │ 0.12.3
     2 │ 1.7.3
     8 │ 1.1.3
    18 │ 1.2.3
     2 │ 3.6.3
     7 │ 0.2.3
     4 │ 2.2.3
    40 │ 0.0.3
     7 │ 0.9.3
     2 │ 9.2.3
     2 │ 7.0.3
     2 │ 6.2.3
     5 │ 0.3.3
    29 │ 0.1.3
    16 │ 1.3.3
    25 │ 1.0.3
     2 │ 0.1.8
     2 │ 1.2.8
     2 │ 1.1.8
     2 │ 1.1.8-beta2
     2 │ 1.1.8-beta1
     2 │ 1.1.8-alpha1
     2 │ 3.1.8
     2 │ 3.0.8
     2 │ 0.9.8
     7 │ 0.0.8
     4 │ 1.0.8
     2 │ 1.0.0-beta3
     2 │ 1.0.0-beta2
     2 │ 1.0.0-beta
     2 │ 2.0.0-alpha1
     2 │ 0.30.0
     2 │ 0.22.0
     2 │ 0.21.0
     5 │ 0.20.0
     2 │ 0.17.0
     4 │ 1.5.0-dev1
     1 │ 1.1.0-beta1
     2 │ 2.4.0
     2 │ 1.0.0-dirty
     2 │ 3.7.0
     4 │ 0.99.0
     4 │ 0.98.0
     4 │ 0.97.0
     4 │ 0.96.0
     4 │ 0.95.0
     4 │ 0.94.0
     4 │ 0.93.0
     4 │ 0.92.0
     4 │ 0.91.0
     4 │ 0.90.0
     2 │ 0.25.0
     2 │ 1.3.0-beta1
     2 │ 1.2.0-beta1
     2 │ 4.2.0
     2 │ 2.6.0
     2 │ 1.8.0
     2 │ 1.0.0-b3
     2 │ 1.0.0-b2
     6 │ 4.0.0
     2 │ 3.9.0
    15 │ 0.7.0
     2 │ 1.0.0-beta1
     2 │ 1.17.0
     2 │ 1.16.0
     2 │ 1.15.0
     2 │ 1.14.0
     2 │ 1.13.0
     2 │ 1.12.0
     2 │ 1.11.0
     2 │ 1.10.0
     2 │ 1.9.0
     1 │ 0.6.0-release1
     1 │ 0.5.0-release1
     1 │ 0.3.0-alpha1
     2 │ 2.10.0
     2 │ 3.6.0
     2 │ 3.5.0
     2 │ 3.4.0
     4 │ 3.3.0
     8 │ 3.2.0
    10 │ 3.1.0
     6 │ 2.3.0
     6 │ 0.10.0
     4 │ 0.16.0
     4 │ 0.15.0
     3 │ 0.14.0
     7 │ 0.13.0
     6 │ 0.12.0
     6 │ 0.11.0
     8 │ 3.0.0
     8 │ 2.2.0
     5 │ 1.7.0
    13 │ 0.9.0
     9 │ 1.6.0
    15 │ 1.5.0
    28 │ 1.4.0
     2 │ 9.3.0
     2 │ 9.2.0
     2 │ 9.1.0
     2 │ 9.0.0
     2 │ 8.3.0
     2 │ 8.2.0
     2 │ 7.5.0
     2 │ 7.4.0
     2 │ 7.3.0
     2 │ 7.2.0
     2 │ 7.1.0
     2 │ 7.0.0
     2 │ 6.1.0
     2 │ 6.0.0
     2 │ 5.2.0
     2 │ 5.1.0
     7 │ 0.8.0
   120 │ 0.1.0
    18 │ 2.1.0
    26 │ 2.0.0
    12 │ 0.6.0
    24 │ 0.5.0
    33 │ 0.4.0
    39 │ 0.3.0
    56 │ 0.2.0
    63 │ 1.3.0
    85 │ 1.2.0
   115 │ 1.1.0
   236 │ 1.0.0
     2 │ 2.0.999
     2 │ 1.2.14
     2 │ 1.4.14
     2 │ 3.1.14
     2 │ 3.0.14
     2 │ 1.2.7
     2 │ 1.4.7
     2 │ 1.8.7
     2 │ 1.3.7
    11 │ 0.0.7
     8 │ 1.0.7
     3 │ 0.9.7
     4 │ 0.1.7
     1 │ 0.20.1
     2 │ 4.0.1
     7 │ 0.7.1
     2 │ 2.1.1
     2 │ 2.6.1
     2 │ 2.5.1
     2 │ 2.4.1
     2 │ 1.6.1
     4 │ 0.5.1
     4 │ 3.2.1
     6 │ 3.1.1
     2 │ 0.8.1
     7 │ 0.6.1
    14 │ 0.4.1
    10 │ 1.5.1
     2 │ 0.13.1
     2 │ 1.15.1
     2 │ 1.9.1
     6 │ 1.7.1
     4 │ 3.6.1
     2 │ 3.4.1
     6 │ 2.3.1
     9 │ 2.2.1
     4 │ 0.11.1
     4 │ 0.10.1
    13 │ 1.4.1
    12 │ 0.9.1
     6 │ 3.0.1
     2 │ 9.2.1
     2 │ 9.1.1
     2 │ 9.0.1
     2 │ 8.3.1
     2 │ 8.2.1
     2 │ 8.1.1
     2 │ 7.5.1
     2 │ 7.4.1
     2 │ 7.2.1
     2 │ 7.1.1
     2 │ 7.0.1
     2 │ 6.2.1
     2 │ 6.1.1
     2 │ 6.0.1
     2 │ 5.2.1
     2 │ 5.1.1
     2 │ 5.0.1
    26 │ 2.0.1
    42 │ 0.2.1
    24 │ 0.3.1
    68 │ 0.1.1
   101 │ 0.0.1
    35 │ 1.2.1
    39 │ 1.1.1
    28 │ 1.3.1
   108 │ 1.0.1
     2 │ 3.1.16

I suspect you could get much more interesting and divergent version data from @Nemo157, since they're looking at Rust crates.

theory commented 4 years ago

And here's a query of the various lengths of the major, minor, and patch parts:

WITH a(version) AS (
    SELECT version FROM distributions
     UNION ALL
    SELECT ext_version FROM distribution_extensions
)
SELECT 'major' AS part,
       COUNT(*) AS count,
       length(get_semver_major(version)::text) AS length
  FROM a
 GROUP BY length(get_semver_major(version)::text)
UNION ALL
SELECT 'minor',
       COUNT(*),
       length(get_semver_minor(version)::text)
  FROM a
 GROUP BY length(get_semver_minor(version)::text)
UNION ALL
SELECT 'patch',
       COUNT(*),
       length(get_semver_patch(version)::text)
  FROM a
 GROUP BY length(get_semver_patch(version)::text)
;

 part  │ count │ length 
───────┼───────┼────────
 major │  2762 │      1
 minor │  2628 │      1
 minor │   134 │      2
 patch │  2708 │      1
 patch │    52 │      2
 patch │     2 │      3
(6 rows)
Nemo157 commented 4 years ago

Here's a dump of all the version numbers of rust crates from about 3 weeks ago: https://pastebin.com/wcKsdvxh

The other major semver user that would have a lot of data for this would be npm; I'm not sure whether they have an easily accessible data source to pull it from, though.

jwdonahue commented 4 years ago

That gives me enough to get the right ballpark, and I can always use more test oracles.


I'll ask NPM for a dump of their version strings. I am sure they have many that are not SemVer as well. I can use that to set up a test framework for SemVer implementations.

jwdonahue commented 4 years ago

Using @Nemo157's link, I massaged the data to break it down into a CSV with count, major, minor, patch, prerelease, and meta fields:

https://pastebin.com/vQizFgWB (expires in 1mo)

I then wrote a short program in C# to process that data:

https://pastebin.com/XWfLk8RQ (never expires).

So out of 11152 version strings:

Chars   %major          %minor          %patch
1       93.2209469154   81.6266140603   72.7941176471
2        5.0304878049   16.8669296987   21.5925394548
3        1.3898852224    0.3228120516    4.1696556671
4        0.3138450502    0.4483500717    0.0807030129
5        0               0               0.6635581062

So by my conservative estimation (just eyeballing it and ignoring table overhead), I'd say you could save at least half the storage space by breaking the version strings into major, minor, patch, prerelease, and meta tags and storing those in string columns. Then you get the advantage that the database can easily sort and do range searches. Since most of the version triple fields are one or two digits, the compares are going to happen at least as fast as if you were comparing integers, probably faster, considering the savings on I/O and data bus bandwidth and higher cache hit rates.

Now it could be that primary and secondary key generation eats some or all of that storage savings, but I think it should more than pay for itself when searching and doing range comparisons. I seem to recall that you're storing blobs, so there are probably no key tables associated with that data, which means a search involves linear processing through potentially the entire table, if my pgSQL neuron is firing correctly after 17 years of quiescence?

jwdonahue commented 4 years ago

I would add that it would probably be worth trying decimal number fields for major, minor and patch. I believe those are stored in a more compact format (BCD perhaps?) and give you the advantage of doing simple math, such as highestPatch + 1.


Avoiding blobs also makes your work more portable across database engines.

theory commented 4 years ago

Well of course one could create a composite type:

CREATE TYPE semver AS (
    major text,
    minor text,
    patch text,
    prerelease text,
    build text
);

But it's not very useful. No stringification, for example. Better would be to update the C implementation with a different (and distinguishable) internal format that would store the parts independently — similar to what we have today, but strings instead of ints for the major, minor, and patch, and the build metadata split off from the prerelease. Something like:

typedef struct semver
{
    int32 vl_len_;
    char major[];
    char minor[];
    char patch[];
    char prerel[];
    char build[];
} semver;

Not sure how to update it from the current representation, unless there's something we can use inside vl_len_:

https://github.com/theory/pg-semver/blob/4d9f92fdcbfce00aeb2689b6dd3e49a885a1e077/src/semver.c#L57-L63

theory commented 4 years ago

Oh, and PostgreSQL just indexes based on comparison operator classes, defined here for btree and hash indexes:

https://github.com/theory/pg-semver/blob/4d9f92fdcbfce00aeb2689b6dd3e49a885a1e077/sql/semver.sql#L194-L228

theory commented 4 years ago

I would add that it would probably be worth trying decimal number fields for major, minor and patch.

Oh well now I'm just confused. That's what we have today, is it not?

jwdonahue commented 4 years ago

Umm... I thought you had a struct with int fields, or am I thinking of some other implementation I spotted in the last week?

jwdonahue commented 4 years ago

Well I guess I'll have to pull and install pgSQL. It's been a long time since I used any SQL, so I can't trust my memory at this point, but I was thinking more along the lines of one or more tables of strings. Until I've run the perf comparisons, I really don't know which wins out, but I suspect the plugin + blob isn't optimal.

Trying to wrap up testing on the new C code. Might be EOW.

jwdonahue commented 4 years ago

So what's the go-to database modeling tool for pgSQL these days?

theory commented 4 years ago

Yes, it's currently like this:

https://github.com/theory/pg-semver/blob/4d9f92fdcbfce00aeb2689b6dd3e49a885a1e077/src/semver.c#L57-L63

But that won't support versions such as (from the test corpus):

99999999999999999999999.999999999999999999.99999999999999999

So I was assuming you had been planning to make it a struct of char arrays.

theory commented 4 years ago

So what's the go-to database modeling tool for pgSQL these days?

Depends on what you want. I like to use psql, the command-line client. For developing databases, I use Sqitch. If you just want a GUI interface, there are a lot of them, but pgAdmin is probably the best known.

jwdonahue commented 4 years ago

Ya, I vaguely remember using pgAdmin in the past. I used to mostly use Slick-Edit and whatever SQL command line tool came native with the database, but that was back when I actually did non-trivial DB work and also had access to some seriously expensive modelling tools.

Yes, you'll have C code that works with strings. You'll probably want to go with the least disruptive option that gets you full compliance initially, but I do intend to run perf tests and storage comparisons with that and a more traditional database solution. I am looking beyond SemVer and existing tooling towards something more flexible.

Everybody trying to bend their workflows around SemVer doesn't make a lot of sense to me anymore.

jwdonahue commented 4 years ago

While I am thinking about it (again): we can set vl_len_ to -1 to indicate the record is in the new format. I suppose that because it's a blob, there are no table updates involved beyond changing some stored procedures and pointing to the new plug-in code? Or do we just make the code simpler and provide an upgrade-in-place option, so that all the records are of the same type? With less than 12K records, it probably wouldn't take long to update them all. Then you don't have to maintain code supporting old records, but I think it means the version strings would be offline until the update completed. I don't think we'd want to try to roll all of that upgrading into a single transaction, and even if we did, it would probably kill performance anyway.

Your head's been in that space and it's really up to you folks anyway, so I'll let you figure that out.

theory commented 4 years ago

vl_len_ is a PostgreSQL-internal value, which according to the source should never be set directly:

https://github.com/postgres/postgres/blob/c301c2e739c642199f9e4c62d867dc7bd5ef0fd1/src/include/c.h#L541-L559

The semver type handles that in one function make_semver():

https://github.com/theory/pg-semver/blob/4d9f92fdcbfce00aeb2689b6dd3e49a885a1e077/src/semver.c#L77-L92

And the function that reads the data from the database simply casts a memory reference to a semver struct:

https://github.com/theory/pg-semver/blob/4d9f92fdcbfce00aeb2689b6dd3e49a885a1e077/src/semver.c#L65

Details on SET_VARSIZE() and other macros are in the docs for C data types:

Finally, all variable-length types must also be passed by reference. All variable-length types must begin with an opaque length field of exactly 4 bytes, which will be set by SET_VARSIZE; never set this field directly! All data to be stored within that type must be located in the memory immediately following that length field. The length field contains the total length of the structure, that is, it includes the size of the length field itself.
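
In code, that allocation pattern looks roughly like this (a minimal sketch, not the exact make_semver() body; data_len here is just a stand-in for the payload size):

#include "postgres.h"

/* Allocate a zeroed semver value and let SET_VARSIZE fill in the opaque
 * 4-byte header, per the docs quoted above. */
static semver *
alloc_semver(Size data_len)
{
    Size    size = VARHDRSZ + data_len;
    semver *sv = (semver *) palloc0(size);

    SET_VARSIZE(sv, size);
    return sv;
}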

So I don't know if we can abuse it the way you suggest. :-(

jwdonahue commented 4 years ago

Well then just one negative version field should be good enough. Do you know what the field alignment is on that struct? Is it packed or something else?

I am not sure this can work at all:

typedef struct semver
{
    int32 vl_len_;  /* varlena header */
    int32 dataVersion;
    // How to do variable length strings in this environment?
} semver;

I think you may be screwed with this blob. I think it's clear you have a variable length struct in the original code. If we pack a bunch of strings onto the end of the struct, then we need a bunch of start/end indexes, which seriously defeats the purpose. Unless there's a way to use pointers to strings?

You could just store the whole string, but then you have to parse your way through it for every comparison or, again, store a bunch of additional index and length values. So what I am doing probably isn't a solution for you, without introducing a breaking change involving a new database schema.

So, just to get you better than average conformance, I would create a union of your existing structure:

/* memory/heap structure (not for binary marshalling) */
typedef struct _SemVer1
{
    int32 vl_len_;  /* varlena header */
    vernum numbers[3];
    char prerel[]; /* pre-release, including the null byte for convenience */
} SemVer1;

typedef struct _SemVer2
{
    int32 vl_len_;  /* varlena header */
    int32 dataVersion; /* -1 == new format, > -1 for old (as it is now)*/
    uint128 major; // Just use the biggest unsigned integer available!
    uint128 minor;
    uint128 patch;
    uint32 metaTagOffset; // Offset into tags where build meta begins.
    char tags[];
} SemVer2;

typedef union 
{
    SemVer1 v1;
    SemVer2 v2;
} semver;

If v2.dataVersion > -1, use the old structure fields (v1); if v2.dataVersion == -1, use the new fields (v2). You'd be counting on dataVersion to overlap completely with either numbers[0] or numbers[2], depending on architecture. For that to work, the packing/alignment of the new union must be identical to the old struct. I don't see anything in your environment that indicates packing/alignment has been changed from whatever the default configuration is for your tool chain.
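
A sketch of the check itself, assuming the union above (the helper name is mine, purely illustrative):

#include <stdbool.h>

/* True when the stored value uses the new (v2) layout; callers would then
 * read v2.major/minor/patch/tags instead of v1.numbers/prerel. */
static bool
is_new_format(const semver *sv)
{
    return sv->v2.dataVersion == -1;
}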

I can't compile any of your code, so I can't make the changes for you.

theory commented 4 years ago

Well then just one negative version field should be good enough. Do you know what the field alignment is on that struct? Is it packed or something else?

I don't. Now would be the time to ask pgsql-hackers for advice.

    // How to do variable length strings in this environment?

I believe Postgres handles it internally. We just give it a memory reference where the first VARHDRSZ bytes define the amount of memory allocated.

Unless there's a way to use pointers to strings?

I suspect we can just keep the whole thing a pointer. What if we defined the new type as

typedef struct semver {
    VARHDRSZ vl_len_;
    major char,
    char major[];
    char minor[];
    char patch[];
    char prerel[];
    char build[];
} semver;

Then in the function we check the byte after VARHDRSZ, and if it's negative, we know it's the new format, but if it's 0 or positive, we know it's the old and convert accordingly. (I'm assuming that a negative char would also read as a negative value in the first int32 of the numbers[3] array in the old format, but I could easily be mistaken, in which case we'd need to use an int32 for the version instead.)

Reading the rest of your message, it sounds like your suggestion is similar, but with a union (we have exceeded my minuscule knowledge of C).

I can't compile any of your code, so I can't make the changes for you.

Oh? That's odd.

theory commented 4 years ago

Oh, yes, a union looks super handy for a case like this.

theory commented 4 years ago

Oh, and I suspect a v2 struct like yours would be essential; there can probably be only one variable length field at the end of the struct.

jwdonahue commented 4 years ago

Can you provide me with the exact compiler command line and compiler version used to build the code? It's probably buried somewhere in a network of Travis config files and might be easier to find in a build log. You might have to enable build diagnostics to get that.

There are too many possibilities when it comes to how various C compilers handle variable-length structs (i.e., the last field is an array of the form '[]'). So many possibilities that I don't even remember if support for that ever got into the standard, and if it did, which one? Even so, when it comes to alignment and packing, the default behavior under certain optimizations can be surprising at times, especially when they are tossed into a union. I seem to recall that given {int one} and {int two[]}, a union will merge such that one and two[0] overlap, but I am pretty sure that nothing in any of the C standards requires an int to align on the same address as an int array, so these sorts of things are usually defined by the compiler docs and always dependent on compile-time configuration.

theory commented 4 years ago

If you've installed PostgreSQL, you will also need the development libraries. What OS are you on, and how have you installed it?

Easiest for a *nix-type system would probably be to use pgenv to compile it from source, so that you then have a default installation without any OS packaging shenanigans. If you've installed it via a packaging system, install the devel package too, as the extension needs it to build. On Ubuntu flavors (including the Postgres Docker images), I run

apt-get -qq install build-essential postgresql-server-dev-12

to get what I need to build against Postgres 12. Once you have Postgres running and all its headers, libraries, and pg_config in your path, you should be able to simply

make
make install
make installcheck
jwdonahue commented 4 years ago

I am limited to 2GB of data for the next month and I run Windows. Not installing any unnecessary software on my system. You don't have any build logs?

theory commented 4 years ago

I understand. I have no build logs, because I haven't tried to change anything (yet). I was thinking of playing around with your parser once it was ready.

jwdonahue commented 4 years ago

Well it's not too critical. We can always try it and see if it works. There's a good chance it will.

I am currently testing the library after taking a brief detour to write some code to analyze semver strings. I'll push the code to my CSemVer-Implementation repo shortly, if I don't lose communications again. The local networks have been overloaded since the curfews went into place.

jwdonahue commented 4 years ago

The code is posted here: https://github.com/jwdonahue/CSemVer-Implementation

Any code reviews/suggestions welcome. It's an incomplete WIP, but I should wrap it up soon.

theory commented 4 years ago

Oh interesting, it's basically a parser, but it doesn't rewrite the data into a canonical struct that can then be used for comparisons and the like. For the Postgres type, that would mean having to parse it for comparisons or to extract parts, not just on input. I can follow the code pretty easily, though, and could see modifying the input function here to use the parser and then use the VersionParseRecord to write it into a struct for storage. Could use the existing struct to begin with, but maybe it would be more efficient to simply store the char array reference and parse it any time we need to do something with it?

jwdonahue commented 4 years ago

I am planning on adding a function to output major, minor, patch, prerelease, and metatags as single strings. Storing the strings would be more efficient, and I would use the function to write a record, not a custom data type. Separate tables for major, minor, patch, prerelease fields and metatags would be far more efficient and generally more useful. A semver record would just be a list of references to records in other tables. Use the C code to parse and classify input strings, then use standard database practices for storing the data. It's all just strings and references to strings. That's what database engines do best. A single stored procedure for sorting might call into an extension, but I would write something far more portable than that.

jwdonahue commented 4 years ago

The VersionParseRecord and ParsedTag records are used for comparisons, but they are intended to be transient. They can easily be used to build a table record full of string fields. Once you have a record of string fields, sorting and comparing is a matter of writing portable SQL queries.

jwdonahue commented 4 years ago

The way I am coming to think of it, there are numeric fields and alphanums. Looking at the data collected for analysis thus far, it's looking like there are a lot of duplicate string segments in both categories. You could greatly reduce storage requirements by fully normalizing. Let the DB engine cache the most common field values and the results of comparisons between them, and you should get a perf improvement. You don't need a DB extension to compare "1" to "2", right?

So all you need is an efficient front end to classify/parse out the strings. Given the complexity of that logic, it's probably best solved in an imperative language like C, but sorting and satisfying range requirements are best solved in a (mostly) declarative language like SQL.

theory commented 4 years ago

I could see the use of a function that returns a record, something like:

> SELECT semver_to_record('1.2.0-beta');
 major | minor | patch | prerel |  meta  
-------+-------+-------+--------+--------
 1     | 2     | 0     | beta   | [null]

But in my mind that's just an extension of the functionality. I think most folks think of semvers as scalar values, and the data type defined by this extension is super useful for that purpose, allowing one to encapsulate validation and sorting in a way that would be much more complicated to do in plain SQL. It's not dissimilar in my mind to the timestamp type, which one does not generally store in a table record with columns for year, month, day, hours, etc. Instead, it's a scalar value with a (basically) encapsulated inner format that one needn't worry about. This makes them very easy to work with, and all the efficiency is ensured by the implementation.

Perhaps there are cases where one might want to split up a semantic version to manage its parts in SQL as database records, but I would expect that to be the exception (supported by a function such as the sketch above). More useful to most folks is a scalar type that encapsulates most of that sort of functionality, and does its best to use a space-efficient internal representation. The current representation is good enough for most purposes, aside from numeric parts greater than int32. If we wanted to solve that problem, we would need to change the internal format, and the question, in my mind, is to what, exactly, and how to distinguish the old from the new, and whether to trade more storage for more computational efficiency.

jwdonahue commented 4 years ago

A timestamp is easily converted to a single scalar value for efficient storage and comparison. Time input strings are translated to a count of a fixed time increment since some epoch. The fact that there is an efficient and deterministic algorithm to translate to/from scalar and string formats, and the fact that you do in fact save on storage as well as sort/compare time, ranks high among the few reasons to use a scalar timestamp. The other reason is that the scalar timestamp is universal. Without it you store times in one format and then translate to another. But it does have one major drawback: the architectural limit on the size of scalar types, and the time remaining to the end of the current epoch (anyone remember Y2K?).

As far as I know the closest thing to an efficient scalar storage format for semver is in fact the entire string of bytes. But that's not enough to improve on sort and compare, because the string is a composite value with variable length fields. Hence my suggestion to store a string for every dotted field. Databases compare and sort records of strings very efficiently and a fully normalized relational database can also be extremely storage efficient as well.

The current pg-semver implementation does not store a scalar; it's a composite record. I suspect it's non-optimal in both storage and sort/compare time. While you do convert the triple fields to integers, you do not preprocess the prerelease string, so on every compare operation you have to take that apart for comparison. You've basically not achieved anything like the storage and perf improvements of the timestamp over the formatted string. From the data I've collected so far, it would make more sense to use a single uint64_t for the entire version triple, with the two most-significant bytes allocated for major, the next three bytes for minor, and the remaining least-significant bytes for patch, at least from a storage/compression and compare perspective.
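
For illustration, a sketch of that packing (assuming major fits in 16 bits and minor/patch in 24 bits each; nothing here is existing pg-semver code):

#include <stdint.h>

/* Pack the triple into one uint64_t: 16 bits major, 24 bits minor, 24 bits
 * patch. Packed this way, plain unsigned comparison orders values by
 * (major, minor, patch). */
static uint64_t
pack_triple(uint64_t major, uint64_t minor, uint64_t patch)
{
    return ((major & 0xFFFF)   << 48)
         | ((minor & 0xFFFFFF) << 24)
         |  (patch & 0xFFFFFF);
}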

But you have data in that format and a migration problem, so by all means, change a few datatypes to allow larger triple fields, and continue as you are currently doing. I've already suggested one way you can maintain back-compat with the existing data. I wouldn't change anything else until I had good performance data to evaluate.


Addendum: the most efficient semver storage format would store compressed strings, using a fixed code dictionary, such that -alpha would be stored in a few bits. Not sure if there's a compression algorithm that would work for us that also allows sorting the compressed data.

theory commented 4 years ago

A timestamp is easily converted to a single scalar value for efficient storage and comparison.

Yeah, not a perfect analogy.

As far as I know the closest thing to an efficient scalar storage format for semver is in fact the entire string of bytes. But that's not enough to improve on sort and compare

Yes, it's the most space-efficient for sure, and yes, it would require reparsing for every comparison. The thing is, I don't know if that's actually very high of a cost, because the parsing is pretty efficient (O(n) or close to it), semvers are nearly always quite short, and I don't know how it compares to the overhead for comparing each part.

Hence my suggestion to store a string for every dotted field. Databases compare and sort records of strings very efficiently and a fully normalized relational database can also be extremely storage efficient as well.

Here's where things get a little confusing, because when one talks about database normalization, one normally means representing values in tables. But data types are just blobs of memory to the database, with no knowledge of what's in them aside from the functions to call for particular behaviors. It certainly doesn't know if a blob of bytes can be represented by a string, a struct, or anything else. I do know that it can compress values over a certain size via TOAST. I suspect it would rarely compress semantic versions. Hence the current semver type doesn't bother with TOAST.

The current pg-semver implementation does not store a scalar, it's a composite record.

I think we're getting into semantics here, since to Postgres it's just a blob of bytes. It only has meaning within the context of the functions for the type. It's also not represented as a database record, hence my confusion when you refer to the struct as a "composite record."

From the data I've collected so far, it would make more sense to use a single uint64_t for the entire version triple, with the 2 msB's allocated for major, the next three bytes for minor and the remaining lsB's for patch, at least from a storage/compression and compare perspective.

Sure but what about when the sizes add up to more than 64 bits?

But you have data in that format and a migration problem, so by all means, change a few datatypes to allow larger triple fields, and continue as you are currently doing. I've already suggested one way you can maintain back-compat with the existing data. I wouldn't change anything else, until I had good performance data to evaluate.

Okay. In that case I'm inclined to leave things as they are until such data has been generated. Less work, works for nearly all cases (aside from #47), and we can come up with a longer-term plan with the data in hand.

Thoughts?

theory commented 4 years ago

I think 1.0.0-beta1 is gone now. I expect to come back to this issue often for the discussion, though; it has been super helpful.