Ideas & resources for new crate design

Here I describe some ideas, thoughts, resources and possible design decisions I gathered during a brainstorm session to help developing a new crate design for version-compare in the future. It lists things I like to have, some facts we can use, a simple list of existing version schemes and so on. I hope this helps to shed light on all desired features in a new design, by distilling what is used currently.

Some things may not be implementable, might be incorrect and I'm probably missing a lot. That's something to look into later.

Like to have

Here are some ideas I'd like to have:

Version & Ver to represent comparable version numbers
Like String and str, owned and referenced
Easily compare Version and Ver instances
Implement comparison traits, and useful helper functions, similar easy-to-use API to version-compare 0.0.10
Version schemes, for different version parsing and comparison logic (semver, gnu, ...)
Generic Version & Ver to compare on, do not use different structs for all kinds of different schemes
Great extensibility
Allow users to externally define their own version schemes, to be a framework to compare different version styles as well. Good implementations could be merged into this crate later on. Also allow minimal configurability on some existing schemes, such as the generic/default scheme used for comparison to tweak sorting.
Stream the parsing version number components:
Could speed things up when comparing large version strings
Validity checking
Allow checking whether a version string is valid given a version scheme.
Configure comparison logic between each component type
1! > 2: Epoch(1) > Major(2)
Compare across different version schemes (would this even even be possible?)
Support version operator expressions (Rust, npm)
Testing operators such as > ,>=, !, *, ^, ~
>=1.2, ^1, ~1.2.3, 1.2.*

Design facts

Here are some design facts/choices that I think will be assumed/used for the redesign of this crate. This could be used to build the core Version and Ver structs upon, allowing support for a large spectrum of version numbers:

Versions are sequence based (https://en.wikipedia.org/wiki/Software_versioning#Sequence-based_identifiers)
Versions can be represented as a string
Versions have components (might just be a single one)
Versions may have an arbitrary number of components
Version schemes may introduce any number of additional local constraints
Version components are always sequenced from major to minor (little-endian)

Not compatible with everything in ..., but compromises must be made:
https://xenoterracide.com/post/falsehoods-programmers-believe-about-versions/

Existing version schemes

Some version schemes that were mentioned and used right now. Could be used to source mechanics from to validate a new crate design is sufficient, and version schemes could be implemented for it:

semver
- Useful regex
GNU (#15)
- Leading zero sorting
Debian
- 1: epochs
- -debian-revision: suffix
- Ubuntu:
  - 5:5.0.3-4ubuntu0.1
rpm
- 1: epochs
Python
- Conda (mostly based on PEP 440, but some exceptions)
  - 1!: Epochs
  - special literal cases
PERL (old)
- float as version number

Core version components

Component types that might be used globally in the core of the crate:

Integer:
- May be negative
- Sorting:
  - Decimal sorting
  - Leading zero sorting (#15)
Float
Literal (string):
- Alphabetical sorting
- Lexicographical sorting (alphabetical) (#16)
- Special case sorting (dependent on version scheme):
  - dev: always lower than other releases
  - post: always higher than otherwise similar releases

Special version components

Component types that might be a version scheme specific implementation:

Epoch
- 1!: conda
- 1:: RPM
Patch version
Pre-release version
- SemVer
Build number
- SemVer
VCS revision info
- Git tag/commit
- Mercurial
- svn

Examples

Some version numbers I'd like to support, in different styles:

1
1.2.3
1.2.alpha.3
2.0.0-rc.2: semver
1:1.2.3-debian2: Debian
1!1.2.3-post: PEP440/conda
1.0a0: conda: the a0 is separate from the 1 and 0 parts. Pre-release indicator.
1.0.2b12: used by Apple
v1.2.3: common in tags
19999.00071
YYYY-mm-dd: dates
YYYYmmdd: dates
0.8.1+git.20100810t184654.ab580f4-0ubuntu2: dpkg package for git on Ubuntu
MyApp 3.2.0 / build 0932
1.0.2d: openssl, which has releases with & without the letter, but will gain sanity soon.

And some components as well:

1!: epoch in Conda
1:: epoch in RPM
~suffix: ~ is lowest when sorting in Debian
.suffix
-suffix
floating point value: in old PERL
1a: letters in components
-1.2.3: negative components
1.-2.3: negative components

Other resources

Some other useful resources, not really covered yet:

Edit: added suggestions from https://github.com/timvisee/version-compare/issues/20#issuecomment-537302014

This is a good summary, thank you for putting it together. I'd like to add comments, but the github issue format here feels cumbersome. Do you mind moving it to a Google Doc or similar to facilitate making (suggested) edits? I have started one at https://docs.google.com/document/d/1HlroZ02HcXzyIFGcQcLGs9QkGyQ5945Sr0GCHX7S6SU/edit?usp=sharing and added my comments there. I'm happy to move to any other medium if you prefer, but preferably after we settle my comments in that Google Doc, just to avoid needing to transfer them somehow.

PS: I'm happy to give you edit rights on that doc, or reassign it to you. I just don't know your email address and don't want to post a link-share edit thing.

Thanks a lot for the thorough look. I've commented on most of your comments in Docs.

If there's only one struct, how would you differentiate between results that use different schemes?

I usually like to keep things as generic as possible, if it doesn't hurt flexibility too much. Things like this are hard to comprehend, but can be super beneficial in the long run when done properly, when implementing a lot of schemes.

I'll try to explain what 'perfect' (in my opinion) situation I'm currently thinking about (in undefined points and order). I've changed my thoughts on design slightly since my last post, but here it is:

Parse versions in a Version struct. Use a default scheme, but allow selecting a custom scheme such as one for PEP440/Conda.
Version structs would probably remember what scheme it uses, but maybe that isn't even important.
The actual parsing to components would happen in the scheme implementation, we could probably provide various helpers for this as parsing version numbers is mostly similar.
The Version struct would contain a list all parsed version components.
The core crate would have a list of generic components available, such as Number and Literal, maybe Epoch and a few others as well even if unused in some version schemes.
Version schemes could implement their own component versions as well, possibly through some trait, with more specialized functionality. Imagine a component for holding a date, a git revision or whatever. Components that appear to be used in multiple schemes could eventually be moved into the core list of components, if their behaviour is consistent.
Comparisons between components should be defined. A number would be compared as decimal, literals in alphabetical order, git revisions probably won't be comparable due to being a hash, ...
Maybe define comparison logic across different component types as well, for Epoch(1) > Number(2). Not sure if this is useful though. Maybe something similar can be achieved by reordering components while parsing.
The core crate should define these comparison implementations for the core components, schemes should define these for component types they implement themselves.
The core system would always start comparing components from most significant to least significant by default, using these comparison implementations to make decisions.
It it happens that the end user compares two Versions having a different schematic, having components that are not compatible (comparability wise), the crate would probably just return an 'unknown'/'undefined'/'not compatible' result.

Now for some scheme specific cases:

The dev and post suffixes for Conda will probably get their own specific component, as their comparison logic differs from a regular literal (being alphabetical). Comparison logic between the special Conda literal and the regular literal could be implemented (so across different component types in this case), to always favor the special type as newer (not sure what the Conda version spec specifies at this moment).
In the case of GNU, a special component will be created for the leading zero numbers, for example LeadingZeros(num). This component would implement comparison logic the way GNU specifies for numbers.
In the possibly unlikely case a scheme uses negative version numbers. Only the parsing logic is changed slightly to pick-up negative numbers from the version string, it will then just use the core Number component and everything should work out-of-the-box.
RPM and Conda versions could both use the same Epoch component even though their separation token is different. (I'm assuming here they work in a similar way).
Looking at the VersionManifest, if a runtime configurable scheme is ever desired, I think this could be implemented as well. Imagine that your configurable scheme allows setting how literals are ordered. A new component could be introduced for these literals, having their ordering setting attached, or having a reference to the configured scheme that will in the end be used by attached comparison logic. All literals from a version string would then be parsed as this special literal component.

You get the point, all schemes have a lot of interesting edge cases with scheme-specific special components. Using this implementation, component types would probably define their significance, order and relation against other components.

Currently, in version-compare, when comparing a version number it walks through the components (parts) of both (zip, but smarter) versions at the same time, comparing each other. If one is larger than the other, that version is newer. If components are the same the system continues to the next component in both versions. Comparing to no component (in 1.2 and 1.2.3, comparing the 3 to no component from the other) would always be newer. Comparing a number or no component to a literal would always be newer. I like this idea a lot but it currently isn't extensible (nor doesn't suffice for different schemes) because all comparison logic is implemented in a single function. Attaching this the actual component types itself might solve this. I really hope work with and extend this core idea, because it seems like such a good idea to me. This is also why I'm constantly referring to a generic Version for all schemes by the way.

The question is, whether a new architecture like this would support all version schemes we can currently come up with with the constraints you might set on components. Maybe comparing different component types is enough. Maybe components should have an significance value. Maybe components should return what their relation is to other component types, I don't know.

So, to summarize:
A parsed version would be represented as list of components, consisting of components. The set of components is extendible. These have comparison logic attached to them. The create compares from left to right, component by component.

This idea attempts to heavily generalize all version numbers into something that fits all while assuming the 'Design facts' as mentioned before. It will never be super strict, but could 'just work' in many unthought of situations with weird version string formats.

Now, this rough idea definitely won't translate into Rust right now. I'm not even sure if a system similar to this idea is properly and efficiently implementable (if assuming this design would indeed work for all imaginable version schemes we'd like to support).

I'm currently thinking really hard to figure out whether a system like this could work. I haven't really thought of any blockade yet (a version scheme choice) that would immediately invalidate this generic system and make it useless. But I might just be missing something big here.

Perfectly comparing anything imaginable will always remain an impossible task, and this will always be a best effort approach. Schemes that differ so much from everything else are probably better of with a full custom comparison implementation anyway. Sadly, schemes and given versions are inconsistent.

What are your thoughts on the rough idea I've explained here? Could it hold up for everything we're trying to achieve here? Is it too much?

Would implementing every scheme separately (live you've suggested) be beneficial? Could we prevent lots of duplicate code with such an implementation? Wouldn't it be better for an approach like this to separate every scheme in a different tailored crates anyway as reimplementing a lot of similar stuff is probably required?

Sorry if what I'm saying here is unclear, and sorry for putting down 1.2k words. I always find it really hard to abstract/summarize my ideas in clear understandable terms. I hope the examples I gave help clarifying. Please correct me if I'm saying stupid things :)

I'm fine with discussing this through Docs, my Google address is my GitHub username with @gmail.com attached.

I've updated the first post with the changes you made in the linked document.

Thanks again for your time, thoughts and comments!

I hadn't thought of the use of custom components (and the creation of those components by the specific scheme) as the way to achieve custom comparison. I like it a lot! I think you're onto a very nice, clean way to do things. Perhaps the usage patterns could be made nicer by currying the Version struct instantiation with a given scheme, such that Conda could hide the need to pass the scheme explicitly each time. Maybe that happens in version-compare, or maybe it happens on the external (conda) side. Probably the latter.

I think the addition of scheme-specific info like epochs to the data structure for all schemes is not great, but it's also not horrible. In my opinion, things like epoch must be compared separately from the other version stuff. I don't like comparing 1!1.0 with 2.0 and say that 1:1.0 is higher on the basis of Epoch(1) trumps Version(2). Rather, Epoch(1) trumps the implicit Epoch(0). The vector comparison for other numbers otherwise holds, and then I think the end pre/post/git tag/whatever release may again need special treatment, but perhaps not, since Number(1) will trump Alpha(x) or Date(x).

Now, this rough idea definitely won't translate into Rust right now.

Why not? What specific parts do you think are missing? Is this at a point where some pseudocode and rough placement of data structures is appropriate?

... currying ...

Currying is a good idea, though that doesn't work too well in Rust at the moment. This could be in schemes, in the form of MyScheme::parse(ver) -> Version::parse(ver, scheme). Implementing this externally would be a nice alternative, and is up to the end user of course.

I think the addition of scheme-specific info like epochs to the data structure for all schemes is not great, but it's also not horrible.

I thought this would be common enough, and of course, it does not matter if schemes don't use such component.

In my opinion, things like epoch must be compared separately from the other version stuff. I don't like comparing 1!1.0 with 2.0 and say that 1:1.0 is higher on the basis of Epoch(1) trumps Version(2). Rather, Epoch(1) trumps the implicit Epoch(0). The vector comparison for other numbers otherwise holds, and then I think the end pre/post/git tag/whatever release may again need special treatment, but perhaps not, since Number(1) will trump Alpha(x) or Date(x).

I think you're suggesting to do version checking in various passes with a collection of components. So, first a pass for epochs, then a pass for numbers, then a pass for suffixes. Is that right? Thinking of it that way sounds like a good idea.

Now, this rough idea definitely won't translate into Rust right now.

Why not? What specific parts do you think are missing?

For example, for comparison you would normally implement Ord on structs (or components in this case). This works well when using the same type. But if comparison across different component types is needed, just for some combinations, this won't suffice. It's not a good idea to check at runtime whether a component has such trait implemented (I think, if even possible). I don't see a way yet to statically (typed) compile this.

And a list of generic components in Version would probably require a list of Box<Component> (where Component is a trait). Putting it on the heap this way isn't too great for performance either. using an enum (the usual alternative) probably isn't really an option because an enum is not extendable.

Is this at a point where some pseudocode and rough placement of data structures is appropriate?

Probably, yes.

Currying is a good idea, though that doesn't work too well in Rust at the moment. This could be in schemes, in the form of MyScheme::parse(ver) -> Version::parse(ver, scheme). Implementing this externally would be a nice alternative, and is up to the end user of course.

Agreed that the scheme is the right place to put this. Thanks!

I think you're suggesting to do version checking in various passes with a collection of components. So, first a pass for epochs, then a pass for numbers, then a pass for suffixes. Is that right? Thinking of it that way sounds like a good idea.

I thought about this more last night. I'm really torn on whether this is actually a good idea. Making it multi-pass loses a lot of generality, and I'm not entirely sure it's necessary, as long as the type definitions (including interactions between types) really can cover all comparisons that might happen. Comparing types 1:1 is not strictly necessary, and probably not even helpful aside from perhaps reducing some code necessary for defining comparison relationships between types. I'd like to try to avoid this multiple pass approach, but keep an eye out for reasons why it might be necessary.

Regarding your points about Rust's language features, I'm just not experienced enough to understand much here. You've given me good things to read about and tinker with, though. I'll try to come up with some toy code soon - today if time allows, but perhaps this weekend.

timvisee / version-compare