orbitdb-archive / ipfs-log

Append-only log CRDT on IPFS
https://orbitdb.github.io/ipfs-log/
MIT License

Backward compatibility of the Log #211

Open vvp opened 5 years ago

vvp commented 5 years ago

As OrbitDB version 0.20.0 (orbitdb/orbit-db#524) is getting nearer and work on IPLD support (#200) has started, it would be a good time to discuss the backward compatibility of the log. Currently there is not much of it:

For example, in the next release there's a new identity field in the entry structure. The current version expects it to be there when entries are loaded from IPFS, and the access controller will actually fail if there's no identity information in entries to append. All the log entries created with previous versions will not have this information. Fortunately, this check is done only on new items appended/joined into the log, so appending new entries to old logs will still work after a version upgrade.
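That asymmetry (load works, append is checked) can be sketched as follows; the helper names and access-controller shape here are hypothetical, not ipfs-log's actual API:

```javascript
// Hypothetical sketch: entries written by older versions carry no
// `identity` field, so the identity check applies only when appending
// new entries, never when loading existing ones from IPFS.
function canLoad (entry) {
  // Loading is always allowed; old entries simply lack an identity.
  return entry !== null && typeof entry === 'object'
}

function canAppend (entry, accessController) {
  // New entries must carry identity information for the access controller.
  if (!entry.identity) return false
  return accessController.canAppend(entry)
}

// An old-format entry loads fine but would be rejected on append as-is.
const oldEntry = { payload: { op: 'ADD', value: 1 } } // no `identity` field
const allowAll = { canAppend: () => true }
```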

Some design aspects that I see:

Any thoughts, opinions? :slightly_smiling_face:

aphelionz commented 5 years ago

I'll kick off the discussion with a proposal that we use a monotonically increasing version field inside the entries themselves, the absence of which is to be treated as the value 1. The field's explicit value will start at 2.

This has the benefit of freeing the entry versions from having to be in lock-step with the package version, and gives us the added benefit of being able to join logs of different versions. If possible, it's best to leave the old entries where they are instead of recreating/duplicating them in a migration.
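As a rough sketch of that proposal (the helper names are hypothetical, not from ipfs-log):

```javascript
// Read an entry's version, treating a missing field as version 1,
// per the proposal above. Explicit values start at 2.
function entryVersion (entry) {
  return entry.version === undefined ? 1 : entry.version
}

// Joining logs of different versions could then normalize per entry,
// tagging legacy entries explicitly in memory without rewriting them.
function normalize (entry) {
  return entryVersion(entry) === 1
    ? { ...entry, version: 1 }
    : entry
}
```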

Entries without identities: leave as public?

haadcode commented 5 years ago

Lots of great thoughts here, thank you @vvp and @aphelionz!

monotonically increasing version field inside of the entries themselves

Agreed. We have this as the v field now, as @vvp mentioned, which is set to 0 atm. For starters, we should increase the version number to 1 :)

Entries without identities: leave as public?

Old versions also have a signature, under the .key field in the data structure, which maps to identity.publicKey in the new version.
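A minimal normalization sketch based on that field mapping (the function is illustrative, not part of ipfs-log):

```javascript
// Map the old entry shape to the new one: old entries store the signing
// key under `key`, which corresponds to `identity.publicKey` in the new
// structure. Entries with neither field are left untouched.
function normalizeIdentity (entry) {
  if (entry.identity) return entry // already in the new shape
  if (entry.key) {
    return { ...entry, identity: { publicKey: entry.key } }
  }
  return entry // a truly identity-less entry
}
```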

Should there be support for multiple versions at the code level, or a requirement that older log versions be migrated to the single code-supported version first? Supporting multiple log/entry versions can make development quite troublesome and error-prone, whereas requiring migrations makes the upgrade process more involved (especially with larger logs).

This is very true. I don't think we can "migrate" the logs in a way that the actual entries will be converted to the new structure due to the signatures in each entry. Which, I believe, leaves us with the second option of supporting multiple versions. However, as you say @vvp, this can make the code quite complex and highly error-prone, so it seems to me that the question is:

Do we want to or need to support multiple versions? If not, what are the consequences to users? If yes, what are the consequences to development and for maintainers (eg. do we commit to support all versions of logs from today all the way to the far future)?

Is there a way we could provide migration tools such that the end-user initiates and authorizes the migration (i.e. they re-sign all converted/transformed entries), instead of developers building on/with OrbitDB?
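Such a user-initiated migration might look roughly like this; `identity.sign` stands in for whatever signing API the identity provider exposes, and none of these names come from ipfs-log:

```javascript
// Hypothetical user-initiated migration: the identity owner converts
// each entry to the new structure and re-signs it, so the signatures
// on the migrated entries are valid under the new format.
function migrateLog (entries, convert, identity) {
  return entries.map(entry => {
    const converted = convert(entry)                  // old shape -> new shape
    converted.sig = identity.sign(converted.payload)  // owner re-signs
    return converted
  })
}
```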

satazor commented 5 years ago

This is very true. I don't think we can "migrate" the logs in a way that the actual entries will be converted to the new structure due to the signatures in each entry. Which, I believe, leaves us with the second option of supporting multiple versions.

I've been thinking about the same thing, but in https://github.com/peer-base/peer-base land, and this is the way to go. There's always the latest canonical version of the data-structure, and we must convert old versions to the canonical version when reading. This means we must tag those data-structures with their versions and have code to migrate to the latest version incrementally.
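A sketch of that incremental migration, assuming each version carries an up() transform from the previous one (names are illustrative, not from peer-base or ipfs-log):

```javascript
// Each version knows how to upgrade data from the previous version.
// This toy table assumes version numbers equal their array index.
const versions = [
  { version: 0, up: (data) => data },                          // initial shape
  { version: 1, up: (data) => ({ ...data, identity: null }) }  // adds a field
]

// Convert any tagged data-structure to the latest canonical version by
// applying each up() transform in order.
function toCanonical (data) {
  let v = data.v === undefined ? 0 : data.v
  let out = data
  while (v < versions.length - 1) {
    v += 1
    out = { ...versions[v].up(out), v }
  }
  return out
}
```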


Also, I would think that embracing conventional commits would improve the visibility of changes to developers. Many projects in IPFS land already use them. You may check how to quickly set up the necessary tooling on some repos, for instance, this one. Basically:

aphelionz commented 5 years ago

I made a comment in @satazor 's PR that begins to address this: https://github.com/orbitdb/ipfs-log/pull/213/files#r244635479

Reading back through these comments, I believe we should increment the version number (the v field) from 0 to 1 as well.

satazor commented 5 years ago

I would like to make a more formal proposal based on the discussion we had on #213.

Data-structures

It's normal for the data-structures of ipfs-log to evolve over time. This happened once when we introduced IPLD link support, and it will eventually happen again in the future.

All the code that interacts with those data-structures should always assume that they are in the latest version. This makes the code easy to reason about because there's only one shape of the data-structures: the most recent one. Instead of doing this in an ad-hoc manner, we should come up with a scheme that allows us to transform those data-structures from older to newer versions and vice versa. These are the scenarios to take into consideration:

That said, I propose tagging all the data-structures with a v property that contains their version. We already have that set up for entries, but not for logs. Assuming that we now have a consistent way to identify the version of a data-structure, we may have a versioning pipeline based on the following scheme:

const schema = {
  versions: [
    {
      version: 0,
      up(data) {},
      down(data) {},
      codec: { name: 'dag-pb-v0' }
    },
    {
      version: 1,
      up(data) {},
      down(data) {},
      codec: { name: 'dag-cbor', ipldLinks: ['next'] }
    },
    // more in the future...
  ],
  codecs: {
    'dag-pb-v0': {
      matches(cid, dagNode) {},
      fromDagNode(dagNode) {},
      toDagNode(data, ipldLinks) {}
    },
    'dag-cbor': {
      matches(cid, dagNode) {},
      fromDagNode(dagNode) {},
      toDagNode(data, ipldLinks) {}
    },
    // more in the future...
  }
}

...where:

A versioning pipeline based on the schema would have the following API:

verPipeline.read(schema, dagNode): data

Reads the underlying data of the dagNode.

verPipeline.write(schema, data): dagNode

Creates a dagNode for the data, according to its version.
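A rough sketch of how read and write could walk such a schema; this assumes version numbers match their index in the versions array, and threads a cid through to matches(). It's illustrative only, not an existing implementation:

```javascript
// Sketch of the versioning pipeline: read decodes a dagNode with the
// codec that recognizes it, then migrates the data up to the latest
// version; write always encodes with the latest version's codec.
const verPipeline = {
  read (schema, dagNode, cid) {
    // Find which version's codec recognizes this node.
    const entry = schema.versions.find(v =>
      schema.codecs[v.codec.name].matches(cid, dagNode))
    let data = schema.codecs[entry.codec.name].fromDagNode(dagNode)
    // Apply up() transforms until the data is in the latest version.
    const latest = schema.versions[schema.versions.length - 1]
    for (let v = entry.version + 1; v <= latest.version; v++) {
      data = schema.versions[v].up(data)
    }
    return data
  },
  write (schema, data) {
    const latest = schema.versions[schema.versions.length - 1]
    const codec = schema.codecs[latest.codec.name]
    return codec.toDagNode(data, latest.codec.ipldLinks)
  }
}
```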

Public API

Changes to the public API are not as problematic as changes to the data-structures.

Backwards compatibility normally comes at the cost of code complexity. Having said that, choosing to maintain backwards compatibility is a per-situation decision.

Nevertheless, a breaking change should always translate to a new major version of the module. Moreover, all changes (fixes, features, breaking changes) should be easily visible to users of ipfs-log. This is usually made possible via a changelog, which can be automated with the right tools. I propose the following:

Let me know your thoughts!