kwilteam / kwil-db

Kwil DB, the database for web3
https://www.kwil.com/

Goal / Decision: Supporting payloads from Kwil v0.7 to Kwil v0.8 #670

Closed: KwilLuke closed this issue 5 months ago

KwilLuke commented 6 months ago

From our conversation this morning, I am creating this issue here so we can keep tracking it.

We need to decide if we are going to focus on making Kwil v0.8 "directly upgradable" from Kwil v0.7.

Right now, the main incompatibility is that the Schema struct has changed between versions. This affects RLP deserialization, which means that v0.7 schemas will not be deserializable by v0.8.
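To make that concrete, here is a minimal sketch of why an RLP-encoded struct stops decoding once its field set changes. It assumes an RLP codec along the lines of the go-ethereum rlp package; the struct names and fields below are illustrative stand-ins, not the actual kwil-db types.

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/rlp"
)

// Illustrative stand-ins only; the real transactions.Schema has different fields.
type schemaV07 struct {
	Name  string
	Owner []byte
	Notes string // present in the old shape, dropped in the new one
}

type schemaV08 struct {
	Name  string
	Owner []byte
}

func main() {
	// Encode with the old shape, as a v0.7 node would have done for a deploy tx.
	encoded, err := rlp.EncodeToBytes(schemaV07{Name: "db", Owner: []byte{0x01}, Notes: "x"})
	if err != nil {
		panic(err)
	}

	// Decoding into the new shape fails: RLP serializes a struct as a positional
	// list of its fields, so the element count no longer matches.
	var s schemaV08
	fmt.Println(rlp.DecodeBytes(encoded, &s))
	// prints something like: rlp: input list has too many elements for main.schemaV08
}
```

Because the encoding is positional, any field that is added, removed, or reordered changes the wire format, which is why old payloads become unreadable without some kind of version switch.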

What we need to decide:

  1. Should the v0.8 schema struct be a "version 2" (and the v0.7 schema be "version 1"), with Kwil v0.8 supporting both version 1 and version 2?
  2. Or should we just say that in v0.8, you are expected to start a network from scratch?

Creating this issue here so we can discuss and decide if we should close.
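For illustration, option 1 would roughly mean tagging the serialized schema with a version and keeping both struct definitions around. A sketch under that assumption (package, type, and field names here are hypothetical, not the actual kwil-db definitions):

```go
package payloads

import (
	"errors"
	"fmt"

	"github.com/ethereum/go-ethereum/rlp"
)

// Hypothetical versioned schema shapes; the real transactions.Schema has many more fields.
type SchemaV1 struct {
	Name   string
	Tables []string
}

type SchemaV2 struct {
	Name       string
	Tables     []string
	Procedures []string // field that only exists in the v0.8 shape
}

const (
	schemaVersion1 uint16 = 1 // Kwil v0.7 encoding
	schemaVersion2 uint16 = 2 // Kwil v0.8 encoding
)

// DecodeSchema selects the struct shape from an explicit version tag, so a
// v0.8 node could still replay deploy transactions produced by v0.7.
func DecodeSchema(version uint16, data []byte) (any, error) {
	switch version {
	case schemaVersion1:
		var s SchemaV1
		if err := rlp.DecodeBytes(data, &s); err != nil {
			return nil, fmt.Errorf("decode v1 schema: %w", err)
		}
		return &s, nil
	case schemaVersion2:
		var s SchemaV2
		if err := rlp.DecodeBytes(data, &s); err != nil {
			return nil, fmt.Errorf("decode v2 schema: %w", err)
		}
		return &s, nil
	default:
		return nil, errors.New("unsupported schema payload version")
	}
}
```

The cost is that the v1 shape, and the execution semantics that go with it, would have to be maintained for as long as old deploy transactions can show up during replay.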

jchappelow commented 6 months ago

> Right now, the main incompatibility is that the Schema struct has changed between versions. This affects RLP deserialization, which means that v0.7 schemas will not be deserializable by v0.8.

I think it goes beyond serialization, unfortunately. Execution of a deploy tx has a different outcome, both in terms of state modification and how the engine handles it. Branched logic would need to exist in several places, I suspect. I'd have to look into it, but I'm not sure if/how the commit ID could be kept the same for older deployment txns if the resulting postgres schema for a dataset is any different. Maybe it would be the same if the engine were able to handle the old types.

KwilLuke commented 6 months ago

@jchappelow got it (I think). I guess the main thing we need to decide is whether it makes sense to even worry about this... and then we can assess the scope. @brennanjl any thoughts? Maybe we should confirm the Fractal/Truflation setup?

jchappelow commented 6 months ago

The burden of supporting old blocks is high in this case. Particularly since we don't yet have in place the hardfork system that is needed to deal with this smoothly, it's probably best to have 0.8 be incompatible with 0.7. This is not the way going forward, however. If we change our stance on this and decide to rely on migration tools, OK, but that's barely viable for a long-lived blockchain use case.

jchappelow commented 6 months ago

Just to confirm this is not just speculation: if you attempt to sync with the staging network, which is still compatible with 0.7, you get failed to execute transaction {"error": "rlp: input list has too many elements for transactions.ExtensionConfig, decoding into (transactions.Schema).Extensions[0].Initialization[0]"} on the first deploy, around block 41613:

2024-04-23T12:15:21.356-05:00   info    kwild.pg    pg/repl.go:244  Commit hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, seq 41612, LSN AC/BA60B908 (741861275912) delta 632
2024-04-23T12:15:21.36-05:00    info    kwild.cometbft  consensus/replay.go:495 Applying block  {"module": "consensus", "height": 41613}

2024-04-23T12:15:21.362-05:00   warn    kwild.abci  abci/abci.go:278    failed to execute transaction   {"error": "rlp: input list has too many elements for transactions.ExtensionConfig, decoding into (transactions.Schema).Extensions[0].Initialization[0]"}
2024-04-23T12:15:21.365-05:00   info    kwild.pg    pg/repl.go:244  Commit hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, seq 41613, LSN AC/BA60C060 (741861277792) delta 848
2024-04-23T12:15:21.369-05:00   info    kwild.cometbft  consensus/replay.go:495 Applying block  {"module": "consensus", "height": 41614}

2024-04-23T12:15:21.372-05:00   warn    kwild.abci  abci/abci.go:278    failed to execute transaction   {"error": "dataset not found"}
2024-04-23T12:15:21.372-05:00   warn    kwild.abci  abci/abci.go:278    failed to execute transaction   {"error": "rlp: input list has too many elements for transactions.ExtensionConfig, decoding into (transactions.Schema).Extensions[0].Initialization[0]"}
2024-04-23T12:15:21.374-05:00   info    kwild.pg    pg/repl.go:244  Commit hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, seq 41614, LSN AC/BA60C668 (741861279336) delta 832
2024-04-23T12:15:21.379-05:00   info    kwild.cometbft  consensus/replay.go:495 Applying block  {"module": "consensus", "height": 41615}
2024-04-23T12:15:21.38-05:00    info    kwild   server/build.go:295 closing signing store
2024-04-23T12:15:21.38-05:00    info    kwild.private-validator-signature-store badger/db.go:70 closing KV store
2024-04-23T12:15:21.38-05:00    info    kwild.private-validator-signature-store badger/db.go:233    Lifetime L0 stalled for: 0s

2024-04-23T12:15:21.381-05:00   info    kwild.private-validator-signature-store badger/db.go:233    
Level 0 [ ]: NumTables: 00. Size: 0 B of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 64 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level Done

2024-04-23T12:15:21.392-05:00   info    kwild   server/build.go:295 closing event store
2024-04-23T12:15:21.392-05:00   info    kwild   server/build.go:295 closing main DB
Error: panic while building kwild: block.AppHash does not match AppHash after replay. Got 55A82A2668E79A7257FE8B1274F2209FAF91C43289E11350F39AF69C60E50986, expected 24C8C90BC018E1F8C6C51868FAA3287A09244C064BB9205543D57E2448F3BAEB.

Block: Block{
  Header{
    Version:        {11 0}
    ChainID:        kwil-chain-9
    Height:         41615
    Time:           2024-03-14 16:11:26.451353814 +0000 UTC
    LastBlockID:    8FBCFCC3E506E234527604ECEDCD5FF664C5855B8C3FF5A026C6EC70AFC71A77:1:AE0B3A4291B8
    LastCommit:     CDD036730934458F4D353FCF403B062A9FFEEDA759235A8FB87549DA3804D805
    Data:           DEEA1032B13555DE234761E84514589C2C060CB0566A59C9DB8BAE46F85AC200
    Validators:     70ED3323943534C8D1DD8359DCE19880CFBA1563AA3801867A7FA81E2230BFB7
    NextValidators: 70ED3323943534C8D1DD8359DCE19880CFBA1563AA3801867A7FA81E2230BFB7
    App:            24C8C90BC018E1F8C6C51868FAA3287A09244C064BB9205543D57E2448F3BAEB
    Consensus:      7860905924B013DCAFC4CC660E4BAE732F4923F113F5C925D56D54AF5EB2AC6F
    Results:        0A01D2CD62B6525BB0298A20605C4F50A0C2C22F892BC343F6F429C459B24E2F
    Evidence:       E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
    Proposer:       643AF2A0462CEA175047C4B8A18817DD9429BB0B
  }#83C40B66D4620E329D26B273E43204EDAE55B7BAB14C18F90E0B0EDB4C7D2646
  Data{
    C59F968497BA618D203A40DF21CDA7953B37108E0B7F65AEBD77894264E4828D (215 bytes)
    2CBEAB6DB0B5152D4B47E5DD478FC1475E192875BBE766E921E98ED5AABAB069 (9430 bytes)
    52974FC64292F0B48E8F46AE848341C32F895367C2B4A1E689A437049B6A317F (9061 bytes)
    6D68CF9A08820BE82151CC222F7220249E0B169759BD0AC045F90C7D6CF4953A (7784 bytes)
  }#DEEA1032B13555DE234761E84514589C2C060CB0566A59C9DB8BAE46F85AC200
  EvidenceData{

  }#E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
  Commit{
    Height:     41614
    Round:      0
    BlockID:    8FBCFCC3E506E234527604ECEDCD5FF664C5855B8C3FF5A026C6EC70AFC71A77:1:AE0B3A4291B8
    Signatures:
      CommitSig{131D794D126B by 643AF2A0462C on 2 @ 2024-03-14T16:11:26.451353814Z}
  }#CDD036730934458F4D353FCF403B062A9FFEEDA759235A8FB87549DA3804D805
}#83C40B66D4620E329D26B273E43204EDAE55B7BAB14C18F90E0B0EDB4C7D2646

Just from execution of existing transactions. So this is where you would have a coordinated hardfork: nodes would always know whether to execute with rules v0 or rules v1.
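A minimal sketch of what that height-gated selection could look like (illustrative only, not existing kwild code):

```go
package consensus

// ForkHeights would come from genesis or node config so that every node
// flips rules at exactly the same block. Illustrative names only.
type ForkHeights struct {
	SchemaV2Activation int64 // height at which the v0.8 deploy rules take effect
}

// RulesVersion reports which deploy-execution rules apply at a given height,
// so replaying old blocks deterministically uses the rules that were in force
// when those blocks were produced.
func (f ForkHeights) RulesVersion(height int64) int {
	if height < f.SchemaV2Activation {
		return 0 // v0.7 behaviour
	}
	return 1 // v0.8 behaviour
}
```

The key property is that the choice depends only on data every node agrees on (the block height and the configured activation height), never on a node's local binary version or wall-clock time.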

brennanjl commented 6 months ago

Still spending more time on it, but yeah, this really is a pretty tough dilemma. A few things I'm sort of spinning on:

  1. Not having seamless / compatible upgrades does really suck for users, and will likely not be an option when multiple validators are running.
  2. We think of Kwil as a blockchain much more than our users do. Our users primarily think of it as a database, and savvy users know that it just so happens to be running a blockchain. I'm not sure if this is an important consideration, but it's worth at least acknowledging. Basic product requirements commonly expected of blockchains are not necessarily held by our users (public/private data is a good example here). I'm still not sure how this plays into upgradeability and forks; it's simply a difference I've noticed in how our team discusses problems versus discussions I have with Julio, Paulo, Ryan, Raffael, etc.
  3. The tech debt incurred from changes this large (such as the engine additions) is pretty massive, and will likely be grandfathered into Kwil. I don't really see how we could ever get rid of it. This will have a serious negative compounding effect on future development speed.
  4. I think it's naive to think that this will be the last time we have large changes. We are still hunting for PMF, so we will still be pivoting and adjusting our core value prop.

I'm going to spend more time today on this, but this is just what I'm trying to keep in mind.

jchappelow commented 6 months ago

> 2. We think of Kwil as a blockchain much more than our users do. Our users primarily think of it as a database, and savvy users know that it just so happens to be running a blockchain. I'm not sure if this is an important consideration, but it's worth at least acknowledging. Basic product requirements commonly expected of blockchains are not necessarily held by our users (public/private data is a good example here). I'm still not sure how this plays into upgradeability and forks; it's simply a difference I've noticed in how our team discusses problems versus discussions I have with Julio, Paulo, Ryan, Raffael, etc.

For the sake of argument: while applications or operators may be ambivalent about or even ignorant of the blockchain aspect, we are burdened with the qualities of a blockchain regardless. Namely, it makes upgrading difficult or impossible unless we jump through some serious hoops to enable it. We want networks to update, and presumably those networks would like to have the new feature set, so we should clearly keep the barrier low.

Say we make changes to support schema migration or other major governance-based improvements, and maybe that involves new transaction types, serialization changes, or just different logic somewhere. Ideally that would not require resetting a network to genesis, particularly since defining or distributing genesis data (or network migration tools) is neither ready nor straightforward.

IMO, the sooner we can shift our development paradigm to supporting indefinite network life, the better, but I fully agree this is a ton of baggage in the case of the introduction of procedures. The baggage is especially hard to justify maintaining given that 0.7 was already breaking with the introduction of PostgreSQL and that it has not been widely deployed yet.

Anyway, it sounds like we need to revive the network migration task. Either the snapshot work can be the basis for creating the genesis data, or some other tools can be developed to do it via txns. I'm not sure what alternative txn-based tools could emulate the rebuild with transactions unless the deployed schemas were guaranteed to have a simple insert method.

brennanjl commented 6 months ago

Have you guys taken a look at Cosmovisor? https://docs.cosmos.network/main/build/tooling/cosmovisor

It seems like Cosmos's way of handling this is literally running a new binary and switching to it at a specific height. This seems sort of messy, since you cannot sync state without running Cosmovisor, but it's interesting nonetheless.
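For reference, the general pattern is a thin supervisor that runs the current binary until the agreed upgrade height and then hands off to a new one. The sketch below is not Cosmovisor itself; the binary paths and the halt-at-height flag are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runUntilUpgrade runs the current node binary, expecting it to halt itself at
// the agreed upgrade height (via a hypothetical --halt-at-height flag), then
// starts the new binary against the same data directory.
func runUntilUpgrade(currentBin, nextBin string, upgradeHeight int64) error {
	cur := exec.Command(currentBin, fmt.Sprintf("--halt-at-height=%d", upgradeHeight))
	cur.Stdout, cur.Stderr = os.Stdout, os.Stderr
	if err := cur.Run(); err != nil {
		return fmt.Errorf("current binary exited with error: %w", err)
	}

	// Hand off to the new binary, which continues from the same state.
	next := exec.Command(nextBin)
	next.Stdout, next.Stderr = os.Stdout, os.Stderr
	return next.Run()
}

func main() {
	if err := runUntilUpgrade("./kwild-old", "./kwild-new", 100000); err != nil {
		os.Exit(1)
	}
}
```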

brennanjl commented 6 months ago

> IMO, the sooner we can shift our development paradigm to supporting indefinite network life, the better

This is still very much not a fully formed thought, but I'm curious whether indefinite network life is something our users would care about deeply at all. Obviously making them manually resubmit all transactions on upgrade is not an option, but if there were a compromise that amounted to a literal network reset, yet happened automatically for the user, I think this could potentially be something that is "not ok for a blockchain, but ok for a database".

Still not something I am even fully sold on, just a thought.

jchappelow commented 6 months ago

I think we are committed to genesis data, a.k.a. network migrations, for v0.8. We'll break consensus more freely between releases so we can make progress without creating hoops and baggage.

However, we will put in place the machinery to implement coordinated changes to consensus rules so that we are free to make fixes that would otherwise break a network if the rule change did not take place at a specific height. The main purpose of that is patch releases that need to break consensus, but if we happen to avoid large breaking changes between "major" releases (for us, like v0.8 -> v0.9) we can use it there too.

KwilLuke commented 5 months ago

Closed in #782