ssbc / ssb-db2

A new database for secure-scuttlebutt

Broken seq.index, keys index, and duplicate records on log #291

Closed staltz closed 2 years ago

staltz commented 2 years ago

This issue has the same symptoms as https://github.com/ssb-ngi-pointer/jitdb/issues/161, and seems to be happening only on iOS; I haven't gotten bug reports for Android yet.

I got the db2 folder from Dave who has this problem and ran some scripts to study the data. Mostly, I'm concerned with the keys leveldb index, because when retrieving a message by a certain key, we get the wrong key.

For all records on the log, I printed out:

seq | offset | key in the keys leveldb index | key in the msg, if different
0 | 0 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1 | 526 | %LqzwwHg3Lt+uU4S3DXdE8jkBfvma/1Yin1mrDzPF4V8=.sha256
2 | 995 | %V3/jZdpj1CsyVI3QH/rajkDahH1R30u1vf05OUA8h8U=.sha256
3 | 1464 | %Ctc7zcaT0N4Myvm3/+VEPItu1/1x0SJHysrCoJQzmUA=.sha256
4 | 2297 | %GNNx+fj9DmdMlazwqh439K+j0G1ovAGr4stelJj4EkM=.sha256
5 | 2766 | %PjwtwmxcVOyAZ399Z1Qy4og+z+sl8/P+Lfq3D/zV77w=.sha256
6 | 3250 | %p3aECWInrsJZdSO7Sp7CQR3a29MpIuVrAxwXG2tSTl0=.sha256
7 | 3736 | %m/OBdukgm7bDwQaT1P2aOHFXoLo5yzW7bfqy7JfrZeE=.sha256
...
1380993 | 992019409 | %bXb5tFQpLYK0kFOiOqhZRoJZoFsTBRLCWbCpIWNfUvY=.sha256
1380994 | 992019902 | %Fe9zF4gDp0a0EgvCSsaWmTgg1eHMKh5mKeDBnFr57LQ=.sha256
1380995 | 992020386 | %gp8jQXiLBKrUMuFWFjizZDUAD6FdSC694JbvPk8og2I=.sha256
1380996 | 992020870 | %7O1RVgAC1DGTieWj3dPY6ECyweQNV3XGsP0d3fvO7sM=.sha256
1380997 | 992021367 | %fsuepooJoYmmY5wrz8tsiUI6LKbAODDI39Fdcp7feGc=.sha256
1380998 | 992021851 | %RI616dra+5k4fCAL7OO6HcppXvejgh99xMc45p1qZEQ=.sha256
1380999 | 992022335 | %Cau8bVqsaWC98bwwYV7/dn4FNJ8KjFEj2PFI7D8kVHA=.sha256
1381002 | 992023787 | %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256
1381003 | 992024723 | %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256
1381004 | 992025198 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256
1381005 | 992025940 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256
1381006 | 992026945 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256
1381007 | 992027429 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256 | %EcegYJ7ukIvt+6KMmhr3XGEhPIhrV+5ZyEhLPxJEsg4=.sha256
1381008 | 992027922 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256 | %UtpB04gXRGFiC7tim0FSGCnggAuCy3wDgQ1CM+Pq4N8=.sha256
...
1384614 | 994561122 | %M0d0QBvh5wd+4ogtz4cN2Whgi2mxc1pdMb+ybjxnRJM=.sha256 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256
1384615 | 994561591 | %tSWBFyjCin8d9FviNrRiOAUJGPWABcLbgvheUnoqhvk=.sha256 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256
1384616 | 994562060 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256
1384617 | 994562544 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256
1384618 | 0 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1384619 | 0 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256

If you look closely, there seems to be an off-by-two gap. Notice these rows (and that seqs 1381000 and 1381001 are missing):

seq | offset | actualKey | expectedKey
1380998 | 992021851 | %RI61
1380999 | 992022335 | %Cau8
1381002 | 992023787 | %XFqG | %uBNh
1381003 | 992024723 | %cBQm | %CuDp
1381004 | 992025198 | %uBNh | %NieK
1381005 | 992025940 | %CuDp | %M90y

Also, if I simply load some index files, the metadata says (notice the Entries):

Loaded seq.index with:
  Entries: 1384618
  Offset: 994562544

Loaded keys leveldb with:
  Version: 1
  Entries: 1384620
  Offset: 994562544
arj03 commented 2 years ago

If one deletes the level indexes and rebuilds, they also end up at 1384618 entries. And there seems to be corruption in the jitdb indexes as well. I'll do some more testing.

arj03 commented 2 years ago

This got me thinking. What version of Node is iOS running? I'm wondering if it might have something to do with the M1 chip. It seems like Node 16 was the first official release to support Apple silicon.

staltz commented 2 years ago

Node.js Mobile (on both Android and iOS) is 12.19.0. iPhones don't have (I think) M1 chips; those are only in MacBooks, iMacs, and the iPad Pro.

arj03 commented 2 years ago

Ah right, it is the A15 in the latest iPhone.

staltz commented 2 years ago

Since you rebuilt the indexes, could you tell me what you see for the records at seqs 1381000, 1381001, and 1381002? I'd like to know what those 3 rows should have been. And I'm curious which message has the key %XFqG...

staltz commented 2 years ago

Idea: even if we don't discover the culprit, we can still build a system to discover inconsistencies and automatically heal them. Like, if the number of entries in leveldb doesn't match the core indexes in jitdb, then rebuild parts of the leveldb index. (Of course, not fixing the actual culprit will then make it harder for us to debug the culprit, and this issue could linger on forever). But just an idea.
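
The check could be roughly like this; just a sketch, assuming the expected entry count comes from jitdb (e.g. seq.index), that the leveldb folder holds one META record besides the data entries, and that deleting the folder is enough to trigger a rebuild on the next start:

const Level = require('level')
const rimraf = require('rimraf')

// Count the records in a leveldb index folder and, if the count disagrees with
// what jitdb reports, delete the folder so ssb-db2 rebuilds it from the log.
function healIfInconsistent(levelPath, expectedEntries, cb) {
  const db = Level(levelPath)
  let count = 0
  db.createKeyStream()
    .on('data', () => count++)
    .on('error', cb)
    .on('end', () => {
      db.close(() => {
        const entries = count - 1 // assuming exactly one META record is stored
        if (entries === expectedEntries) return cb(null, false)
        rimraf(levelPath, (err) => cb(err, true)) // true = the index was reset
      })
    })
}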

arj03 commented 2 years ago

Those 3 are: %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256, %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256, %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256.

One thing I noticed is that if I get the messages for my profile, I get only 1 message using the corrupted indexes, while after clearing the whole thing I get > 8000. This is where I discovered that multiple indexes are corrupt. One thing I find strange is that we now have CRC checks, meaning that this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks; do you remember if we did a reindex after deploying those?

What I also find really strange is that both the level and the jitdb indexes seem to be corrupt.

staltz commented 2 years ago

Oh, which jitdb indexes did you detect were corrupted?

arj03 commented 2 years ago

I tried running a query for my post messages. I haven't gotten exactly to the bottom of it yet, but at least the author + post and one other index were corrupt.

staltz commented 2 years ago

One thing I find strange is that we now have CRC checks, meaning that this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks; do you remember if we did a reindex after deploying those?

Here goes some detective work:

The last jitdb without CRC was jitdb@3.0.3 and ssb-db2@2.0.2.

In Manyverse, we forced the rebuild of all leveldb indexes (not jitdb indexes) recently:

https://github.com/staltz/manyverse/blob/215b42f4c9931400ccbc3130834f956004f2471d/src/backend/ssb.ts#L35-L38

But we forced the rebuild of all jitdb indexes (with jitdb@3.1.3 and ssb-db2@2.1.2) for Android only, with this commit, because we thought (see https://github.com/ssb-ngi-pointer/jitdb/issues/149#issuecomment-819011645) that this issue afflicted only Android with ext4:

https://github.com/staltz/manyverse/commit/82a65c58fc078622a4b024add405b620c7895a3c

Note, CRC didn't work on iOS, so I made this patch to jitdb and finally jitdb@3.1.4 supported CRC on iOS.

And then I updated the "rebuild all jitdb indexes" logic in Manyverse to also include iOS:

https://github.com/staltz/manyverse/commit/7814647e75e3c569eaad97cc8091215579671735

And I also updated jitdb@3.1.4:

https://github.com/staltz/manyverse/commit/f033099efd07ccd26ca58979bcddc4048ffa3d56

Both of the commits above were shipped into Manyverse version 0.2105.11, so I believe that iOS users had all of their indexes rebuilt and CRC-checked, but maybe something sneaky happened that I can't see?

arj03 commented 2 years ago

That is interesting, because I checked the seq numbers of my post messages in Dave's log and I can see that they go way back (they start at seq 8017). This means that they should have been reindexed last month when you made that release. I tried dumping the raw author index and can see that it is vastly different from my reindex, even at the first seq. And this is a prefix index, so once a seq has been reindexed, it should never change. The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

staltz commented 2 years ago

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

davegomez commented 2 years ago

For what it's worth (it looks like it might be), I upgraded my phone to the latest model last month.

iOS automatically copied the data content of the apps during the migration, so I didn’t have to install and sync Manyverse again.

I can confirm I noticed the issue after the migration to the new phone.

staltz commented 2 years ago

Very interesting! Thanks!

Just for more context, though, there is another user who is experiencing the same symptoms, so I think it's a somewhat more general issue; maybe they didn't change their phone.

arj03 commented 2 years ago

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

Oh, I misread the version numbers. I thought 11 meant the month :)

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

That shouldn't happen and the folder should have been deleted before anyway.

staltz commented 2 years ago

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first. Because it seems that once one is wrong, it could infect the other indexes with wrong data.

staltz commented 2 years ago

Can it be that the problem is that canDecrypt is not CRC-checked?

arj03 commented 2 years ago

Sorry for the deleted message, I spoke too early; I was running on my own profile and needed to check if it is also a problem with another one. So I tested cblgh and I see the same problem. I get the correct number of results by removing the author + type post + canDecrypt indexes. In other words, all 3 of them are corrupted.

staltz commented 2 years ago

Can it be that the problem is that canDecrypt is not CRC-checked?

Still seems like we could do this, right? canDecrypt.index and encrypted.index are currently not CRC'd.
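
Something along these lines could work; just a sketch using the crc-32 package and a made-up 4-byte header, not jitdb's actual on-disk format or CRC routine:

const fs = require('fs')
const CRC32 = require('crc-32')

// Write the index body prefixed with its CRC32 (hypothetical layout: 4-byte
// signed little-endian checksum, then the body).
function writeChecked(filename, body) {
  const header = Buffer.alloc(4)
  header.writeInt32LE(CRC32.buf(body), 0)
  fs.writeFileSync(filename, Buffer.concat([header, body]))
}

// Read it back and refuse the file if the stored CRC does not match the body.
function readChecked(filename) {
  const data = fs.readFileSync(filename)
  const stored = data.readInt32LE(0)
  const body = data.slice(4)
  if (CRC32.buf(body) !== stored) throw new Error('CRC mismatch, index needs a rebuild')
  return body
}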

arj03 commented 2 years ago

Sure that would be a good idea.

The corruption is really strange as it seems to happen to more or less everything as far as I can see.

staltz commented 2 years ago

The corruption is really strange as it seems to happen to more or less everything as far as I can see.

Can you detect whether a corruption is due to "bad data" coming from another corrupted index, or whether all of the corruptions are "original corruptions"? It would be easier for us if we detected that just canDecrypt.index and encrypted.index were corrupted and then all of the other indexes got infected with that.

But it could surely be a general filesystem-caused corruption.

staltz commented 2 years ago

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first.

Hmm, I can't do this unless I have Dave's secret keys (in order to create canDecrypt.index). I guess I could ask Dave to run some scripts using his secret key.

arj03 commented 2 years ago

After further testing I'm not so sure that canDecrypt is actually corrupt. Let me do some more testing on the author + post that is for sure corrupt.

staltz commented 2 years ago

Yeah, it's actually hard to know if canDecrypt is corrupted unless we get Dave to rebuild it on desktop.

staltz commented 2 years ago

PS: Dave is gonna run this https://github.com/staltz/ssb-db2-issue-291

arj03 commented 2 years ago

Oh my. It turns out we might have to be really careful with bipf when upgrading. I was testing with the latest bipf + ssb-db2. Pinning db2 to 2.3.3 and bipf to 1.5.1 seems to produce the correct results for author + post. So at least that is good news. Maybe the problem is only leveldb then. It will be interesting to see what you find running the script you linked.

arj03 commented 2 years ago

Seeing as I'm really slow today 😩 can you tell me when this fix: https://github.com/ssb-ngi-pointer/ssb-db2/commit/557e8821f9e3483398c882dcbc735b10a9b477a0 was put into manyverse and the level folders reindexed?

It seems like 2.3.2 and 2.3.3 are not tagged in ssb-db2's git?

staltz commented 2 years ago

Updated to ssb-db2 2.3.2 in https://github.com/staltz/manyverse/commit/ebe4bf32675759aba0687e264b3f557e15444c04 and released as Manyverse 0.2110.22

staltz commented 2 years ago

Updated to ssb-db2 2.3.3 in https://github.com/staltz/manyverse/commit/41bb9ea24ad5884b61ecd1bf01f869235d621ca1 and released as Manyverse 0.2111.11

arj03 commented 2 years ago

We made so many changes, I almost forgot half of them... ;)

The reason I was so confused is that the author index change for bendy butt bumps the version to 2, but that is only taken into consideration when updating the index ;-) So nothing related to bipf.

Stepping back, I don't think this has anything to do with index corruption. It smells more like a bug similar to the level bug. Maybe related to pausing in push-stream; that is just a guess, because I noticed there is also 1 message missing from the type index: sequence 1382999. So the following query will give you the message:

const { seqs } = require('jitdb/operators')
const { where, and, toCallback } = require('ssb-db2/operators') // where/and/toCallback come from ssb-db2's operators

ssb.db.query(
  where(
    and(
      seqs([1382999]),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)

while this will not, until you rebuild:

const { seqs } = require('jitdb/operators')
const { where, and, type, toCallback } = require('ssb-db2/operators')

ssb.db.query(
  where(
    and(
      seqs([1382999]),
      type('post'),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)

staltz commented 2 years ago

Good, we're getting closer.

I don't think this has anything to do with index corruption. It smells more like a bug similar to the level bug. Maybe related to pausing in push-stream; that is just a guess, because I noticed there is also 1 message missing from the type index.

That's my intuition too. An off-by-one (or off-by-two) error seems like a logic bug somewhere.

staltz commented 2 years ago

Work-in-progress analysis of some indexes (works as note-to-self too):

status is about the diff between rebuilt indexes and bad indexes.

index | status | comment
seq.index | ✔️ | good, except CRC mismatch due to a different amount of trailing zero bytes
sequence.index | ✔️ | good, except CRC mismatch due to a different amount of trailing zero bytes
timestamp.index | ✔️ | good, except CRC mismatch due to a different amount of trailing zero bytes
canDecrypt.index | ✔️
encrypted.index | 🚧 | wasn't rebuilt fully (just a few entries rebuilt)
meta_.index | ❌ | already the first bytes in the tarr body differ, but this may be due to me building this index with a random ssb-key that cannot decrypt messages, so it never finds the meta.private section
value_author.32prefix | ❌ | rebuilt has count=1384618 and bad has count=1384111

staltz commented 2 years ago

If you look only at the keys leveldb index, then it's clear that the seq got ahead of what it should have been. I've been thinking about how it could have gotten ahead, and that's only possible if this.processRecord(record, processedSeq) below

    this.onRecord = function onRecord(record, isLive) {
      let changes = 0
      if (record.offset > this.offset.value) {
        if (record.value) this.processRecord(record, processedSeq)
        changes = this.batch.length
        processedSeq++
      }
      processedOffset = record.offset

      if (changes > chunkSize) this.flush(thenMaybeReportError)
      else if (isLive) liveFlush(thenMaybeReportError)
    }

was called with a processedSeq higher than it should have been. In what cases can processedSeq get too far ahead?

I think only prior to this PR https://github.com/ssb-ngi-pointer/ssb-db2/pull/269

   // 1st, persist the operations in the batch array
   this.level.batch(
     this.batch,
     { keyEncoding: this.keyEncoding, valueEncoding: this.valueEncoding },
     (err2) => {
       if (err2) return cb(err2)
       if (this.level.isClosed()) return cb()

       // 2nd, persist the META because it has its own valueEncoding
       this.level.put(
         META,
         { version, offset: processedOffsetAtFlush, processed: processedSeqAtFlush },
         { valueEncoding: 'json' },
         (err3) => {
           if (err3) cb(err3)
           else {
             this.offset.set(processedOffsetAtFlush)
             cb()
           }
         }
       )
     }
   )

In the code above, let's assume the app crashes after persisting the batch but before persisting the META. Then, next time it loads up, processedSeq will be behind what it should be. For it to be ahead, we would need that concurrency problem to exist.
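
One way to shrink that window would be to fold the META put into the same level.batch call, so the data and the META land atomically. Just a sketch, assuming levelup supports per-operation keyEncoding/valueEncoding, and not necessarily what that PR did:

// Sketch: persist the data operations and the META record in one atomic batch.
const ops = this.batch.map((op) => ({
  ...op,
  keyEncoding: this.keyEncoding,
  valueEncoding: this.valueEncoding,
}))
ops.push({
  type: 'put',
  key: META,
  value: { version, offset: processedOffsetAtFlush, processed: processedSeqAtFlush },
  valueEncoding: 'json',
})
this.level.batch(ops, (err) => {
  if (err) return cb(err)
  this.offset.set(processedOffsetAtFlush)
  cb()
})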

So this tells me that perhaps there is nothing new about this bug/issue; it's still the same concurrency issue that we patched. But perhaps the leveldb indexes weren't rebuilt on iOS; maybe this line didn't work on iOS (was it the /!(*.*) syntax?):

https://github.com/staltz/manyverse/commit/ebe4bf32675759aba0687e264b3f557e15444c04#diff-b206684c06304b6e2482cfd81d97e032c38995cd9ca57f69dd65293df57b405aR36

staltz commented 2 years ago

I don't think there is a problem with rimraf and /!(*.*) unless iOS is doing something super weird. I ran the following code (on my Linux desktop) and it deleted the leveldb folders as it should:

const path = require('path');
const rimraf = require('rimraf');

rimraf.sync(path.join(__dirname, 'db2', 'indexes') + '/!(*.*)');
arj03 commented 2 years ago

I also double checked it :) Can you check on iOS using a simulator?

This bug is really strange. Either it is something silly like the above, or it is something a lot more complex. I'll try to write up my thinking today.

arj03 commented 2 years ago

If it is the complex case then here are some of the things I have tried and my thinking.

Seeing that we have the problem in both the type index and the keys level index, it seems like it might be lower level than ssb-db2. And it is related to updating the indexes, meaning log.stream. Luckily there are not that many places where that happens.

JITDB:

SSB-DB2:

I tried a few things. I've been poking around in AAOL to see if I could see anything. Checked push-stream lightly.

I tried running a full reindex on the log and checking the sha1 sum of the author and type indexes as the base case. Then I ran a few reindexes, killing the process while they were running. Next I ran a few reindexes with too hot enabled. In all cases they ended up with the same sha sum, meaning they are 100% identical to the base case.

One thing that I thought might be a bit sneaky is if, for some reason, the same values are returned twice from the underlying level. Then we might get these seqs off by 1 or 2.
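
A cheap way to test that hypothesis could be to wrap the onRecord callback and log whenever the stream hands back an offset that was already seen or that goes backwards. A diagnostic sketch only; the helper name is made up:

// Hypothetical diagnostic wrapper around an onRecord-style callback.
function guardMonotonic(onRecord) {
  const seen = new Set()
  let lastOffset = -1
  return function (record, isLive) {
    if (seen.has(record.offset) || record.offset < lastOffset) {
      console.warn('suspicious offset from log.stream:', record.offset, 'last was', lastOffset)
    }
    seen.add(record.offset)
    lastOffset = record.offset
    return onRecord.call(this, record, isLive)
  }
}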

staltz commented 2 years ago

I ran the rimraf.sync check on iOS (iPad, not simulator), like this:

const fs = require('fs')
const path = require('path')
const mkdirp = require('mkdirp')
const rimraf = require('rimraf')

mkdirp.sync(path.join(process.env.SSB_DIR, 'foo', 'ax'))
mkdirp.sync(path.join(process.env.SSB_DIR, 'foo', 'bx'))
mkdirp.sync(path.join(process.env.SSB_DIR, 'foo', 'cx'))
console.log("BEFORE", fs.readdirSync(path.join(process.env.SSB_DIR, 'foo')))
rimraf.sync(path.join(process.env.SSB_DIR, 'foo') + '/!(*.*)')
console.log("AFTER", fs.readdirSync(path.join(process.env.SSB_DIR, 'foo')))

And the result was:

BEFORE [ 'ax', 'bx', 'cx' ]
AFTER []

So at least we can rule out the idea that iOS is doing rimraf.sync wrong.


One thing that I thought might be a bit sneaky is if, for some reason, the same values are returned twice from the underlying level.

What do you mean by "values"?

Also, it could be even lower-level, like a filesystem problem. Note that iOS has APFS.

arj03 commented 2 years ago

Like here: https://github.com/ssb-ngi-pointer/async-append-only-log/blob/abbd1887d3ef414d1c1f67f49c70244c50298ebb/stream.js#L96.

arj03 commented 2 years ago

One really needs to be careful with this debugging. I was wrong about type_post; I had run it with a problematic key. So type post is okay, and the same goes for type vote and the root index. I checked author again and, as you also identified, it appears to go wrong around seq 1384113. This means that a query for author @CxnSXWYjPT160y7QbmTFtWaWT09080azgErYPt1ZeZc=.ed25519 gives a different result with the corrupted database: 6071 versus 6095. Where did you get the count=1384111 from? I'm printing count from here.

staltz commented 2 years ago

Where did you get the count=1384111 from? I'm printing count from here.

I think I got it from here https://github.com/staltz/ssb-db2-issue-291/blob/main/study-prefix-map-diff.js

But one technique I've been using sometimes is

cat db2/indexes-bad/value_author.32prefix | xxd | head > bad.hex

to look at the file header and see the version | offset | count | crc | body bytes.
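
The same can be done in Node if xxd isn't around. A sketch that assumes 32-bit little-endian header fields, which may not match jitdb's real layout:

const fs = require('fs')

// Dump the first 16 bytes of an index file, raw and as u32 LE words
// (roughly version | offset | count | crc, per the layout above).
const buf = Buffer.alloc(16)
const fd = fs.openSync('db2/indexes-bad/value_author.32prefix', 'r')
fs.readSync(fd, buf, 0, 16, 0)
fs.closeSync(fd)

console.log('hex:', buf.toString('hex'))
console.log('u32 LE:', [0, 4, 8, 12].map((i) => buf.readUInt32LE(i)))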

staltz commented 2 years ago

Ok, I got a report from Powersource that his (Manyverse) desktop app is throwing some pranks too; it seems the db is corrupted there! I got the zip of his db2, I can share it with you later.

This would be interesting because it wouldn't be iOS-specific anymore.

arj03 commented 2 years ago

Jacob's keys leveldb has the same kind of bug. It is off by 38 instead of 2.

1021983 | 788241486 | %/DnmQa4xqgIn5qaKgOx8ioUrU8i+ybE+hBFhWCVAuq0=.sha256
1021984 | 788242170 | %5nd7p453VUbLtm/oiWTZC8GPzAKWDNGOxz9e0MWBcF0=.sha256
1021985 | 788242654 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256 | %D7CEf5EgPxX5/sWWgD2JREfpU3zShi85+eFsAhprwkc=.sha256
1021986 | 788243123 | %07shhpv4L6wEWpZxZdm0e5zU2QYo+o76iOgeIelSj9s=.sha256 | %LCC8qCZsZbqjKnQ0t0g+T59l15dD4wzDlMiPq+KBanc=.sha256
1021987 | 788244160 | %xc83mi/928R2MfwNP7DKLB/y5VMb2QZrlaojcrd5jpI=.sha256 | %UtAI9IUZn+MHKAYvdkc+7LU/FRXPJIGlklxkYTUtGo8=.sha256
1021988 | 788245071 | %6mFJx4VxZGswkf89liwNXAS1g0u/7CZ1qKxroremvMg=.sha256 | %j5hkP6mW4uvu0x6ujP1nY4Jdf2nHbyS7wb/KEK9pIJM=.sha256
1021989 | 788245555 | %fwRDVfejokKNk9b+P9mEBEazvIudeLXkyh48LMa1bX0=.sha256 | %+oGNg87aNM5SR404xhKFwRI8R3HcE8VqO1t4vcotc/A=.sha256

From the correct index:

1022023 | 788267008 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256

These are the 38 entries that are missing, compared to the total.

I have also noticed that his author index is correct, but seq.index is broken instead. The following 42 offsets are missing:

788242654
788243123
788244160
788245071
788245555
788246056
788246525
788247018
788247502
788247995
788248717
788249210
788249992
788251454
788252162
788252646
788253121
788253918
788254393
788254868
788255352
788255836
788256320
788256813
788258017
788258497
788258981
788259957
788260906
788261331
788262059
788262543
788263012
788263505
788264349
788264818
788265302
788265795

I have a theory about what might have caused the problem with seq.index. It just doesn't explain the author corruption or the level bugs. What is a bit puzzling about the corrupt author prefix index is that, if you look at the prefixes after the point where it becomes corrupted, they don't match the start of any known author.

staltz commented 2 years ago

Studying Jacob's base leveldb, I noticed there are only two wrong KVs:

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 793494573, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 793494080, sequence: 191}

Correct ones (they differ only by offset):

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 788258981, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 788265302, sequence: 191}

And the META for base is:

Bad base:

  Version: 2
  Entries: 1032716
  Offset: 796004667

Rebuilt base (differs only by entries):

  Version: 2
  Entries: 1032754
  Offset: 796004667

38 is the diff.
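
For reference, the comparison itself is just a walk over both copies of the base leveldb. A sketch assuming string keys and JSON values shaped like the dumps above, with illustrative paths:

const Level = require('level')

// Walk the rebuilt base index and report every author whose {offset, sequence}
// differs in the bad copy, or is missing from it.
const good = Level('db2-rebuilt/indexes/base', { valueEncoding: 'json' })
const bad = Level('db2-bad/indexes/base', { valueEncoding: 'json' })

good.createReadStream().on('data', ({ key, value }) => {
  bad.get(key, (err, badValue) => {
    if (err) return console.log('missing in bad index:', key)
    if (JSON.stringify(badValue) !== JSON.stringify(value)) {
      console.log('mismatch for', key, 'good:', value, 'bad:', badValue)
    }
  })
})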

arj03 commented 2 years ago

Hmmmmmmmmmmmm good find. Why does Jacob have the same message multiple times?

Check this (I exposed the getMsgByOffset method):

ssb.db.getMsgByOffset(788258981, (err, msg) => {
  console.log(788258981, msg)
})

ssb.db.getMsgByOffset(793494573, (err, msg) => {
  console.log(793494573, msg)
})

ssb.db.getMsgByOffset(788265302, (err, msg) => {
  console.log(788265302, msg)
})

ssb.db.getMsgByOffset(793494080, (err, msg) => {
  console.log(793494080, msg)
})

=>

788258981 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625571799
}
793494573 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625912978
}
793494080 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625912974
}
788265302 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625571896
}
staltz commented 2 years ago

Wait, this is literally just on the log, no indexes involved?

arj03 commented 2 years ago

Exactly

arj03 commented 2 years ago

I wonder what came first, the corrupt indexes or double messages 🤔

staltz commented 2 years ago

Well, one thing we know for sure is that seq.index has a saving problem. We could push that fix and reset all indexes, and then wait for more months to pass and see what happens.

arj03 commented 2 years ago

Agreed, it would be good to have that fix. I'm just still trying to understand what is going on here.

I found some duplicate messages in Dave's database as well (key, first offset, duplicate offset).

duplicate key %z/MDV4mT/XaiIibeP0C5vf+0tKLAcFxtMBDHLf5eu+E=.sha256 180348963 180749729
duplicate key %+hW4qOX4xuIO8QE9RTUAaHqL0bD28r4GyF/kYJLBz38=.sha256 180295907 180753120
duplicate key %/gzGlFUQBE3C6Kd1QfQSzZxG78IL4DmJssBIIdQCnlI=.sha256 180328614 180769814
duplicate key %NWG837v3J2qhjGmK9yEZvH7gZ34fzlEKfKjeuWXmRG8=.sha256 180371842 181260814
duplicate key %3Z/wvMZQlLWR+pSMn9yERhxwNU6jVrDQGqbvZV+sEZM=.sha256 180435314 245829508
duplicate key %rRUeWOTw1GQBHPJCwAsYSnuFCjRaLh/vafGZJnoA56o=.sha256 180472215 276944048

I checked all of Jacob's and there are only the 2 you already found.
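
In case someone wants to reproduce the scan, something along these lines should work. A sketch using ssb-db2's query API with toPullStream; it only reports the duplicate keys, since offsets aren't exposed this way:

const pull = require('pull-stream')
const { toPullStream } = require('ssb-db2/operators')

// Stream every message in log order and report keys that appear more than once.
const seen = new Set()
pull(
  ssb.db.query(toPullStream()),
  pull.drain(
    (msg) => {
      if (seen.has(msg.key)) console.log('duplicate key', msg.key)
      else seen.add(msg.key)
    },
    (err) => {
      if (err) console.error(err)
      else console.log('scan done,', seen.size, 'unique keys')
    }
  )
)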

arj03 commented 2 years ago

First of all, I have been really trying to break AAOL by treating it as a black box and throwing all kinds of random input at it. I haven't seen it break yet. So my hunch is that the problem must be somewhere else.

The good news is I think I have found something. Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their processedOffset, thus lowering it compared to processedSeq. If you then kill the app while some of them have updated, but NOT keys or base, then there will be a mismatch!