realm / realm-core

Core database component for the Realm Mobile Database SDKs
https://realm.io
Apache License 2.0

Decryption failed - page zero has wrong checksum #5810

Closed - BlueCobold closed this issue 2 years ago

BlueCobold commented 2 years ago

How frequently does the bug occur?

Seen once

Description

A customer reported that they were suddenly unable to launch my app. It terminates on the first access of the database, and it turns out that the database is broken for some reason.

It might have broken during a Realm migration, but this is uncertain. Newly created files work just fine. I might be allowed to share the db file with a developer for analysis in private, but not in public. I also tried to open it with Realm Studio and tried upgrading to Realm 10.25.1, but the file still cannot be decrypted.

Stacktrace & log output


libc++abi: terminating with uncaught exception of type realm::util::DecryptionFailed: Decryption failed
Exception backtrace:
0   Realm          0x000000010b0d349b _ZN5realm4util16DecryptionFailedC2Ev + 107
1   Realm          0x000000010b0b9987 _ZN5realm4util10AESCryptor4readEixPcm + 519
2   Realm          0x000000010b0ba63e _ZN5realm4util20EncryptedFileMapping12refresh_pageEm + 110
3   Realm          0x000000010b0bafee _ZN5realm4util20EncryptedFileMapping12read_barrierEPKvmPFmPKcE + 126
4   Realm          0x000000010ab8e250 _ZN5realm4util26do_encryption_read_barrierEPKvmPFmPKcEPNS0_20EncryptedFileMappingE + 64
5   Realm          0x000000010b0a1822 _ZN5realm11StringIndexC2EmPNS_11ArrayParentEmRKNS_13ClusterColumnERNS_9AllocatorE + 338
6   Realm          0x000000010b08a6b0 _ZN5realm5Table23refresh_index_accessorsEv + 608
7   Realm          0x000000010af533c7 _ZN5realm5Group21create_table_accessorEm + 871
8   Realm          0x000000010af53006 _ZN5realm5Group12do_get_tableEm + 102
9   Realm          0x000000010b1e6287 _ZN5realm12ObjectSchemaC2ERKNS_5GroupENS_10StringDataENS_8TableKeyE + 391
10  Realm          0x000000010b1f0194 _ZN5realm11ObjectStore17schema_from_groupERKNS_5GroupE + 132
11  Realm          0x000000010b2594bb _ZN5realm5Realm32read_schema_from_group_if_neededEv + 187
12  Realm          0x000000010b259268 _ZN5realm5RealmC2ENS0_6ConfigENS_4util8OptionalINS_9VersionIDEEENSt3__110shared_ptrINS_5_impl16RealmCoordinatorEEENS0_13MakeSharedTagE + 456
13  Realm          0x000000010b1b7c2c _ZN5realm5Realm17make_shared_realmENS0_6ConfigENS_4util8OptionalINS_9VersionIDEEENSt3__110shared_ptrINS_5_impl16RealmCoordinatorEEE + 220
14  Realm          0x000000010b1b6294 _ZN5realm5_impl16RealmCoordinator12do_get_realmENS_5Realm6ConfigERNSt3__110shared_ptrIS2_EENS_4util8OptionalINS_9VersionIDEEERNS8_17CheckedUniqueLockE + 532
15  Realm          0x000000010b1b5eaf _ZN5realm5_impl16RealmCoordinator9get_realmENS_5Realm6ConfigENS_4util8OptionalINS_9VersionIDEEE + 495
16  Realm          0x000000010b259ce7 _ZN5realm5Realm16get_shared_realmENS0_6ConfigE + 135
17  Realm          0x000000010ae4d71a +[RLMRealm realmWithConfiguration:queue:error:] + 2314
18  RealmSwift     0x00000001085c3a72 $sSo8RLMRealmC13configuration5queueABSo0A13ConfigurationC_So012OS_dispatch_C0CSgtKcfCTO + 146
19  RealmSwift     0x000000010863fc2f $s10RealmSwift0A0V5queueACSo012OS_dispatch_C0CSg_tKcfC + 127

Can you reproduce the bug?

Yes, always

Reproduction Steps

The database file seems corrupted and cannot even be opened with Realm Studio. I cannot share the file publicly due to the user's privacy, but I might be able to send it to a developer in private.

Version

10.10.0 (also tried 10.25.1)

What SDK flavour are you using?

Local Database only

Are you using encryption?

Yes, using encryption

Platform OS and version(s)

iOS 15.4.0, 15.4.1, 15.2.0, 15.2.1

Build environment

ProductName: macOS ProductVersion: 12.0.1 BuildVersion: 21A559

/Applications/Xcode.app/Contents/Developer Xcode 13.3.1 Build version 13E500a

/usr/local/bin/pod 1.10.0 Realm (10.10.0) RealmSwift (10.10.0) RealmSwift (= 10.10.0)

/bin/bash GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21)

(not in use here)

/usr/local/bin/git git version 2.26.0

leemaguire commented 2 years ago

Hi @BlueCobold Can you send the Realm file to realm-help@mongodb.com so we can investigate? The latest version of Realm (10.25.1) contains a fix that should not let this happen again in the future.

BlueCobold commented 2 years ago

I submitted the file in question.

leemaguire commented 2 years ago

@jedelbo successfully recovered the Realm file, @BlueCobold I have sent it to you via email.

BlueCobold commented 2 years ago

Super awesome! The customer will be very happy and so am I. I'll upgrade all app versions out there to realm 10.25.1 and hope for the issue to never return. Thanks!

jedelbo commented 2 years ago

Forensic report: When trying to decrypt the received file, the following showed up:

Checksum failed: 0x90000 0x90000 expected: 0x93 actual: 0x92 Checksum failed: 0x91000 Checksum failed: 0x92000 Checksum failed: 0x93000

Checksum failed: 0xa0000 Checksum failed: 0xa1000 Checksum failed: 0xa2000 Checksum failed: 0xa3000

Checksum failed: 0xa8000

Checksum failed: 0x138000 Checksum failed: 0x139000 0x13900 expected: 0xc0 actual: 0xc3 Checksum failed: 0x13a000 Checksum failed: 0x13b000

Restore old IV: 0x18c000 Restore old IV: 0x18d000 Restore old IV: 0x18e000 Restore old IV: 0x18f000

Restore old IV: 0x198000 Restore old IV: 0x199000 Restore old IV: 0x19a000 Restore old IV: 0x19b000

Restore old IV: 0x1a0000 Restore old IV: 0x1a1000 Restore old IV: 0x1a2000 Restore old IV: 0x1a3000

Restore old IV: 0x1a8000 Restore old IV: 0x1a9000 Restore old IV: 0x1aa000 Restore old IV: 0x1ab000

Restore old IV: 0x1ac000 Restore old IV: 0x1ad000 Restore old IV: 0x1ae000 Restore old IV: 0x1af000

In spite of the checksum errors, the content seemed to be consistent except for the two cases where a byte value was not as expected. After changing those values back, the file was consistent.

jedelbo commented 2 years ago

@tgoyne does the fact that it is the first byte of a 4k block that is modified make us any wiser? And why does the checksum differ if the content apparently is ok?

tgoyne commented 2 years ago

Could possibly be an out-of-bounds write somewhere? The first byte of a buffer is the thing that'll be overwritten if some other piece of code has an off-by-one error when writing to something that happens to land immediately before that buffer in memory. The HMAC and the actual page data are stored in separate blocks of memory, so corrupting one but not the other wouldn't be hard to have happen.

If that is actually the problem, I'm not sure what action we can really take. Reread all the encryption code and hope to spot something suspicious that could be writing one past the end? I think the use of MAP_ANONYMOUS for the decrypted buffers unfortunately means that ASan doesn't work for them, and it might not even be a bug in our code.

BlueCobold commented 2 years ago

The issue has returned. Again, I have a customer with a database that cannot be decrypted. Since this is on Android, I don't have a proper native stack trace and can only assume it is related to the same incorrect checksum in the native code both SDKs are based on. I can provide the Realm file so you can check whether it's the same problem. The customer's app is using the latest Android Realm implementation, which, from what I understand, uses the same native code as Realm Swift 10.25.1. No migration was involved when the Realm file got corrupted.

jedelbo commented 2 years ago

@BlueCobold it would be nice if we had the possibility to check the Realm file to see whether the corruption is similar to the first one.

BlueCobold commented 2 years ago

The customer stopped replying and stopped using my app, so I'm afraid I cannot provide the file.

BlueCobold commented 2 years ago

@jedelbo I submitted another customer's realm file with the same symptoms to realm-help@mongodb.com for analysis.

BlueCobold commented 2 years ago

Using the decrypt tool in the exec directory, I'm getting the following output:

Checksum failed: 0x0
Block never written: 0x55e000
Block never written: 0x55f000

So it looks like the first block has issues. The resulting output file is unusable. I have no idea how to get the "actual" and "expected" values that @jedelbo printed in his report, or how to correct possibly faulty bytes to see if the rest of the file would be operational. My customer depends heavily on his data and currently can't access it.

The ticket bot also no longer seems to flag this bug report accordingly. @leemaguire

BlueCobold commented 2 years ago

In the meantime, I checked the decrypted content with a hex editor. Even the damaged first block contains readable strings and thus seems to have been decrypted correctly. I imagine there is some header metadata which is damaged and which makes the Realm Browser/library believe the file is still encrypted / unreadable. All blocks after the first seem to be valid; there are a lot of blocks with readable strings and UUID tables. I assume the file can be recovered, but I have not yet gathered enough understanding of the internal data structure to make that happen by myself.

BlueCobold commented 2 years ago

I have restored the header with a reference to the top_ref and table_names_ref, but it seems the data is partly scrambled. Some objects have invalid strings which crash Realm when trying to load these objects. Some have fields set to null which cannot be null (object UUIDs, for example), yet they seem to be ok if I only read this column/field in sequence for the entire table. I wonder: could this be the result of a parallel Realm access which performed an automatic compact-on-launch?

BlueCobold commented 2 years ago

In further, deeper data analysis, I realised that some Realm object keys are huge, like '3,402,167,040,181,607,100'. How come they grew so large? Is it possible there's an issue with keys and they overflow at some point or something? I'm still guessing at what could be the reason for badly written pages and wrongly aligned arrays.

jedelbo commented 2 years ago

@BlueCobold I have been away on holiday, and did not see this until now. I can see that you have sent another file for analysis, but I am not sure which key to use for decrypting.

BlueCobold commented 2 years ago

I thought so. I replied via email with the decryption key. Did you receive it?

jedelbo commented 2 years ago

To which email address should the key have been sent? I have not received anything.

BlueCobold commented 2 years ago

> To which email address should the key have been sent? I have not received anything.

Sorry, I thought there was a forward-reply feature for GitHub mails. Apparently not. I had sent the file and key to realm-help with my mail from 18.07., but I can send you another one, including some findings so far - including the partly restored file header.

jedelbo commented 2 years ago

Great. To be sure that I receive it, you can also send it to jorgen.edelbo@mongodb.com

BlueCobold commented 2 years ago

I sent it along with a few of my own findings. Thanks for your help.

jedelbo commented 2 years ago

I tried to decode the file, but up until 0x2a0 I see only something that looks like random bytes:

00000000  24 bd 76 a9 91 68 46 00  c8 8e 16 5c 07 75 51 00  |$.v..hF....\.uQ.|
00000010  b8 7d a0 47 0e e4 52 00  cc f9 d1 68 d3 f8 53 00  |.}.G..R....h..S.|
00000020  25 fa e5 41 5b 5f 55 00  ba d5 69 72 e5 60 58 00  |%..A[_U...ir.`X.|
00000030  51 86 e8 5a e8 9a 59 00  15 7c 65 32 91 92 5c 00  |Q..Z..Y..|e2..\.|
00000040  b7 72 f6 6d 43 7e 5f 00  af 65 50 ff 80 d3 5f 00  |.r.mC~_..eP..._.|
00000050  94 6f 97 53 a8 e8 5f 00  a9 ed 50 99 26 79 60 00  |.o.S.._...P.&y`.|
00000060  9a 28 ea 36 f5 71 62 00  cf 55 bf 31 07 ca 64 00  |.(.6.qb..U.1..d.|
00000070  d4 04 32 f9 c3 37 65 00  87 ec 01 5a cc fc 65 00  |..2..7e....Z..e.|
00000080  97 65 fa 62 3e af 67 00  bd 4b 71 af fb 24 6c 00  |.e.b>.g..Kq..$l.|
00000090  88 59 45 e9 f8 e5 6d 00  6a af fe 39 9c 2c 70 00  |.YE...m.j..9.,p.|

Does this match your findings?

BlueCobold commented 2 years ago

Yes, exactly my results as well. After that block, it seems to be mostly valid data. That's why I manually restored the header, as I wrote in the previous mail. As I said, it contains some corrupted data entries and references, but no more of this byte junk.

The "trash" at the beginning is not actual trash, though. Check the 00 every 8 bytes. I assume it's an array of 64-bit values. Maybe realm-object-keys. The same data can be found at another offset in the file. For example the entry "964C406A 5059B000" from offset 0x1D0 appears again at offset 0x991D0. Which... means they are exactly 0x99000 bytes apart in offset.

BlueCobold commented 2 years ago

The duplicated data originally starts at 0x98EC0, a valid array, and is then "duplicated" into the header, making the file unusable.

jedelbo commented 2 years ago

Those are great findings. I am a bit embarrassed that I did not spot the zeroes. I hope it can help us further with this issue. It is very common to have duplicated data. Whenever some part of an array is modified, a new version of the array is created by copying the whole array. I will try to see if I can find the "true" top ref.

BlueCobold commented 2 years ago

> It is very common to have duplicated data. Whenever some part of an array is modified, a new version of the array is created by copying the whole array.

Yea, I figured that much. It makes sense from a transaction perspective.

> I will try to see if I can find the "true" top ref.

That would be great.

Also, if you don't mind: I pointed out the very large object keys for many objects above. (A few objects have two-digit keys which seem to be auto-increment style, so the big ones make me wonder what's going on.) Is it normal for objects to have such large keys, or does that indicate a problematic way of using Realm? Can keys accidentally overflow, or does Realm detect free keys during object creation when the max value is reached?

BlueCobold commented 2 years ago

I found the following cluster tree, related to table #10, at offset 0x1192A0:

41414141 4700000C 40870800 00000000 00000000 00000000 D80C0000 00000000 15000000 00000000 A8481700 00000000 03000000 00000000 A1520000 00000000 38650200 00000000 90600200 00000000 6950CC0E 1FBFCF7A 00000000 00000000 01000408 05000000

It contains a lot of very suspicious refs like 03000000, 05000000 or 15000000. These refs would mean they point into the header bytes of the Realm file when they get written! This makes me worry a lot about data consistency.

jedelbo commented 2 years ago

What you have found here is the table top array. It contains both refs and numbers. If an entry has the LSB set (like 0x15), it is a number; you get the value by shifting it down one bit, so in this case it is 10, which matches table number 10.
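For illustration, here is a minimal sketch of that decoding rule, assuming 64-bit little-endian entries as they appear in the dumps; the function name is purely illustrative:

```swift
// Hedged sketch of the rule described above: entries with the LSB set are
// tagged integers (value = entry >> 1); entries with the LSB clear are refs,
// i.e. byte offsets of other arrays in the file.
func describeEntry(_ entry: UInt64) -> String {
    if entry & 1 == 1 {
        return "int \(entry >> 1)"              // e.g. 0x15 -> 10 (table number)
    } else {
        return String(format: "ref 0x%llx", entry)
    }
}

print(describeEntry(0x15))       // "int 10"
print(describeEntry(0x88740))    // "ref 0x88740"
```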

jedelbo commented 2 years ago

I am somewhat convinced that the first 24 bytes of the file should be

00000000  80 6c 51 00 00 00 00 00  f0 53 51 00 00 00 00 00  |.lQ......SQ.....|
00000010  54 2d 44 42 16 16 00 00                           |T-DB....|

making the top ref 0x516c80
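For reference, a minimal sketch of extracting the top ref from such a header. The layout used here is only a guess based on the bytes above, not an authoritative description: two 8-byte little-endian top refs, the 4-byte "T-DB" mnemonic, two file-format bytes, a reserved byte, and a flags byte whose lowest bit selects the active top-ref slot. It also assumes the file has already been decrypted.

```swift
import Foundation

// Hedged sketch: parse the first 24 bytes of a (decrypted) Realm file and
// return the currently selected top ref, under the layout assumption above.
func readTopRef(of fileURL: URL) throws -> UInt64 {
    let data = try Data(contentsOf: fileURL)
    precondition(data.count >= 24, "file too short to contain a header")

    // Read an 8-byte little-endian value at the given offset.
    func u64(at offset: Int) -> UInt64 {
        data[offset..<offset + 8].reversed().reduce(0) { $0 << 8 | UInt64($1) }
    }

    let topRefs = [u64(at: 0), u64(at: 8)]
    precondition(String(bytes: data[16..<20], encoding: .ascii) == "T-DB",
                 "missing the T-DB mnemonic")

    let selected = Int(data[23] & 1)   // flags byte: bit 0 picks the active slot
    return topRefs[selected]           // 0x516c80 for the bytes shown above
}
```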

jedelbo commented 2 years ago

I am pretty sure that the problem is that the first 0x1000 bytes have been overwritten with a page that should have been written somewhere else. Unfortunately a lot of refs point into this area, so recreating meaningful data in it would be a major puzzle.

BlueCobold commented 2 years ago

I am already trying to solve this puzzle by skipping invalid data. Table #10 seems to be heavily affected by it, but I could probably skip it. I "fixed" some other table entries already by detecting invalid string offsets and nulling invalid references into the first 0x1000 bytes. It still means losing a lot of data that cannot be restored. My main concern now is to prevent this from happening again in the future by all means, because by now it affects more than one customer - I only have access to this customer's file, though, because the others didn't report to me; I just received their crash reports and bad feedback in the App Store / Play Store. I don't know whether I could accidentally have caused this myself, but from a developer's perspective, using the API should never result in a corrupt file like this.

BlueCobold commented 2 years ago

Table #10 is set as: 41414141 4700000C 40870800 00000000 00000000 00000000 D80C0000. However, the ref 0xCD8 is broken due to the overwritten first page. But 0x088740 is also invalid, I think: it points to some structure, but it will never find a valid table-name ref, which is located at 0xAC790.

jedelbo commented 2 years ago

0x088740 seems to be ok. 0xAC790 (the column names) is linked from index 1. It will be hard to guess what the cluster that should be at 0xcd8 should look like.

BlueCobold commented 2 years ago

Oh, my bad.

I think I found the array which contains the object keys for table #10 at offset 0x11D8:

41414141 07000008 D87850E2 0EBD5900 4B494198 246F8312 AC98E903 046B281B 6B8647F6 DEA6CA21 9AE058AC B3B86125 F496A0FD B0240B32 275D4BC8 12EA8532 33286687 8FDF673D

And I think this could be the array containing the UUID column values for table #10 at offset 0x787e8:

41414141 11000128 65613733 32653461 2D666466 392D3435 63342D39 6563612D 64393735 62633064 36353063 00663062 64323564 612D6532 30652D34 3332312D 39336531 2D333864 31353439 39363533 30006662 37303331 32632D32 3863332D 34623663 2D616466 312D3461 66663963 35643763 32360034 38376434 3963372D 33393463 2D346564 612D3933 64632D65 33393135 34313537 34346300 61323334 39636635 2D346566 372D3465 33352D39 3363622D 66383133 31633533 38386365 00323233 30343431 312D6435 30342D34 6532642D 39663133 2D646139 66646439 32373534 63006135 33653961 36612D62 3636322D 34343233 2D626536 612D3164 63636234 36396265 61640063 33333261 6462662D 31343236 2D343365 622D6264 61322D38 36336461 32623735 33373300

It looks like I can extract all other data to JSON at this point (except for a few minor strings and entries I skipped) - except for table #10. It only contains 8 entries, of which only the "color" column would be important to restore.

BlueCobold commented 2 years ago

I believe I also found the array which contains the "color" column of table #10, at offset 0x107BB8:

41414141 06000008 1B6392FF 253E3AFF 528498FF AADDEEFF 03AEECFF 415052FF 27332DFF 9D6300FF

Edit: Nope, I think this one is related to table #14, sadly. So maybe the 'color' column for table #10 is lost, because it should start with 41414141 06000008 and the highest byte of each integer should be FF or 00, and this is the only array which adheres to that. But since it is referenced from a 14-column array of which one column points to strings I recognise only from table #14, I assume it's table #14 instead. Much sad.
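As an aside, the arrays above can be located mechanically: every dump starts with the 41414141 ("AAAA") cookie, followed by what looks like a flags byte and a 3-byte big-endian size field (e.g. 07000008 for the 8 object keys). Below is a hedged sketch of scanning a decrypted file for such candidates, assuming 8-byte alignment as the offsets above suggest; the interpretation of the header bytes is a guess taken from these dumps, not a format specification.

```swift
import Foundation

// Hedged sketch: list candidate array headers in a decrypted Realm file by
// looking for the 0x41414141 cookie and printing the byte that follows it
// plus the 3-byte big-endian size field. Offsets are assumed 8-byte aligned.
func listArrayCandidates(in data: Data) {
    let cookie: [UInt8] = [0x41, 0x41, 0x41, 0x41]
    var offset = 0
    while offset + 8 <= data.count {
        if Array(data[offset..<offset + 4]) == cookie {
            let flags = data[offset + 4]
            let size = (UInt32(data[offset + 5]) << 16) |
                       (UInt32(data[offset + 6]) << 8) |
                        UInt32(data[offset + 7])
            print(String(format: "array header at 0x%06llx  flags 0x%02x  size %d",
                         UInt64(offset), flags, size))
        }
        offset += 8
    }
}

// Usage: listArrayCandidates(in: try Data(contentsOf: decryptedFileURL))
```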

BlueCobold commented 2 years ago

@jedelbo Another customer sent me a realm file which causes this when trying to write or delete a specific value to/from it, and I worry it may be related:

Build fingerprint: 'google/sdk_gphone_x86/generic_x86:9/PSR1.180720.012/4923214:user/release-keys'
Revision: '0'
ABI: 'x86'
pid: 24949, tid: 25253, name: DefaultDispatch  >>> de.game_coding.trackmytime <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
Cause: null pointer dereference
    eax 00000000  ebx cd561880  ecx 00000000  edx cd5618a0
    edi c73d0468  esi 00000000  ebp cb681f28  esp cb681ef0
    eip caeea57c

backtrace:

#00 pc 005db57c  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#01 pc 005db905  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#02 pc 005dc124  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#03 pc 0065b985  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#04 pc 006b004c  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#05 pc 005d399c  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#06 pc 005d2af3  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#07 pc 006c6b6d  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#08 pc 003db034  /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so (Java_io_realm_internal_Table_nativeSetLong+372)
#09 pc 0006e100  /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.internal.Table.nativeSetLong+224)
#10 pc 00061470  /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.de_game_coding_trackmytime_storage_inventory_ProductDbRealmProxy.insertOrUpdate+2512)
#11 pc 00066507  /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.de_game_coding_trackmytime_storage_inventory_ProductCategoryDbRealmProxy.insertOrUpdate+2103)
#12 pc 0006ed20  /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.DefaultRealmModuleMediator.insertOrUpdate+2560)

jedelbo commented 2 years ago

@BlueCobold It might be related, but the stack trace does not make us any wiser.

BlueCobold commented 2 years ago

@jedelbo Thought so, but I figured I'd provide what I can. Do you want that file for analysis? (It doesn't need recovery - I re-imported its data into a fresh file - but I can offer it to you if it might help to identify bugs.)

jedelbo commented 2 years ago

@BlueCobold All files are welcome. Maybe it contains that piece of information that can help us further.

jedelbo commented 2 years ago

There seem to be two kinds of problems related to this issue. One is that some refs are not updated correctly; this is probably happening above the encryption layer. The other problem is that an encrypted page is written to the wrong location, resulting in the first page of the decrypted file containing data that should have been somewhere else.

BlueCobold commented 2 years ago

Sounds like some serious issue with multithreading and/or with internal reference/pointer handling in realm-core, then, doesn't it?

BlueCobold commented 2 years ago

@nicola-cab The release notes in realm-core say the PR fixes an issue that has existed since v11.8.0. However, this bug report was already filed against v10.10.0. So either the commit doesn't fix it, or the release notes are incorrect.

BlueCobold commented 1 year ago

@nicola-cab @jedelbo Just to update this a bit: since I started queueing all my Realm operations on a single thread, the issue seems to be gone in both the Android and iOS versions of my apps in production (about 10k users in total). Since the issue existed on both platforms, I don't think the iOS-only fix #5993 can have solved it - first because it is iOS-only, and also because it seems to be related to crashes, which does not go hand in hand with my multi-thread/single-thread observation: the number of crashes should have stayed the same, so the number of corruptions should not have been reduced by my change.
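For completeness, a minimal sketch of the workaround described here - funnelling every Realm write through one serial dispatch queue - using the public RealmSwift API. The queue label and the performWrite helper are illustrative names, not part of Realm, and this is only one way to serialize writes, not a confirmed fix for the underlying core bug.

```swift
import Foundation
import RealmSwift

// One serial queue for all writes, so no two threads ever write concurrently.
let realmWriteQueue = DispatchQueue(label: "app.realm.writes")   // serial by default

// Illustrative helper: open a Realm on the queue's thread and run the write there.
func performWrite(_ block: @escaping (Realm) throws -> Void) {
    realmWriteQueue.async {
        autoreleasepool {
            do {
                let realm = try Realm()
                try realm.write { try block(realm) }
            } catch {
                print("Realm write failed: \(error)")
            }
        }
    }
}

// Usage: performWrite { realm in realm.add(item, update: .modified) }
```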