matrix-org / matrix-rust-sdk

Matrix Client-Server SDK for Rust
Apache License 2.0
1.19k stars 232 forks source link

Error: invalid type: newtype struct, expected any valid JSON value at line 1 column 1061 #3557

Open CyberShadow opened 2 months ago

CyberShadow commented 2 months ago

I wrote a Matrix bot using this library a while ago. Today, it won't start.

On startup, it just prints:

Error: invalid type: newtype struct, expected any valid JSON value at line 1 column 1061

and exits.

I'm guessing some file got corrupted. But how can I find which?

My usual debugging methods failed me.

Thread 1 "matrix_bot" hit Breakpoint 1, 0x00007ffff751a170 in write () from /usr/lib/libc.so.6
(gdb) where
#0  0x00007ffff751a170 in write () from /usr/lib/libc.so.6
#1  0x0000555557d29ebd in std::sys::pal::unix::fd::FileDesc::write () at library/std/src/sys/pal/unix/fd.rs:264
#2  std::sys::pal::unix::stdio::{impl#5}::write () at library/std/src/sys/pal/unix/stdio.rs:71
#3  std::io::Write::write_all<std::sys::pal::unix::stdio::Stderr> () at library/std/src/io/mod.rs:1706
#4  std::io::stdio::{impl#2}::write_all () at library/std/src/io/stdio.rs:176
#5  0x0000555557d2bd59 in std::io::stdio::{impl#21}::write_all () at library/std/src/io/stdio.rs:976
#6  std::io::Write::write_fmt::{impl#0}::write_str<std::io::stdio::StderrLock> () at library/std/src/io/mod.rs:1835
#7  0x0000555557d5873c in core::fmt::write () at library/core/src/fmt/mod.rs:1118
#8  0x0000555557d2a67b in std::io::Write::write_fmt<std::io::stdio::StderrLock> () at library/std/src/io/mod.rs:1846
#9  std::io::stdio::{impl#20}::write_fmt () at library/std/src/io/stdio.rs:956
#10 0x0000555557d2aab2 in std::io::stdio::{impl#19}::write_fmt () at library/std/src/io/stdio.rs:930
#11 std::io::stdio::attempt_print_to_stderr () at library/std/src/io/stdio.rs:1057
#12 0x00005555559d9491 in std::process::{impl#61}::report<(), anyhow::Error> (self=...)
    at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/process.rs:2443
#13 0x0000555555b720b9 in std::rt::lang_start::{closure#0}<core::result::Result<(), anyhow::Error>> ()
    at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/rt.rs:166
#14 0x0000555557d24ef1 in core::ops::function::impls::{impl#2}::call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/core/src/ops/function.rs:284
#15 std::panicking::try::do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panicking.rs:554
#16 std::panicking::try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> ()
    at library/std/src/panicking.rs:518
#17 std::panic::catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panic.rs:142
#18 std::rt::lang_start_internal::{closure#2} () at library/std/src/rt.rs:148
#19 std::panicking::try::do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panicking.rs:554
#20 std::panicking::try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library/std/src/panicking.rs:518
#21 std::panic::catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panic.rs:142
#22 std::rt::lang_start_internal () at library/std/src/rt.rs:148
#23 0x0000555555b7208a in std::rt::lang_start<core::result::Result<(), anyhow::Error>> (main=0x55555592ea90 <matrix_bot::main>, argc=1, 
    argv=0x7fffffffdca8, sigpipe=0) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/rt.rs:165
#24 0x000055555592ecbe in main ()
#25 0x00007ffff7443cd0 in ?? () from /usr/lib/libc.so.6
#26 0x00007ffff7443d8a in __libc_start_main () from /usr/lib/libc.so.6
#27 0x00005555557f6a25 in _start ()

Any advice would be appreciated.

CyberShadow commented 2 months ago

In the end I deleted all local state files and let the bot log in as a new device.

I guess there is still probably a data integrity bug somewhere, and another issue where problems with the former are not well communicated / debuggable.

bnjbvr commented 2 months ago

Thanks for opening an issue. If you can still reproduce the bug, can you build with debug symbols? They're missing here, so we can't pinpoint which function in the SDK is causing the crash. (Although that looks like a deserialization issue, either an event couldn't be deserialized properly, or some data stored on disk has changed format after an upgrade and it wasn't properly annotated for clean deserialization.)

CyberShadow commented 2 months ago

If you can still reproduce the bug, can you build with debug symbols?

Yes, I saved a copy of the bad files.

But, it looks like it is already built with debug symbols (assuming unoptimized + debuginfo implies that):

$ cargo build             
    Finished dev [unoptimized + debuginfo] target(s) in 0.19s

$ RUST_BACKTRACE=1 cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/matrix_bot`
Error: invalid type: newtype struct, expected any valid JSON value at line 1 column 1061

Anything I'm missing?

bnjbvr commented 2 months ago

We're filtering out debug info for dubious reasons in the Cargo.toml dev profile, but there's a dbg profile you could use: cargo build --profile dbg should give us the debug symbols.

CyberShadow commented 2 months ago

but there's a dbg profile you could use: cargo build --profile dbg should give us the debug symbols.

I'm not sure if I did this right, but this seems to give me the same result.

Here is what I did:

  1. RUST_BACKTRACE=full cargo run --profile dbg didn't work (I got error: profile `dbg` is not defined).

  2. From reading https://doc.rust-lang.org/cargo/reference/profiles.html I understood that profiles are inherited by dependencies. So, I copied the definition of the profile to my Cargo.toml:

    [profile.dbg]
    inherits = "dev"
    debug = 2
  3. RUST_BACKTRACE=full cargo run --profile dbg now builds the program, but it crashes in the same way:

    Finished dbg [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/dbg/matrix_bot`
    Error: invalid type: newtype struct, expected any valid JSON value at line 1 column 1061
  4. I tried the gdb + breakpoint write approach again, but the stack trace does not seem any more useful:

Thread 1 "matrix_bot" hit Breakpoint 1, 0x00007ffff751a170 in write () from /usr/lib/libc.so.6
(gdb) where
#0  0x00007ffff751a170 in write () from /usr/lib/libc.so.6
#1  0x0000555557d29ebd in std::sys::pal::unix::fd::FileDesc::write () at library/std/src/sys/pal/unix/fd.rs:264
#2  std::sys::pal::unix::stdio::{impl#5}::write () at library/std/src/sys/pal/unix/stdio.rs:71
#3  std::io::Write::write_all<std::sys::pal::unix::stdio::Stderr> () at library/std/src/io/mod.rs:1706
#4  std::io::stdio::{impl#2}::write_all () at library/std/src/io/stdio.rs:176
#5  0x0000555557d2bd59 in std::io::stdio::{impl#21}::write_all () at library/std/src/io/stdio.rs:976
#6  std::io::Write::write_fmt::{impl#0}::write_str<std::io::stdio::StderrLock> () at library/std/src/io/mod.rs:1835
#7  0x0000555557d5873c in core::fmt::write () at library/core/src/fmt/mod.rs:1118
#8  0x0000555557d2a67b in std::io::Write::write_fmt<std::io::stdio::StderrLock> () at library/std/src/io/mod.rs:1846
#9  std::io::stdio::{impl#20}::write_fmt () at library/std/src/io/stdio.rs:956
#10 0x0000555557d2aab2 in std::io::stdio::{impl#19}::write_fmt () at library/std/src/io/stdio.rs:930
#11 std::io::stdio::attempt_print_to_stderr () at library/std/src/io/stdio.rs:1057
#12 0x00005555559d9491 in std::process::{impl#61}::report<(), anyhow::Error> (self=...) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/process.rs:2443
#13 0x0000555555b720b9 in std::rt::lang_start::{closure#0}<core::result::Result<(), anyhow::Error>> () at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/rt.rs:166
#14 0x0000555557d24ef1 in core::ops::function::impls::{impl#2}::call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/core/src/ops/function.rs:284
#15 std::panicking::try::do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panicking.rs:554
#16 std::panicking::try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/std/src/panicking.rs:518
#17 std::panic::catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panic.rs:142
#18 std::rt::lang_start_internal::{closure#2} () at library/std/src/rt.rs:148
#19 std::panicking::try::do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panicking.rs:554
#20 std::panicking::try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library/std/src/panicking.rs:518
#21 std::panic::catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panic.rs:142
#22 std::rt::lang_start_internal () at library/std/src/rt.rs:148
#23 0x0000555555b7208a in std::rt::lang_start<core::result::Result<(), anyhow::Error>> (main=0x55555592ea90 <matrix_bot::main>, argc=1, argv=0x7fffffffdc98, sigpipe=0) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/rt.rs:165
#24 0x000055555592ecbe in main ()
#25 0x00007ffff7443cd0 in ?? () from /usr/lib/libc.so.6
#26 0x00007ffff7443d8a in __libc_start_main () from /usr/lib/libc.so.6
#27 0x00005555557f6a25 in _start ()
(gdb) quit
bnjbvr commented 2 months ago

I'm not sure if our dev setting is being propagated to upstream users of the library… Can you try this, one line for each matrix- crate you're using, please:

[profile.dev.package]
matrix-sdk = { debug = 2 }
matrix-sdk-base = { debug = 2 }
CyberShadow commented 2 months ago

OK, I added the following to Cargo.toml:


[profile.dev.package]
matrix-pickle = { debug = 2 }
matrix-pickle-derive = { debug = 2 }
matrix-sdk = { debug = 2 }
matrix-sdk-base = { debug = 2 }
matrix-sdk-common = { debug = 2 }
matrix-sdk-crypto = { debug = 2 }
matrix-sdk-indexeddb = { debug = 2 }
matrix-sdk-sqlite = { debug = 2 }
matrix-sdk-store-encryption = { debug = 2 }

I got this list from Cargo.lock:

cat Cargo.lock | grep '^name = "matrix-' | sed 's/^name = "\(.*\)"/\1 = { debug = 2 }/'

However, it still seems to behave in the same way:

$ RUST_BACKTRACE=full cargo run                                                          
   Compiling matrix-sdk-common v0.6.0 (https://github.com/matrix-org/matrix-rust-sdk?rev=a094e10b35615e12488b36e01ca4944353aa8c4e#a094e10b)
   Compiling matrix-sdk v0.6.2 (https://github.com/matrix-org/matrix-rust-sdk?rev=a094e10b35615e12488b36e01ca4944353aa8c4e#a094e10b)
   Compiling matrix-pickle-derive v0.1.0
   Compiling matrix-pickle v0.1.0
   Compiling vodozemac v0.4.0
   Compiling matrix-sdk-crypto v0.6.0 (https://github.com/matrix-org/matrix-rust-sdk?rev=a094e10b35615e12488b36e01ca4944353aa8c4e#a094e10b)
   Compiling matrix-sdk-base v0.6.1 (https://github.com/matrix-org/matrix-rust-sdk?rev=a094e10b35615e12488b36e01ca4944353aa8c4e#a094e10b)
   Compiling matrix-sdk-sqlite v0.1.0 (https://github.com/matrix-org/matrix-rust-sdk?rev=a094e10b35615e12488b36e01ca4944353aa8c4e#a094e10b)
   Compiling matrix_bot v0.1.0 (/home/vladimir/work/matrix-bot_)
    Finished dev [unoptimized + debuginfo] target(s) in 36.76s
     Running `target/debug/matrix_bot`
Error: invalid type: newtype struct, expected any valid JSON value at line 1 column 1061
bnjbvr commented 2 months ago

Uh, sorry, I completely misread this: this is likely an Error result transformed into a log somewhere; would be nice to look with a debugger where this error result comes from, to have a better idea of what's going on (although I strongly suspect it's a change in the storage format).

CyberShadow commented 2 months ago

would be nice to look with a debugger where this error result comes from, to have a better idea of what's going on

I've tried to do this, but I don't know which symbol to breakpoint. breakpoint write didn't get me anywhere - any suggestions?

zecakeh commented 2 months ago

We also had 2 bug reports about this a few months ago for Fractal: https://gitlab.gnome.org/World/fractal/-/issues/1416. The time when it happened didn't match an update of the SDK on our end.

We didn't really have the time to try to debug it. Given the error message, we just know that it happens either in ClientBuilder::build() or in Client::restore_session().

Deleting the stores seems to get rid of the issue.