Very frequent input/output errors on `read` in tracer

thomas-daniels commented 3 years ago

Happens with both tracer and tracer_loop. It attaches to the Celeste executable but very shortly after, I get read: Input/output error and it crashes. (tracer_loop starts a new instance shortly after but the read error just repeats then)

Not all reads are unsuccessful. Running the timer shortly shows sensible timings until the read error happens.

Ubuntu 20.10, Celeste 1.4.0.0 (I had Everest enabled at the time if that matters). Happy to provide more/better diagnosis of the problem, just not sure where and what to look for at the moment.

rhelmot commented 3 years ago

I'm not sure what this could be... I'm running on Ubuntu 20.04 Celeste 1.4.0.0 Everest 2889 and things seem to work fine. Could you try building the tracer with -DDEBUG and seeing if the debug output is at all helpful? Are you running with celeste's built-in monokickstart or your system mono?

For context, the most common cause of read: Input/output error is basically a "segfault" on trying to access the traced program's memory. It means we're trying to load unmapped memory. If that's happening after the program has successfully booted and traced a few frames, it probably indicates that something concerning is happening that alters the program's memory structure without our knowing. I guess this could happen if for some reason there were a lot of latency introduced somewhere in the tracer, such that in between reading a pointer and using it, Celeste could free that pointer so we're accessing invalid data (not unmapped but rather containing undefined values).

Your tasks:

Compile with DEBUG set and show me what a crash looks like
Run the tracer (compiled with debug symbols) under gdb with a breakpoint set on exit so we can get a stack trace

thomas-daniels commented 3 years ago

Here's the debug log (from _loop):

... more of the same stuff here ...
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
read: Input/output error
Connected to Celeste.exe+Everest
ASI @ 0x7f6f30516620
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
Passed
areas_arr = 0x7f6edd177790
area_stats = 0x7f6ea39536f0
mode_arr = 0x7f6f22a58188
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
mode_stats = 0x7f6ea3902d10
checkpoints_obj = 0x7f6ea38ff630
CurrentLevelCheckpoints = 0
chapter = 0, mode = 0
savedata_addr = 0x7f6f486a9840
read: Input/output error

And here's the gdb output (not _loop):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 4012]
[New Thread 0x7ffff7d90640 (LWP 4013)]
read: Input/output error
[Switching to Thread 0x7ffff7d90640 (LWP 4013)]

Thread 2 "celeste_tracer" hit Breakpoint 1, __GI_exit (status=1) at exit.c:138
138 exit.c: No such file or directory.

(don't know if this is useful or if I'm using gdb entirely right - been years since I last used it and never used it extensively)

Let me know if anything else could be useful!

rhelmot commented 3 years ago

In gdb, you need to input the command bt in order to get a backtrace.

thomas-daniels commented 3 years ago

(gdb) bt
#0  __GI_exit (status=1) at exit.c:138
#1  0x0000555555555876 in read_mem (fd=3, addr=10581584, buf=0x7ffff7d8fc70, len=8) at celeste_tracer.c:88
#2  0x00005555555558b2 in read_qword (fd=3, addr=10581584) at celeste_tracer.c:94
#3  0x00005555555562f7 in load_base_info (memfd=3) at celeste_tracer.c:268
#4  0x00005555555565f0 in dump_info_loop (v=0x55555555a2e0) at celeste_tracer.c:346
#5  0x00007ffff7f87590 in start_thread (arg=0x7ffff7d90640) at pthread_create.c:463
#6  0x00007ffff7ead223 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

thomas-daniels commented 3 years ago

Actually, I realized I wasn't thinking this through. My argument to --launch was not the actual Celeste executable, but a shell script redirecting the Celeste output to /dev/null (because I wasn't interested in that output) so I guess that's where things to wrong when reading memory. It doesn't crash when I pass on the actual Celeste executable.

I suppose this invalidates the read errors for non-_loop, but they still happen for _loop. I'll see if I can debug that.

rhelmot commented 3 years ago

celeste_tracer_loop is a very simple shell script - you should be able to read it and replicate its contents in order to work on debugging.

By any chance, did you implement your redirect script by replacing Celeste.bin.x86_64? That will cause problems, since the way the tracer's attach mode works is that it searches through the list of active processes for Celeste.bin.x86_64 and assumes that is the mono process. You should consider instead modifying the Celeste script.

thomas-daniels commented 3 years ago

No, the redirect script was an entirely new script.

It's very strange that I also cannot reproduce the issue anymore with _loop or --attach... I don't think I'm doing anything else than before. I think I'll close it then until it happens again or I get better steps to reproduce.

thomas-daniels commented 3 years ago

I found a consistent way to reproduce it! The tracer works fine until I create a new save file. After that, it keeps giving the read error (until I restart Celeste, just switching save files or anything else doesn't help).

savedata_addr = 0x7fb08e4553f8
mode_stats = (nil)
chapter = -1, mode = -1
savedata_addr = 0x7fb08e4553f8
read: Input/output error
[Switching to Thread 0x7ffff7d90640 (LWP 38869)]

Thread 2 "celeste_tracer" hit Breakpoint 1, __GI_exit (status=1) at exit.c:138
138 exit.c: No such file or directory.
(gdb) bt
#0  __GI_exit (status=1) at exit.c:138
#1  0x0000555555555876 in read_mem ()
#2  0x00005555555559aa in read_byte ()
#3  0x0000555555555ca1 in class_kind ()
#4  0x0000555555555cf3 in class_field_offset ()
#5  0x0000555555556092 in instance_field_dword ()
#6  0x00005555555567f8 in dump_info_loop ()
#7  0x00007ffff7f87590 in start_thread (arg=0x7ffff7d90640) at pthread_create.c:463
#8  0x00007ffff7ead223 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

rhelmot commented 2 years ago

Hey! Sorry to say, but I still can't reproduce it :( I attached the tracer and did three laps of deleting a save and playing prologue, and nothing crashed at all.

The only thing I can think of at this point based on your log output is that somehow your system has an incredibly high latency wrt /proc/pid/mem reads or console I/O or something, and somehow between the DBGPRINT("savedata_addr and the DeathCount = instance_field_dword lines the SaveData instance is being reallocated or garbage collected or something. Here are some potential solutions:

1) Add a "safe mode" to the tracer that does nothing but dump the autosplitter info object. This would be kind of sad. 2) Replace the exit() with a longjmp back to the start of the main loop. This would be kind of dangerous, as it has the potential to soak up real bugs that aren't just race conditions 3) Pause the entire celeste process while we're doing the extraction. I don't actually know how to do this and I don't know if it would be quick enough to prevent the celeste process from lagging.

Do you think you have the ability to implement any of these for me? If not, I'll gladly do whatever we decide on, it might just take me a few weeks to get around to it.

rhelmot commented 2 years ago

AH NEVER MIND, I TRIED AGAIN USING EVEREST AND IT CRASHED IMMEDIATELY. I'll see what I can do!!!

thomas-daniels commented 2 years ago

Thanks for trying it out! Yeah, I think I tried it on both my Ubuntu laptop and my Arch desktop, and it was really consistent, but I did indeed only test with Everest. Didn't occur to me to test without. Glad you could reproduce and thanks for your help!

rhelmot commented 2 years ago

I think I "fixed" it, with option 2. Try it out~!

thomas-daniels commented 2 years ago

Yep, that works! It seems like starting a new save now leads to one read/write error and the next iteration in the loop runs fine. Thanks!

rhelmot / CelesteMagicTimer

Very frequent input/output errors on `read` in tracer #16