google / oss-fuzz

OSS-Fuzz - continuous fuzzing for open source software.
https://google.github.io/oss-fuzz
Apache License 2.0
10.45k stars 2.21k forks source link

Rust-based fuzzers frequently crashing at startup without logs #10186

Closed alexcrichton closed 1 year ago

alexcrichton commented 1 year ago

Recently we've found a few bugs locally in Wasmtime and its related projects through fuzzing which felt like they should have been found on OSS-Fuzz much longer ago. I've been looking at some of the coverage reports and it looks like Wasmtime's fuzzers fuzzing_time_percent, which used to be 99+, is now <50 for almost all fuzzers. Taking a look at a simple fuzzer, libFuzzer_wasmtime_wasm-tools-text-parser, its perf_report links say that 43% of runs are affected by a startup_crash, but when looking at the logs it appears to be blank. For example 07:49:29:852659.log shows:

Component revisions (build r202304250620):
Wasmtime: a1732b2906d79033ff9266ac391a9783ac6daa27
Regalloc2: 244ede83638762701831453a43e6011a7dbe0e94
Wasm-tools: 016838279808be4d257f1b58b9942420f0a09855

Bot name: oss-fuzz-linux-zone1-host-xdmt-12
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_wasmtime_9d7f296cb3c934976ab46f0ee760a3a07ef3344a/revisions/wasm-tools-text-parser -timeout=25 -rss_limit_mb=2560 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=5700 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-403/new /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-403/mutations /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-403/subset
Time ran: 0.011565923690795898

cf::fuzzing_strategies: corpus_subset:75

Is there a way for us to dig into what the crash is and figure out what's going on here?

One wild guess is that Rust in the past few months upgraded from LLVM 15 to LLVM 16 which may be causing issues, but that's just a guess.

guidovranken commented 1 year ago

I don't know if it's related to your issue, but the wasmtime fuzzers are very slow. In order to get a corpus with much better coverage (https://github.com/bytecodealliance/wasmtime-libfuzzer-corpus/pull/5) I had to fuzz the instantiate fuzzer on 256 cores for weeks. There are also some very slow inputs in the current public OSS-Fuzz corpus for the instantiate fuzzer that are particularly slow (e.g. take 20 seconds or so to complete. I didn't report this to you as I assumed you were already notified by OSS-Fuzz of this). Additionally I think the fuzzer running on a large corpus requires more memory than the default 2GB; if you don't set libFuzzer's -rss_limit_mb to something higher than that, you will get out-of-memory crashes. At least the instantiate fuzzer would benefit greatly from less resource consumption.

alexcrichton commented 1 year ago

I submitted https://github.com/bytecodealliance/wasmtime/pull/6330 which should fix the long test case issue you're seeing. That's a good point about -rss_limit_mb though, thanks for the reminder! I'll try to rerun the oss-fuzz-built fuzzers over the oss-fuzz-generated-corpuses with that option and see if anything runs away with too much memory.

alexcrichton commented 1 year ago

I'm unfortunately still perplexed as to what's going on here. One example is:

The log there says:

Component revisions (build r202305030624):
Wasmtime: 60ce6f5d5c5b7defc6f7a8bfaa6bfb8db82b0aa1
Regalloc2: 1b4287b6ad0e756a833a7ed5f62c3b475f0b1af0
Wasm-tools: 34216db2b321d38a67b42300034cdbdeaf780fa0

Bot name: oss-fuzz-linux-zone6-host-xd76-9
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_wasmtime_9d7f296cb3c934976ab46f0ee760a3a07ef3344a/revisions/instantiate -timeout=25 -rss_limit_mb=2560 -fork=2 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=6200 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-383/new /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/wasmtime_instantiate
Time ran: 0.03678083419799805

cf::fuzzing_strategies: fork:2

and the testcase listed is an empty file. This seems like the fuzzer might be crashing while it's loading the corpus perhaps. Again though above I come back to the text-parser fuzzer which is quite simple and runs quite fast but still has bad performance.

I also just remembered to check a different project I have on OSS-Fuzz, flate2-rs. That one has had no code changes in months and the April 3, 2023 report had it at 99.5 fuzzing_time_percent where April 4, 2023 it dropped to 48.6. Similarly it has a number of blank startup crashes.

Is there a way I could inspect the build logs from these builds from a month or so ago? I'd like to compare versions of software/compilers/etc to try to see what changed and might be causing this.

alexcrichton commented 1 year ago

@DavidKorczynski IIRC you've integrated a number of Rust projects into OSS-Fuzz historically, so I was wondering if you'd be able to check ot see if they're experiencing similar nonzero-startup_crash-issues as well? for Wasmtime and flate2 I was seeing 30-40 startup_crash percentages. Given that these are entirely unrelated projects and that flate2 has had no changes at all recently I'm curious if other Rust projects are similarly plauged as well.

I've tried to narrow this down to something that changed, for example cargo-fuzz-related or rustc related, but I've so far been unsuccessful. I'm hoping though that if more unrelated projects are experiencing something similar if it's possible to draw connections and see what's going on?

DavidKorczynski commented 1 year ago

@alexcrichton I'll take a look at this tomorrow!

DavidKorczynski commented 1 year ago

@alexcrichton I went over three projects (serde_json, image_png, json5format)and saw this behaviour on all three:

Bot name: oss-fuzz-linux-zone8-host-bhqx-14 Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/buf_independent -timeout=25 -rss_limit_mb=2560 -max_len=2358 -fork=2 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=6200 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-383/new /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/image-png_buf_independent Time ran: 0.014983415603637695

cf::fuzzing_strategies: fork:2,random_max_len:1


Similar to @alexcrichton startup_crash_percent started happening on the April 4 for all of the projects. Screenshot of the `startup_crash_percent` column when looking at the history of individual fuzzers:

![Screenshot 2023-05-08 112035](https://user-images.githubusercontent.com/657617/236799844-d11f6054-571f-4ed1-83fd-24908ae4a44d.png)

I see `startup_crash_percent` is 0 on April 3, but ~20 April 4 and then goes up to ~35ish This is consistent amongst the three projects.
DavidKorczynski commented 1 year ago

Most logs does not have much info, however, I got this one which reports which reports a Segmentation fault :

Component revisions (build r202305070613):
Image-png: 4579a7cdb568c6bf60c252a5c488a242cf713954

Bot name: oss-fuzz-linux-zone8-host-bhqx-14
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode -timeout=25 -rss_limit_mb=2560 -max_len=407 -fork=2 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=6200 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-383/new /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/image-png_decode
Time ran: 12.281909227371216

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2880567337
INFO: Loaded 1 modules   (38781 inline 8-bit counters): 38781 [0x563d871e3440, 0x563d871ecbbd),
INFO: Loaded 1 PC tables (38781 PCs): 38781 [0x563d871ecbc0,0x563d87284390),
INFO: -fork=2: fuzzing in separate process(s)
INFO: -fork=2: 2765 seed inputs, starting to fuzz in /tmp/libFuzzerTemp.FuzzWithFork19243.dir
#0: cov: 16371 ft: 16371 corp: 2765 exec/s 0 oom/timeout/crash: 0/0/0 time: 12s job: 1 dft_time: 0
INFO: log from the inner process:
Segmentation fault
INFO: exiting: 139 time: 12s
cf::fuzzing_strategies: fork:2,random_max_len:1

I see other similar logs:

Component revisions (build r202305070613):
Image-png: 4579a7cdb568c6bf60c252a5c488a242cf713954

Bot name: oss-fuzz-linux-zone6-host-c3cl-14
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode -timeout=25 -rss_limit_mb=2560 -fork=2 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=6200 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-383/new /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/image-png_decode
Time ran: 43.253541231155396

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1848597903
INFO: Loaded 1 modules   (38781 inline 8-bit counters): 38781 [0x5fc9ea4ed440, 0x5fc9ea4f6bbd),
INFO: Loaded 1 PC tables (38781 PCs): 38781 [0x5fc9ea4f6bc0,0x5fc9ea58e390),
INFO: -fork=2: fuzzing in separate process(s)
INFO: -fork=2: 3350 seed inputs, starting to fuzz in /tmp/libFuzzerTemp.FuzzWithFork12249.dir
#610: cov: 20096 ft: 20096 corp: 3350 exec/s 305 oom/timeout/crash: 0/0/0 time: 42s job: 1 dft_time: 0
  NEW_FUNC: 0x5fc9ea0cb5e0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x2105e0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea0cbfe0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x210fe0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea0d3f00  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x218f00) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea0dbd10  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x220d10) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea144ab0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x289ab0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea144af0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x289af0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea145e00  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x28ae00) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea146420  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x28b420) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea15fc50  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x2a4c50) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea230a10  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x375a10) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea2600c0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x3a50c0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea26ce80  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x3b1e80) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea270380  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x3b5380) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea2713b0  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x3b63b0) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
  NEW_FUNC: 0x5fc9ea271d10  (/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode+0x3b6d10) (BuildId: f8e2ac9f3c1044d9909c1257f286bf95b49cfc97)
#1293: cov: 20201 ft: 20097 corp: 3351 exec/s 227 oom/timeout/crash: 0/0/0 time: 43s job: 2 dft_time: 0
#1293: cov: 20201 ft: 20097 corp: 3351 exec/s 0 oom/timeout/crash: 0/0/0 time: 43s job: 4 dft_time: 0
INFO: log from the inner process:
Segmentation fault
INFO: exiting: 139 time: 43s
cf::fuzzing_strategies: fork:2
DavidKorczynski commented 1 year ago

Another strange error I came across with some more logging, that may be related since it happens early is image_png 2023-5-08 02:22:29:7930734)

Component revisions (build r202305070613):
Image-png: 4579a7cdb568c6bf60c252a5c488a242cf713954

Bot name: oss-fuzz-linux-zone8-host-bhqx-15
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode -timeout=25 -rss_limit_mb=2560 -dict=/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_image-png_b0e58386fac39f658accbcc17c96e929fd304259/revisions/decode.dict.merged -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=5700 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-385/new /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-385/mutations /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/image-png_decode
Time ran: 0.019207000732421875

ParseDictionaryFile: error in line 3
        "\001\322"
cf::fuzzing_strategies: corpus_mutations_radamsa:1,recommended_dict:1
DavidKorczynski commented 1 year ago

Is there a way I could inspect the build logs from these builds from a month or so ago? I'd like to compare versions of software/compilers/etc to try to see what changed and might be causing this.

@jonathanmetzman @Alan32Liu do you know how to do this?

alexcrichton commented 1 year ago

Thanks for confirming @DavidKorczynski! Sounds like Wasmtime itself is unlikely the culprit here and possibly something Rust-specific so far. Do you happen to have any non-Rust projects under OSS-Fuzz? If so, do they also show this startup crash phenomenon on April 4? If so then it may not be Rust-specific, and if not that'd be a good confirmation it's Rust-specific.

DavidKorczynski commented 1 year ago

If so, do they also show this startup crash phenomenon on April 4?

I checked and non-Rust projects run fine -- as far as I can tell this is a Rust issue.

I think the easiest would be to get some of the build logs from those dates as you suggest above, and then investigate the differences.

alexcrichton commented 1 year ago

Out of curiosity have you been able to reproduce any crashes locally? I've so far been unsuccessful unfortunately.

DavidKorczynski commented 1 year ago

Out of curiosity have you been able to reproduce any crashes locally? I've so far been unsuccessful unfortunately.

I'm unable able to reproduce as well

guidovranken commented 1 year ago

Just an idea: OSS-Fuzz runs the fuzzers on a variety of CPUs with different capabilities (e.g. instructions like AVX-512 not supported by all CPUs). If the Rust compiler detects AVX-512 on the build system at compile time it might insert these instructions into the binaries, leading to crashes on a percentage of CPUs. Similarly the MOVBE instruction will crash on certain CPUs. Maybe something changed in the Rust compiler with regards to this before and after April 4th.

alexcrichton commented 1 year ago

Perhaps! Given that there's only one build a day I'm assuming that the same binary runs everywhere, though, meaning that any compile-time-feature detection would not work since it would auto-crash on machines without features. Runtime detection is of course still possible, though.

It'd be helpful to see the exact compiler verison used to produce the April 4 and April 3 binaries. I don't have that off-hand and the changes between nightly-2023-04-{03,04} do not look suspicious to me. If https://github.com/google/oss-fuzz/pull/10270 I can try a bisection perhaps with a once-a-day roundtrip time :)

jonathanmetzman commented 1 year ago

I don't have any easy way to get you old build logs

alexcrichton commented 1 year ago

Ok https://github.com/google/oss-fuzz/pull/10270 didn't fix things for Wasmtime as it still has a nonzero startup crash percentage recorded for May 10 fuzzing yesterday. An example log is:

Component revisions (build r202305090625):
Wasmtime: 2300ef6894605e89bd94f6b2da75ba55e5ab91ac
Regalloc2: 1b4287b6ad0e756a833a7ed5f62c3b475f0b1af0
Wasm-tools: 16b82b2db210df9b6a7c0c354c0dc8528f700b43

Bot name: oss-fuzz-linux-zone6-host-xd76-11
Return code: 1

Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_wasmtime_9d7f296cb3c934976ab46f0ee760a3a07ef3344a/revisions/wasm-tools-text-parser -timeout=25 -rss_limit_mb=2560 -fork=2 -artifact_prefix=/mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/ -max_total_time=6500 -print_final_stats=1 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases-disk/temp-386/new /mnt/scratch0/clusterfuzz/bot/inputs/data-bundles/wasmtime_wasm-tools-text-parser
Time ran: 44.596296548843384

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 223596020
INFO: Loaded 1 modules   (52668 inline 8-bit counters): 52668 [0x5d2becb92a00, 0x5d2becb9f7bc),
INFO: Loaded 1 PC tables (52668 PCs): 52668 [0x5d2becb9f7c0,0x5d2becc6d380),
INFO: -fork=2: fuzzing in separate process(s)
INFO: -fork=2: 11516 seed inputs, starting to fuzz in /tmp/libFuzzerTemp.FuzzWithFork16003.dir
#0: cov: 83277 ft: 83277 corp: 11516 exec/s 0 oom/timeout/crash: 0/0/0 time: 44s job: 1 dft_time: 0
INFO: log from the inner process:
Segmentation fault
INFO: exiting: 139 time: 44s
cf::fuzzing_strategies: fork:2

where the reproduction case is empty.

Given that it seems like this probably wasn't caused by the Rust nightly being updated, and in the context of flate2-rs there were no other project changes in this time period as well.

@jonathanmetzman do you know if it's possible to see perhaps deployment records or logs or such for what changed between April 3 and April 4? Alternatively do you know of something else that can be done here to capture something like a core dump or get more logging as to what's happening? My guess now is that something about how Rust fuzzers are built needs to be updated to account for a recent change, but all the changes for projects between April 3 and April 4 I've looked at (here, clusterfuzz, llvm, etc) look unrelated.

alexcrichton commented 1 year ago

I'm not sure what happened but as of the May 11 fuzzing data both Wasmtime and flate2-rs now have a 0% startup crash percentage. I suppose I'll close this if it persists for a day or so.

alexcrichton commented 1 year ago

May 11 and onward fuzzing has a 0% startup crash rate for both flate2-rs and wasmtime. I'm not sure what changed, but I'll close this as it seems resolved.