bagder opened this issue 2 years ago
I think this might be an issue with AFL and its instrumentation rather than Curl code, CC @vanhauser-thc
there is nothing I can do there. I need the testcase that produces that crash so I can have a look. (and I have no permission to view the issue on chromium; the curl issue has no content that helps me in any way)
The curl fuzzer has run as part of oss-fuzz for years and is fully public here: https://github.com/curl/curl-fuzzer
could you share the testcase (i.e. the crash-* file to trigger the heap overflow) @bagder ?
here it is: clusterfuzz-testcase-minimized-curl_fuzzer_gopher-4644400603463680
14 bytes
$ hd clusterfuzz-testcase-minimized-curl_fuzzer_gopher-4644400603463680
00000000 00 01 00 00 00 07 46 49 3a 2f 00 01 00 41 |......FI:/...A|
0000000e
I cannot reproduce it:
# ./curl_fuzzer_gopher clusterfuzz-testcase-minimized-curl_fuzzer_gopher-4644400603463680
Warning: AFL++ tools might need to set AFL_MAP_SIZE to 74989 to be able to run this instrumented program if this crashes!
Reading 14 bytes from clusterfuzz-testcase-minimized-curl_fuzzer_gopher-4644400603463680
Execution successful.
but there was a bugfix 2-3 days ago for afl++ in oss-fuzz. could that have fixed it? otherwise I need the buildlog to be able to build the exact same gopher binary (several instrumentation options are random)
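(For illustration, the build-time randomness referred to here works roughly like the following sketch. This is not the actual OSS-Fuzz AFL++ build setup and the probabilities are made up; it only shows why two builds of the same source can be instrumented differently. $OUT is the OSS-Fuzz output directory, and afl_options.txt is the options dump discussed later in this thread.)

# Sketch only: each build rolls the dice for a few AFL++ options.
rm -f "$OUT/afl_options.txt"
if [ $((RANDOM % 10)) -lt 3 ]; then
  export AFL_LLVM_LAF_ALL=1     # sometimes split comparisons (laf-intel)
fi
if [ $((RANDOM % 10)) -lt 4 ]; then
  export AFL_ENABLE_CMPLOG=1    # sometimes build an additional cmplog binary
fi
# Record whatever was chosen so a crash can later be tied to the exact build.
env | grep '^AFL_' > "$OUT/afl_options.txt" || true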
We're several people who fail to repro it. oss-fuzz insists though... :grin:
oss-fuzz said "This crash occurs very frequently on linux platform" as recently as 14 hours ago
:) if someone can paste the excerpt of a build log that fails (lines that contain "AFL_") then I can have a look
btw, it indicates this as the regression range for AFLplusplus: https://github.com/AFLplusplus/AFLplusplus/commit/7777045c09c404b1274c930788317525fedb43ad
Another strong hint that this is an AFL++ issue can be seen on this part of the testcase page:
TESTCASE ANALYSIS ON OTHER JOBS
No reproducible variants found.
"Real" bugs in AFL targets should also be reproducible in the libFuzzer ASAN targets
I actually was able to reproduce this with the build linked to from OSS-Fuzz: https://storage.cloud.google.com/clusterfuzz-builds-afl/curl/curl-address-202111210610.zip
I think I know why no one could reproduce this if this is an instrumentation issue. Since AFL++ builds randomly decide to use cmplog instrumentation, instrumentation crashes may need to be reproduced on the build on which they were found in OSS-Fuzz. This of course is not an issue in 99% cases because 99% of cases are not crashes due to this random instrumentation but are due to actual bugs in the code.
There are currently similar issues happening in other projects https://github.com/google/oss-fuzz/issues/6844#issuecomment-974005289
In the comment I asked your opinion on this, i.e. on the fact that it's not possible to reliably reproduce an issue in this type of event - is it worth addressing this?
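(For reference, reproducing on such a build boils down to roughly the following shell steps. This is a sketch: the direct storage.googleapis.com download form and the archive layout are assumptions, while the binary and testcase names are the ones already used in this thread.)

# Fetch the exact AFL++-instrumented build referenced above (the
# storage.googleapis.com form of the URL is an assumption; the
# storage.cloud.google.com link may require a Google login instead).
wget https://storage.googleapis.com/clusterfuzz-builds-afl/curl/curl-address-202111210610.zip
unzip curl-address-202111210610.zip -d curl-address-202111210610

# Run the testcase against the binary from that exact build; a fresh
# rebuild may pick different random instrumentation and not reproduce.
cd curl-address-202111210610
./curl_fuzzer_gopher ../clusterfuzz-testcase-minimized-curl_fuzzer_gopher-4644400603463680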
I actually was able to reproduce this with the build linked to from OSS-Fuzz : https://storage.cloud.google.com/clusterfuzz-builds-afl/curl/curl-address-202111210610.zip
Edit: can reproduce it, just used the wrong binary ...
but I am unable to test if my fix works, as I do not know exactly how the binary needs to be built. I will have a look with gdb ...
the rewrite of cmplog was complex and corner cases are an issue, sigh ... this did not come up in fuzzbench, which is my second line of CI/unit tests ;)
In the comment I asked your opinion on this, i.e. on the fact that it's not possible to reliably reproduce an issue in this type of event - is it worth addressing this?
for me it is worth it. if there is a bug it should be fixed.
I actually was able to reproduce this with the build linked to from OSS-Fuzz : https://storage.cloud.google.com/clusterfuzz-builds-afl/curl/curl-address-202111210610.zip
I tried this - and it does not crash with the testcase that was supplied.
Weird. I thought I cracked the case.
In the comment I asked your opinion on this, i.e. on the fact that it's not possible to reliably reproduce an issue in this type of event - is it worth addressing this?
for me it is worth it. if there is a bug it should be fixed.
+1
Just to clarify, I meant addressing the issue of reliably reproducing, i.e. the solution would be creating some added logic in OSS-Fuzz that makes it possible to build AFL++ in the exact same setting as the one where the bug happened.
the build log prints the AFL_... settings that were chosen. With that information it is enough - but it needs hands-on work to edit compile_afl.
because of bash -eu I cannot test for envs that do not exist, so I am not sure how I could force a config without editing the script :(
(I could do "env | grep ..." but I am not sure that is an acceptable solution. I did that originally, but Jonathan didn't like it.)
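(A minimal illustration of the bash -eu constraint meant here; the variable name is just an example.)

#!/bin/bash -eu
# Under "set -u", expanding a variable that was never set aborts the script,
# so the build script cannot simply test for optional AFL_ settings:
#
#   [ "$AFL_LLVM_LAF_ALL" = "1" ]   # aborts: "AFL_LLVM_LAF_ALL: unbound variable"
#
# Workarounds: give the expansion a default, or grep the environment.
if [ "${AFL_LLVM_LAF_ALL:-0}" = "1" ]; then
  echo "laf-intel splitting requested"
fi
env | grep '^AFL_' || true   # the "env | grep ..." approach mentioned above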
oss-fuzz just closed the original curl issue 41258 now as "verified". I believe we can then also consider this issue closed!
Thanks!
We just got a new oss-fuzz issue filed that looks very similar and also appears to be A) a false positive and B) not possible to reproduce for us. Want us to file a new issue?
Sorry about that (and for the delay in responding, there was a public holiday in the US). I'll reopen this. IMO there's something wrong with one of AFL++'s modes, lots of random breakages.
Random, yes. We got two new issues filed (the new one mentioned here and then yet another one) but not too long after, they were both again closed as verified without us doing anything...
There are similar stories in Binutils, where we have recently seen a lot (20+) of dynamic-stack-buffer-overflows which are opened and then closed relatively shortly after; examples include:
I have had a fix since last week, but I still need the fuzzbench run that was accepted yesterday to complete before I make a PR to oss-fuzz.
my commit has been in for a few days now - are the issues gone?
We had three issues in curl and they're all gone.
We had three issues in curl and they're all gone.
Closing this issue now as the Curl issues are gone.
Just as a follow-up: we've had numerous false positive reports since then, to the extent that I no longer even look at oss-fuzz reports when they come in...
Thanks @bagder - and sorry for the inconvenience with the false positives. Could you provide links to the false positives?
I have one other false positive from apache-httpd:
Basically all oss-fuzz reported on curl for the last few months (except that we actually got two "real" one this week):
Here are some of them:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43200 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=41403 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=41390 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40254 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40824 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=36072
I think I missed some of the more recent ones but I can't really figure out how to list them all.
All these have gone verified/fixed all by themselves.
Thanks for the links and info! We'll take a look at these and also monitor if they persist.
I think https://github.com/google/oss-fuzz/pull/7026 might fix these issues and that PR was merged two days ago. As such, it may be that the false positives will stop occurring from now on.
fingers crossed
I am not sure this is an afl++ thing because when I check the linked reports several are from libfuzzer: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40254 - Crash Revision: https://oss-fuzz.com/revisions?job=libfuzzer_asan_i386_curl&revision=202110220613 same for https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=36072 and https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43200
you're right! Thanks @vanhauser-thc !
Looking at the crash state it's also unreproducible and related to mmap. @bagder Could this be some memory allocation exhaustion that is non-deterministic, but is in fact a true positive?
I'm sorry but I don't keep track of what particular thing triggers the oss-fuzz reports I get. I just know that over the last few months I've basically not gotten any correct reports at all, but a lot of reports that fix themselves after a few days of me not doing anything at all.
@bagder - I am really sorry to hear about these false positive issues. We will work with the AFL++ developers and also try to disable any particular instrumentation modes that are causing this. Fuzzing should have zero false positives, so sorry for this experience.
@inferno-chromium this is not an afl++ issue. libfuzzer is also reporting crashes.
@inferno-chromium this is not an afl++ issue. libfuzzer is also reporting crashes.
In that case, we should probably look at the regression range of those filed bugs and see if there is any curl commit that caused this badness. @DavidKorczynski @jonathanmetzman ?
I think what has happened is that the AFL++ false positives and the unreproducible libfuzzer issues were both perceived as false positives, and from the perspective of @bagder they were both treated as "issues that can't be reproduced". Please correct me if I'm wrong @bagder, but it would be great to hear your thoughts on the user experience. I think the libFuzzer issues are true issues - albeit unreproducible ones, and whether to keep those has been debated in the past, if I remember correctly.
With regards to false positives: I believe with https://github.com/google/oss-fuzz/pull/7127 the hope is all should be sorted.
With regards to unreproducible issues: maybe we should add an option that allows users to opt out of receiving those?
Basically all oss-fuzz reported on curl for the last few months (except that we actually got two "real" one this week):
Here are some of them:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43200 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=41403 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=41390 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40254 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40824 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=36072
I think I missed some of the more recent ones but I can't really figure out how to list them all.
Bucketizing these issues, as they are two separate categories (unreproducible versus false positive):
I believe all AFL++ issues have been fixed with @vanhauser-thc's recent updates, and I would not expect the unreproducible libfuzzer issues to disappear.
Speaking of the unreproducible issues, there look to be two issues that have been reported multiple times (see this link):
- curl_fuzzer_rtsp: Null-dereference WRITE with empty stacktrace, e.g. https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=40254
- curl:curl_fuzzer_rtsp: Sanitizer CHECK failure in "((0 && "unable to mmap")) != (0)" (0x0, 0x0), e.g. https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=36072
Both of these have been reported on-and-off for more than a year, and one as early as 2019: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17767&q=Proj%3Dcurl%20Type%3DBug&can=1
The unreproducible curl:curl_fuzzer_rtsp: Sanitizer CHECK failure in "((0 && "unable to mmap")) != (0)" (0x0, 0x0) issue has been reported 15 times since 2019. Perhaps it makes sense to look into improving issue triaging? https://bugs.chromium.org/p/oss-fuzz/issues/list?q=Proj%3Dcurl%20Type%3DBug%20label%3DUnreproducible&can=1
In order to improve the user experience and remove noise of false positives/unreproducible issues described in the comments above, I think there are a number of solutions:
Speaking of the httpd reports in https://github.com/google/oss-fuzz/issues/6879#issuecomment-1013536843, the latest is from yesterday (https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=43654) and was auto-closed today (still without anything done on the httpd side). I don't know if #7127 was tested against this particular report; in any case the httpd team does not see how it can be valid, since the reproducer provided for each of these reports is almost always the same and makes no sense with regard to the reported Heap-buffer-overflow (unless the original code is instrumented somehow..). Also, if there should be an opt-out option for receiving such issues, I'd suggest that they are not made public either or it might confuse users (which I think is the case already today for the issues being closed/reopened each week).
Interestingly (possibly?), like for https://github.com/curl/curl/issues/8041, the stacktrace for the httpd case also bypasses a strcasecmp()-like function which (supposedly) the tool thinks is bypassed by an inner \0, while it's not (absent stray photons and/or multiple pairs of eyes in the httpd team that suddenly went blind).
Thanks for the details @ylavic !
In the httpd case the issue does not seem to be unreproducibility but rather the AFL++ issues. Looking at:
They are all related to the same issue and based on AFL++ - they are false positives, i.e. they are reproducible, although in this case one has to ensure AFL++ is compiled in the right way for reproducing, and if one is not aware of this one might conclude that they are unreproducible. I think several things should be improved based on the httpd case, e.g.
- deduplication should be improved so the same issue is not reported many times - and this should be done independently of whether the on-off-on-off is due to false positives or unreproducibility.
- the afl++ false positives should be solved. It may be that the afl++ fixes in #7026 did this, but I am unsure - since the bug came up again on the 14th I would assume not. I am not sure if the PR "Ignore fuzz setup problem detection during building with afl++" (#7127) has updates for false positive fixing.
Re deduplication, we usually do account for previously filed bugs: https://github.com/google/clusterfuzz/blob/6420424de00585e01038cb83e7b4d55b31ad64b3/src/appengine/handlers/cron/triage.py#L229 but the window that we look back over is 2 days. Perhaps we can be a little smarter here and increase the window exponentially up to a limit based on how many times we've already filed the bug, but this can have other issues.
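(A toy sketch of that exponential-window idea, just to make the arithmetic concrete; the real triage logic linked above lives in ClusterFuzz and the constants here are invented.)

#!/bin/bash -eu
# Grow the dedup lookback window with each re-filing, capped at a maximum.
times_filed=${1:-3}   # how often this crash signature was already filed
base_days=2           # the current fixed window mentioned above
max_days=90           # invented cap

window=$(( base_days * (1 << times_filed) ))   # 2, 4, 8, 16, ... days
if [ "$window" -gt "$max_days" ]; then
  window=$max_days
fi
echo "look back $window days before filing this crash again"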
It appears that disabling instrumentation randomness (https://github.com/google/oss-fuzz/issues/7039) would have prevented this bug opening/closing cycle completely. Randomly choosing instrumentation at build time does not fit well with ClusterFuzz's logic for bug filing/closing.
@DavidKorczynski can you help me here and point me to an unreproducable testcase plus the build log of the target of the instance that produced that?
@oliverchang #7026 likely won't fix this, as this is something different. I first need to check what the issue actually is. Without any data (what the build options were, and a testcase) I cannot start to check what the issue could be.
Concerning the test infrastructure: we have a CI etc., but given the mass and complexity of targets in oss-fuzz it cannot catch the many corner cases. We test with fuzzbench, but a) it runs a much older llvm and b) it doesn't always run reliably (like a current run that is not starting).
So I am not sure what the solution for a thorough test setup would be (we are individual devs who do not make any money with afl++ (and don't want to), so we have no resources for a multi-machine test setup for lots of targets).
@inferno-chromium @jonathanmetzman
@DavidKorczynski can you help me here and point me to an unreproducable testcase plus the build log of the target of the instance that produced that?
I couldn't get the build log for some reason, but I could list the contents of afl_options.txt - does this help? If so, they are listed below. Also, I sent you the stack trace by email.
apache-httpd_apache-httpd-address-202111280605
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=1
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_LLVM_DICT2FILE=/workspace/out/afl-address-x86_64/afl++.dict
AFL_QUIET=1
apache-httpd_apache-httpd-address-202111130610
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=1
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_LLVM_DICT2FILE=/workspace/out/afl-address-x86_64/afl++.dict
AFL_QUIET=1
apache-httpd_apache-httpd-address-202112200604
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=1
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_IGNORE_UNKNOWN_ENVS=1
AFL_LLVM_DICT2FILE=/workspace/out/afl-address-x86_64/afl++.dict
AFL_QUIET=1
apache-httpd_apache-httpd-address-202201140609
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=1
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=20
AFL_IGNORE_UNKNOWN_ENVS=1
AFL_LLVM_DICT2FILE=/workspace/out/afl-address-x86_64/afl++.dict
AFL_QUIET=1
apache-httpd_apache-httpd-address-202112110607
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=0
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_IGNORE_UNKNOWN_ENVS=1
AFL_QUIET=1
apache-httpd_apache-httpd-address-202201060600
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=0
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_IGNORE_UNKNOWN_ENVS=1
AFL_QUIET=1
apache-httpd_apache-httpd-address-202111200609
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=0
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_QUIET=1
apache-httpd_apache-httpd-address-202112310603
AFL_LLVM_MODE_WORKAROUND=0
AFL_ENABLE_DICTIONARY=0
AFL_LLVM_INSTRUMENT=CLASSIC,CTX-2
AFL_ENABLE_CMPLOG=0
AFL_LLVM_LAF_ALL=1
AFL_LAF_CHANCE=30
AFL_IGNORE_UNKNOWN_ENVS=1
AFL_QUIET=1
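(Given option dumps like the ones above, re-creating an identically instrumented binary could look roughly like this. It is a sketch, not an existing OSS-Fuzz helper, and it assumes the plain KEY=VALUE format shown above.)

#!/bin/bash -eu
# Re-export the recorded AFL_ settings so a rebuild picks the same
# (otherwise randomly chosen) instrumentation as the crashing build.
while IFS='=' read -r key value; do
  [ -n "$key" ] || continue
  export "$key=$value"
done < afl_options.txt

env | grep '^AFL_'   # sanity check: the recorded options are now set
# ...then run the normal build (e.g. "compile" inside the oss-fuzz build container).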
Over in the @curl project we've received this oss-fuzz issue over the last few days: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=41258 (limited access). It is said to happen "very frequently".
The stack trace is puzzling. None of us in the project have been able to reproduce it. We've stared at the code in question and we cannot see nor explain how the fuzzer can reach the point where it reports it hits the buffer overflow.
We're starting to lean toward the explanation that this is somehow a false positive from the fuzzer. I know this is a tired old explanation that is debunked in almost all cases, but... we cannot see a better explanation!
We've published most details of the issue in https://github.com/curl/curl/issues/8041 to give everyone access and the chance to dig into the details and find the answer.