google / oss-fuzz

OSS-Fuzz - continuous fuzzing for open source software.
https://google.github.io/oss-fuzz
Apache License 2.0

Seed corpus not being used for `aspell` project #2729

Closed kevina closed 4 years ago

kevina commented 5 years ago

My project aspell does not seem to be using the seed corpus. About two days ago I expanded the seed corpus to improve coverage, and yet coverage has not changed.

The files are currently not named based on the sha1 checksum. Is this a requirement? The manual strongly hints at this when it says:

The name of each file in the corpus is the sha1 checksum (which you can get using the sha1sum or shasum command) of its contents.
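
For reference, if checksum-based names were ever desired, a minimal shell sketch (assuming the seeds live in a local aspell_fuzzer_corpus/ directory, as in the zip listing below) would be:

    # Rename each seed file to the sha1 checksum of its contents.
    for f in aspell_fuzzer_corpus/*; do
      mv "$f" "aspell_fuzzer_corpus/$(sha1sum "$f" | cut -d' ' -f1)"
    done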

inferno-chromium commented 5 years ago

No, there is no naming requirement. I see "16260 Aug 19 19:32 aspell_fuzzer_seed_corpus.zip" in your gs://clusterfuzz-builds/aspell/aspell-address-201908200228.zip. Is this the right seed corpus, and is it being archived properly?

Also, check coverage locally with https://google.github.io/oss-fuzz/advanced-topics/code-coverage
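
For example, following those docs (a sketch; exact flags per the linked guide, run from an oss-fuzz checkout):

    # Build the target with coverage instrumentation, then generate a report.
    python infra/helper.py build_fuzzers --sanitizer coverage aspell
    python infra/helper.py coverage aspell --fuzz-target aspell_fuzzer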

kevina commented 5 years ago

I checked the seed corpus locally and the coverage was around 56%.

That is the correct file; unzip -l aspell_fuzzer_seed_corpus.zip gives:

  Length      Date    Time    Name
---------  ---------- -----   ----
      132  2019-08-20 02:30   aspell_fuzzer_corpus/email000
      108  2019-08-20 02:30   aspell_fuzzer_corpus/en_US-bad-spellers
      114  2019-08-20 02:30   aspell_fuzzer_corpus/en_US-fast
      116  2019-08-20 02:30   aspell_fuzzer_corpus/en_US-normal
      114  2019-08-20 02:30   aspell_fuzzer_corpus/en_US-slow
      115  2019-08-20 02:30   aspell_fuzzer_corpus/en_US-ultra
       87  2019-08-20 02:30   aspell_fuzzer_corpus/en_us_input
       86  2019-08-20 02:30   aspell_fuzzer_corpus/en_us_input_utf8
     2213  2019-08-20 02:30   aspell_fuzzer_corpus/html000
       65  2019-08-20 02:30   aspell_fuzzer_corpus/markdown001
...
---------                     -------
     7253                     60 files

Should the files inside the zip be in their own directory?

inferno-chromium commented 5 years ago

When we unpack it, we give it to libFuzzer/AFL, which does not care about directory structure. Are you saying that coverage on the fuzzer stats dashboard is lower than 60%?
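
For reference, OSS-Fuzz picks up a flat archive named <fuzz_target>_seed_corpus.zip from $OUT; a build.sh sketch (the source path is assumed from the helper command shown later in this thread):

    # zip -j junks directory paths, producing a flat archive of seeds.
    zip -j $OUT/aspell_fuzzer_seed_corpus.zip \
        $SRC/aspell-fuzz/aspell_fuzzer_corpus/*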

kevina commented 5 years ago

Yes, it is currently at around 51%, and coverage for aspell/modules/filter/markdown.cpp is 0%. If it were using the seed corpus, that should be around 85%: https://storage.googleapis.com/oss-fuzz-coverage/aspell/reports/20190820/linux/src/aspell/modules/filter/report.html

jonathanmetzman commented 5 years ago

This is a bit of a mystery to me. The coverage build isn't broken, and we seem to be unpacking the corpus based on the logs I see. Totally speculating, some other things we should look into:

  1. Whether something weird is happening with coverage.
  2. Whether the wrong seed corpus is unpacked (maybe the build is old?).
  3. Whether we are not unpacking the seed corpus to the correct place.

jonathanmetzman commented 5 years ago

Actually, it looks like the seed corpus was unpacked on the 17th, 18th, and 20th (when it started taking longer to unpack). So I'm predicting that the next coverage report that gets generated will cover the code that your seed corpus covers. I'm not sure why it didn't unpack on the 19th.

kevina commented 5 years ago

The seed corpus is still rather small, so I am not sure why it would take so long to unpack. I'll give it another day then.

Is there a place I can look to tell if the seed corpus was unpacked?

jonathanmetzman commented 5 years ago

The seed corpus is still rather small, so I am not sure why it would take so long to unpack. I'll give it another day then.

It isn't taking long, it's taking longer. As in:

  20th: 0.108174085617 seconds
  18th: 0.000488996505737 seconds
  17th: 0.000529050827026 seconds

There's nothing to worry about here; I was just noting for myself that it looks like a new seed corpus is being unpacked.

Is there a place I can look to tell if the seed corpus was unpacked?

You could download your project's corpus (use gsutil to download from gs://aspell-corpus.clusterfuzz-external.appspot.com) and do a coverage report on that (the names will be changed to SHA hashes, so you can't simply look for file names).
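
A sketch of that workflow (bucket path from above; the local directory name is arbitrary):

    # Download the working corpus, then run a local coverage report on it.
    mkdir -p working_corpus
    gsutil -m cp -r \
        gs://aspell-corpus.clusterfuzz-external.appspot.com/libFuzzer/aspell_fuzzer \
        working_corpus/
    python infra/helper.py coverage aspell --fuzz-target aspell_fuzzer \
        --corpus-dir working_corpus/aspell_fuzzer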

I'm 99% sure that this has nothing to do with the names of the files. I think it's more likely that this problem was caused by something like pruning failing on the 19th, a problem that will go away.

kevina commented 5 years ago

You could download your project's corpus and do a coverage report on that (the names will be changed to SHA hashes, so you can't simply look for file names).

Should the files from the seed corpus always be included when downloading the corpus via: gs://aspell-corpus.clusterfuzz-external.appspot.com/libFuzzer/aspell_fuzzer or gs://clusterfuzz-builds/aspell/aspell-address-DATE.zip?

jonathanmetzman commented 5 years ago

Should the files from the seed corpus always be included

These are very different things; let me explain.

gs://aspell-corpus.clusterfuzz-external.appspot.com/libFuzzer/aspell_fuzzer should contain the working corpus (i.e. all of the files added during fuzzing plus the pruned corpus from the night before). You can actually get a copy of the backup we make after pruning here. We can't guarantee that it will contain all of the seeds since we remove redundant/reduced ones during pruning.

gs://clusterfuzz-builds/aspell/aspell-address-DATE.zip contains the build which should include the seed corpus if you added it correctly. I think it was only brought up since there was a question about whether you added it correctly (you did).

I think the latest coverage report was generated Tuesday at 9 AM UTC-4, which is before the seed corpus was unpacked (Tuesday 12:15 PM PDT).

jonathanmetzman commented 5 years ago

The newest report shows coverage at ~56%.

Markdown is still at 0%. Are you sure it should be covered? If so, I can try to take another look.

kevina commented 5 years ago

gs://clusterfuzz-builds/aspell/aspell-address-DATE.zip contains the build

Oops, I meant gs://aspell-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/aspell_fuzzer/latest.zip. Sorry. But you answered my question about what that corpus contains anyway.
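
For anyone following along, that backup can be fetched and inspected like so (a sketch using the path above):

    # Grab the post-pruning corpus backup and unpack it locally.
    gsutil cp gs://aspell-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/aspell_fuzzer/latest.zip .
    unzip latest.zip -d corpus_backup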

kevina commented 5 years ago

Are you sure it should be covered? If so I can try to take another look.

Yes. The files named markdown001 to markdown050 in the seed corpus should test the markdown filter.

Here is what the coverage looks like when using just the seed corpus (python infra/helper.py coverage aspell --fuzz-target aspell_fuzzer --corpus-dir build/out/aspell/src/aspell-fuzz/aspell_fuzzer_corpus)

[Screenshot: filter coverage]

jonathanmetzman commented 5 years ago

Assigning to this week's sheriff.

kevina commented 5 years ago

Just to add another data point: in the coverage report for 2019-08-21, email.cpp was at 75% line coverage, but in the report for 2019-08-22 it was back down to 0%. There is some input in the seed corpus for the email filter, but I think the fuzzer stumbled upon the setting string to activate it on its own.

kevina commented 5 years ago

After closer examination of the corpus, I determined that the fuzzer did use the seed corpus (as it used pt_BR-001, which uses the pt_BR dictionary, which in turn uses features that en_US does not); however, it apparently found the input that exercises the Markdown filter code uninteresting. There was also one input file for the Email filter; the fuzzer used it for a day, but after that it found it uninteresting as well.

jonathanmetzman commented 5 years ago

it apparently found the input that uses the Markdown filter code uninteresting

I'm pretty sure there's a bug somewhere here (probably in CF); otherwise this input would be considered interesting and would be in the corpus. I did coverage reports locally and confirmed that:

  1. The seed corpus covers markdown.cpp.
  2. The working corpus does not cover markdown.cpp.
  3. The working corpus plus the seed corpus covers markdown.cpp.

I bet that if I copy the files from the seed corpus into the working corpus's cloud bucket, it will start being covered.

jonathanmetzman commented 5 years ago

There was also one input file for the Email filter; the fuzzer used it for a day, but after that it found it uninteresting as well.

As in coverage went down for Email?

kevina commented 5 years ago

As in coverage went down for Email?

Yes. In the coverage report for 2019-08-21, email.cpp was at 79% of lines covered, and in the report for 2019-08-22 it was back down to 0%. I first thought the fuzzer stumbled upon the right settings on its own (like with the TeX filter), but the coverage numbers for email.cpp match exactly what they are when using just the seed corpus.

Email coverage on 2019-08-21: [Screenshot: email coverage]

Dor1s commented 5 years ago

Interesting that the number of units in the corpus backup on the 22nd was higher than on the 21st:

https://storage.googleapis.com/oss-fuzz-coverage/aspell/reports/20190821/linux/src/aspell-fuzz/aspell_fuzzer.cpp.html

https://storage.googleapis.com/oss-fuzz-coverage/aspell/reports/20190822/linux/src/aspell-fuzz/aspell_fuzzer.cpp.html

Which is good and expected, but how could those files covering e.g. email.cpp disappear...

Dor1s commented 5 years ago

Checked coverage job logs -- nothing suspicious in there.

oliverchang commented 5 years ago

I took a look at the recent corpus pruning logs but also don't see anything obviously wrong. Could there be any nondeterminism coming from the target?

kevina commented 5 years ago

Could there be any nondeterminism coming from the target?

There shouldn't be. If there is I would consider it a bug.

Dor1s commented 5 years ago

That's an interesting point. AFL reports only 25-40% stability: https://oss-fuzz.com/fuzzer-stats/by-day/date-start/2019-08-15/date-end/2019-08-28/fuzzer/afl_aspell_fuzzer/job/afl_asan_aspell

kevina commented 5 years ago

AFL reports only 25-40% stability

@Dor1s are you trying to tell me my target is behaving nondeterministically?

If so, is there a way to find testcases that create different output when run multiple times?

Dor1s commented 5 years ago

@kevina I can't guarantee that's the case, but based on AFL's logic for evaluating "stability", it does seem to recognize many parts of the target as non-deterministic :/

I'm trying a couple of things locally and will get back to you if I find anything useful.
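
One way to hunt for such inputs locally (a rough sketch, assuming a libFuzzer build of aspell_fuzzer and a local corpus/ directory; libFuzzer's -print_coverage=1 prints coverage at exit):

    # Run each testcase twice and flag inputs whose coverage output differs.
    for f in corpus/*; do
      ./aspell_fuzzer -print_coverage=1 "$f" 2> run1.txt
      ./aspell_fuzzer -print_coverage=1 "$f" 2> run2.txt
      diff -q run1.txt run2.txt > /dev/null || echo "possibly nondeterministic: $f"
    done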

cmeister2 commented 5 years ago

https://github.com/ocaml/ocaml/issues/7612 indicates that caching might cause AFL to report a target as unstable. Maybe the issue is the GlobalCacheBase?

jonathanmetzman commented 5 years ago

The instability would have to be preventing markdown.cpp from being reached deterministically.

jonathanmetzman commented 5 years ago

I'm gonna test my theory that seed unpacking is broken (and not merging) by copying each file from the seed corpus into the working corpus. If the coverage improvements happen in the next report, then unpacking is broken.
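
Something like the following, presumably (a hypothetical sketch; as noted above, ClusterFuzz stores corpus files under content-hash names, so the seeds are renamed on upload):

    # Copy each seed into the working-corpus bucket under its sha1 name.
    for f in aspell_fuzzer_corpus/*; do
      gsutil cp "$f" \
          "gs://aspell-corpus.clusterfuzz-external.appspot.com/libFuzzer/aspell_fuzzer/$(sha1sum "$f" | cut -d' ' -f1)"
    done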

Dor1s commented 5 years ago

I'm trying a couple of things locally and will get back to you if I find anything useful.

I was trying to do corpus minimization differently, but didn't notice anything suspicious.

jonathanmetzman commented 5 years ago

Another theory: @kevina, the dict/ directory in the build seems to control whether markdown.cpp is covered (I tried removing it and did a coverage report on the seed corpus; markdown.cpp is no longer covered). So if the coverage report tomorrow doesn't show markdown.cpp as covered (I explicitly added the seed corpus to the working corpus), then this is my best guess.

kevina commented 5 years ago

@jonathanmetzman the dict/ directory is required for any coverage. Without it, Aspell won't find the needed data files (including the speller dictionary) and will return an error.
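
For context, OSS-Fuzz targets typically ship auxiliary runtime data by copying it into $OUT next to the fuzzer binary; a build.sh sketch (the source path is an assumption):

    # Ship the data files (dictionaries etc.) alongside the fuzzer binary
    # so Aspell can find them at fuzz time.
    cp -r $SRC/aspell-fuzz/dict $OUT/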

jonathanmetzman commented 5 years ago

@jonathanmetzman the dict/ directory is required for any coverage. Without it, Aspell won't find the needed data files (including the speller dictionary) and will return an error.

OK, so tomorrow we should see the coverage report containing coverage of markdown.cpp, and we can see why our unpacking is broken.

jonathanmetzman commented 5 years ago

OK, so tomorrow we should see the coverage report containing coverage of markdown.cpp, and we can see why our unpacking is broken.

The new coverage report doesn't cover markdown.cpp.

So something is up with pruning or this target behaves weirdly.

nwellnhof commented 4 years ago

I'm seeing a similar issue with libxml2. I expanded the seed corpus of the xml fuzzer two weeks ago, but the coverage report still shows quite a few code blocks as uncovered which really should be covered now.

Dor1s commented 4 years ago

@nwellnhof is that still the case? There was some regression in LLVM affecting code coverage tools (https://github.com/google/oss-fuzz/issues/4348).

Could you take another look at the stats and let us know if you still see that missing coverage?

https://oss-fuzz.com/fuzzer-stats?group_by=by-day&date_start=2020-08-01&date_end=2020-08-31&fuzzer=libFuzzer&job=libfuzzer_asan_libxml2&project=libxml2

If possible, please start a new issue for that (if the problem is still present).

nwellnhof commented 4 years ago

Coverage looks good now.

Dor1s commented 4 years ago

Thanks for checking!

cmeister2 commented 4 years ago

Should this thread have been closed? I don't think @kevina has responded saying the problem is fixed for aspell...

On Thu, 3 Sep 2020, 18:33, Max Moroz wrote:

Closed #2729.

Dor1s commented 4 years ago

@cmeister2 good catch! The last time aspell was discussed here was over a year ago, and in the current reports I see markdown.cpp being covered: https://storage.googleapis.com/oss-fuzz-coverage/aspell/reports/20200909/linux/src/aspell/modules/filter/markdown.cpp.html

@kevina please comment / re-open if the issue still persists for you.