rust-fuzz / arbitrary

Generating structured data from arbitrary, unstructured input.
https://docs.rs/arbitrary/
Apache License 2.0
727 stars 73 forks source link

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Open Ekleog opened 9 months ago

Ekleog commented 9 months ago

Hey!

I've been thinking a lot about arbitrary recently, had a chat with @nagisa, and they told me to report all my thoughts here for your consumption. So here are a few thoughts, hoping to give ideas, gather feedback, and maybe eventually I'll write a PR if you agree with the concept :)

There are three main points that have been bothering me with arbitrary, and my alternating usage of both it and bolero:

  1. The fact it only supports byte slices as inputs, and not RNGs to double as proptests. The chat just now with @nagisa made me understand why that's not actually a problem but a good thing, for minimization reasons.
  2. The fact that when writing manual implementations of Arbitrary, it's very easy to get stack overflows. Just yesterday, I had an enum with enum Foo { Variant(Vec<Foo>), ... }, and got a stack overflow by doing things the naive way. This is basically this issue, and it's already been closed so I won't relitigate, but I'll just say that having a freestanding arbitrary::recursion_guard() that updates a thread-local counter on each nested call to Unstructured:: functions would probably solve the issue without complicating the Arbitrary trait itself.
  3. The one I'm actually writing this issue for. Currently, the way arbitrary generates collections is one I had not heard of yet: it tests after each element whether to take a new element. This might look good, but a very simple change (eg. changing the length of one element) can lead to a complete structural change, especially with nested collections.

In order to avoid the issue in point 3, a magic separator-based solution should help the fuzzer identify the input: if the length of one element changes, then either the input will stop parsing (if the length would go beyond the magic separator), or the input will parse with the exact same structure. In turn, this means that small changes in input lead to small changes in structure, which is a great property for fuzzers.

On the other hand, this only works for fuzzers. Such a scheme would definitely not work for "lightweight" tests like proptests, because they'd have a way too high fail-to-parse rate for that.

Hence, I'm thinking there'd be the need for at least two versions of Unstructured: one that handles collections as it does right now (stop-or-continue check), and one that handles collections with a magic delimiter (plus escaping and nested delimiters, non-trivial to implement but should be doable).

Incidentally, the version that uses a magic delimiter might hit some issues with the current Unstructured API; due to the lifetimes there that require borrowing from the Unstructured, yet unescaping the magic word would require mutating the Unstructured data (or copying it, which would be sad)

So I wanted to first ask: what do you think about the idea? :)

camshaft commented 9 months ago

I've been thinking about a third option for bolero generators for a few months now. I finally had a bit of time to write my thoughts about it: https://github.com/camshaft/bolero/issues/203.

nathaniel-brough commented 9 months ago

Not a maintainer, but I thought I'd chime in with some thoughts. Namely it'd be nice to constrain Unstructured such that it becomes easier to implement de-arbitrary at the moment this is quite a difficult problem as some of the operations in Unstructured are not easily reversible. This would make building seed-corpora easier and also enable structured custom mutators.