Remove buffer overrun from unsoundly modeled semantics

adriaanjacobs commented 1 year ago

Immediately creating a pull request this time instead of an issue first, hope that's alright.

For C/C++, "assuming that buffer overrun is impossible" is listed as an unsound behavior for points-to analyses. Although I agree that points-to analyses assume that the program does not exhibit undefined behavior by accessing memory out of bounds, I do not think this should be regarded as "unsound" modeling similarly to the others on the list. Points-to sets are often used to determine the valid set of objects a pointer may refer to in security applications, which is then somehow enforced at run time. If a "sound" points-to analysis is defined as an analysis that considers buffer overruns possible, this entire use case for points-to analysis dissappears. In fact, the security community is increasingly recognizing the soundness problems of most points-to analyses since they lead to false positives, and coming up with various fixes (e.g., CryptoMPK at S&P'22). The "unsound" modeling of buffer overruns is very much an intended effect of points-to analysis in my research area, which entirely underpins it its usefulness.

It may be that this is a terminology/definition/semantics issue that I am not aware of. I was just wondering if there was any reason to include buffer overruns in this list? Happy to close the pull request if I'm missing some use cases where they would be useful to model.

msridhar commented 1 year ago

So I think the key question here is what the assumptions are about the runtime semantics of the language. If you assume that statically-computed points-to sets will be "enforced" at runtime, so that a pointer cannot take on a value outside that set, then buffer overflows cannot be a source of analysis unsoundness, as you note. But, AFAIK most C/C++ programs are not executed with this type of runtime enforcement. So, in most cases, I do think that out-of-bounds memory accesses could lead to analysis unsoundness, in that the computed points-to sets are not necessarily an overapproximation of possible runtime behaviors. WDYT?

adriaanjacobs commented 1 year ago

I see what you mean. Although it's an interesting analysis to figure out the maximal possible offsets to certain pointers, and which objects they could access accordingly, this seems hugely dependent on memory layout, and likely results in everything aliasing. A semantics-preserving transformation like a compiler optimization pass can safely assume they will not happen, backed by the C/C++ language standards. So I suppose it's definitely valid semantics for a C/C++ pointer analysis to assume they will never happen, and consequently not consider omitting their effects "unsound".

Of course, this entire argument also holds for some of the other features on the list, notably the relative pointers. Nevertheless, from a practical standpoint, out-of-bounds accesses are much less "expected to work" than relative pointers, and enjoy much less high-level support (e.g. MSVC's "base pointer" feature). So I guess it's a matter of how pragmatic vs correct we aim to be on the website, if you agree with that classification.

Thanks for replying by the way, I just saw the list and was surprised to see out-of-bounds accesses on there. It's totally fine by me to keep it that way, having this PR and our short discussion here already provides some context to that element.

msridhar commented 1 year ago

I think a lot of these categorizations are fuzzy and require some judgement. It might be useful to categorize which tricky C/C++ features we list are undefined behavior vs. not. This way, a soundy analysis could just say "sound assuming no undefined behavior [cite], and the absence of the following features: ...". But I'd probably want another opinion (e.g. from @yanniss) before making this kind of change.

soundiness / soundiness.github.io

Remove buffer overrun from unsoundly modeled semantics #10