[clang][analyzer] Potential Improvement for deduplication strategy in Clang StaticAnalyzer and Clang-Tidy

shenjunjiekoda commented 2 months ago

I hope this message finds you well. I'd like to bring up a discussion about the current deduplication strategy used in both Clang Static Analyzer and Clang-Tidy. I've noticed that PathDiagnosticConsumers typically use the warning or error information from a Diagnostic as the basis for deduplication. While this approach works well in many cases, I believe there might be room for improvement in certain scenarios. Allow me to illustrate with an example:

inline void XXX_memset(void *ptr, int c, size_t n) {
  if (ptr != NULL && (n != (size_t)0)) {
    memset(ptr, c, n);
  }
}

void caller1() {
  int arr[100];
  XXX_memset((void *)arr, 0, 101 * sizeof(int)); // overflow1 root cause
}

void caller2() {
  int *arr = (int *)malloc(10 * sizeof(int));
  XXX_memset((void *)arr, 0, 11 * sizeof(int)); // overflow2 root cause
}

In this code, we have two potential buffer overflow issues. However, because the sink point for both is within the XXX_memset function (specifically, the memset call), the current strategy might deduplicate these as a single issue due to identical warning locations and diagnostic messages. This is despite the fact that the notes and triggering paths differ, and the root causes actually lie in the calling functions rather than in XXX_memset.

This observation leads me to pose two main questions:

Is deduplication based solely on the analyzer's sink point information (or ASTNode location for Clang-Tidy) always appropriate? Should diagnostics with different notes be considered distinct and not deduplicated?
Could the current mechanism potentially lead to underreporting in a single invocation of Clang Static Analyzer tools? If we agree that the diagnostics in the example should be treated separately, it's because the root cause of the errors is in the parameter construction at the call sites, not at the sink point or its containing function.

For the first question, I have a preliminary suggestion: We could introduce an option in PathDiagnostic's Profile to allow choosing between using the current Profile or a FullProfile (considering all path information) for generating the FoldingSetID. If this approach seems promising, I'd be happy to submit a merge request to implement this feature.

Regarding the second question, if we consider all paths in the static analyzer, finding the root cause for appropriate deduplication becomes challenging. Simply considering all notes might not effectively deduplicate based on root causes. Perhaps we could explore a method that only uses notes from the bug visitor and relevant value slice information for deduplication?

I'm very interested in hearing the LLVM/Clang community's thoughts on these points. Thank you for your time and consideration.

llvmbot commented 2 months ago

@llvm/issue-subscribers-clang-static-analyzer

Author: JOSTAR (shenjunjiekoda)

I hope this message finds you well. I'd like to bring up a discussion about the current deduplication strategy used in both Clang Static Analyzer and Clang-Tidy. I've noticed that *PathDiagnosticConsumers* typically use the warning or error information from a Diagnostic as the basis for deduplication. While this approach works well in many cases, I believe there might be room for improvement in certain scenarios. Allow me to illustrate with an example: ```cpp inline void XXX_memset(void *ptr, int c, size_t n) { if (ptr != NULL && (n != (size_t)0)) { memset(ptr, c, n); } } void caller1() { int arr[100]; XXX_memset((void *)arr, 0, 101 * sizeof(int)); // overflow1 root cause } void caller2() { int *arr = (int *)malloc(10 * sizeof(int)); XXX_memset((void *)arr, 0, 11 * sizeof(int)); // overflow2 root cause } ``` In this code, we have two potential buffer overflow issues. However, because the sink point for both is within the `XXX_memset` function (specifically, the `memset` call), the current strategy might deduplicate these as a single issue due to identical warning locations and diagnostic messages. This is despite the fact that the notes and triggering paths differ, and the root causes actually lie in the calling functions rather than in `XXX_memset`. This observation leads me to pose two main questions: 1. Is deduplication based solely on the analyzer's sink point information (or ASTNode location for Clang-Tidy) always appropriate? Should diagnostics with different notes be considered distinct and not deduplicated? 2. Could the current mechanism potentially lead to underreporting in a single invocation of Clang Static Analyzer tools? If we agree that the diagnostics in the example should be treated separately, it's because the root cause of the errors is in the parameter construction at the call sites, not at the sink point or its containing function. For the first question, I have a preliminary suggestion: We could introduce an option in PathDiagnostic's `Profile` to allow choosing between using the current `Profile` or a `FullProfile` (considering all path information) for generating the FoldingSetID. If this approach seems promising, I'd be happy to submit a merge request to implement this feature. Regarding the second question, if we consider all paths in the static analyzer, finding the root cause for appropriate deduplication becomes challenging. Simply considering all notes might not effectively deduplicate based on root causes. Perhaps we could explore a method that only uses notes from the bug visitor and relevant value slice information for deduplication? I'm very interested in hearing the LLVM/Clang community's thoughts on these points. Thank you for your time and consideration.

NagyDonat commented 2 months ago

I'm very interested in hearing the LLVM/Clang community's thoughts on these points.

I'll try to provide some answers from the static analyzer side of the things, where I'm actively contributing since early 2023.

My impression is that there are no active plans for changing the behavior of the analyzer in these areas. (However @haoNoQ, @Xazax-hun or @steakhal may know more, e.g. about older plans.)

I knew that the current behavior of the analyzer can lead to underreporting when the report deduplication conflates several reports with logically distinct causes, and I'd guess that this possibility is common knowledge among the main contributors -- but I feel that it would be difficult to find a heuristic that is significantly more practical than the current one.

Unfortunately, the deduplication is really important, because on real-world code an "average" bug location can be reached through many different code paths (e.g. each irrelevant if before the bug introduces a bifurcation) and the analyzer will traverse many of these. Even if we only extend the profiling to the note tags (instead of the raw path choice information), we'll still profile lots of garbage information.

My impression is that duplicated copies of reports are really annoying for the user, so it's better to err on the side of more aggressive deduplication.

Szelethus commented 2 months ago

This observation leads me to pose two main questions:

Is deduplication based solely on the analyzer's sink point information (or ASTNode location for Clang-Tidy) always appropriate? Should diagnostics with different notes be considered distinct and not deduplicated?

Could the current mechanism potentially lead to underreporting in a single invocation of Clang Static Analyzer tools? If we agree that the diagnostics in the example should be treated separately, it's because the root cause of the errors is in the parameter construction at the call sites, not at the sink point or its containing function.

For the first question, I have a preliminary suggestion: We could introduce an option in PathDiagnostic's Profile to allow choosing between using the current Profile or a FullProfile (considering all path information) for generating the FoldingSetID. If this approach seems promising, I'd be happy to submit a merge request to implement this feature.

Regarding the second question, if we consider all paths in the static analyzer, finding the root cause for appropriate deduplication becomes challenging. Simply considering all notes might not effectively deduplicate based on root causes. Perhaps we could explore a method that only uses notes from the bug visitor and relevant value slice information for deduplication?

I mean, those are rather loaded questions :) @NagyDonat's response is spot on: we tend to err on the side of fewer, higher quality, preferably actionable bug reports, even at the cost of a few false negatives. In fact, in CodeChecker, which runs the static analyzer, we tend to struggle a lot more with the bug reports not being uniqued aggressively enough:

https://github.com/Ericsson/codechecker/blob/master/docs/analyzer/report_identification.md

Our experience is that for large projects, the number of reports can be in the thousands or tens of thousands, and a decent portion of these are mess. We pretty much have to bring those numbers down to make it a bit more managable.

With that said: an off-by-default (at least to start with) option to relax on the uniqueing will surely be appreciated! What you are describing is a totally legitimate worry, just happens not to be the problem Donát and I are struggling with.

shenjunjiekoda commented 2 months ago

Thank you both very much for your thoughtful responses. Now I have a better understanding of the motivation behind the current deduplication strategy. Indeed, an excessive number of reports can be quite overwhelming.

I would like to experiment by submitting an option that defaults to using the current Profile for deduplication, but when enabled, switches to using the FullProfile mechanism. This might help reduce some missed reports, particularly in projects where the volume of reports is not excessively large.

llvm / llvm-project

[clang][analyzer] Potential Improvement for deduplication strategy in Clang StaticAnalyzer and Clang-Tidy #106199