intel / cve-bin-tool

The CVE Binary Tool helps you determine if your system includes known vulnerabilities. You can scan binaries for over 200 common, vulnerable components (openssl, libpng, libxml2, expat and others), or if you know the components used, you can get a list of known vulnerabilities associated with an SBOM or a list of components and versions.
https://cve-bin-tool.readthedocs.io/en/latest/
GNU General Public License v3.0

Test improvement discussion thread #665

Closed terriko closed 3 years ago

terriko commented 4 years ago

With the new checker setup, we have an opportunity to modernize our tests. This was previously discussed in #638, but this issue is to summarize where we're at now:

Current state:

Problems:

Solution high level ideas:

Questions:

terriko commented 4 years ago

My preferences: Question 1: Where should the test data be stored?

Question 2: How do we pre-parse the long tests?

Question 3: What do we do with the old file tests?

Niraj-Kamdar commented 4 years ago

For Question 1: we can put this data in a separate JSON file in the test directory, which we will load when we run the tests.
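For illustration, a minimal sketch of that approach — the file name and schema here are hypothetical, not cve-bin-tool's actual layout:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema: one entry per (package, version), listing the CVEs
# expected to be reported ("are_in") and those that must not be ("not_in").
sample = [
    {
        "package": "cups",
        "version": "1.2.4",
        "are_in": ["CVE-2007-5849", "CVE-2007-7892"],
        "not_in": ["CVE-2005-0206", "CVE-2005-0990"],
    },
]

def load_checker_data(path):
    """Load checker test cases from a JSON file in the test directory."""
    with open(path) as f:
        return json.load(f)

# Round-trip through a temporary file, as a test suite might do at startup.
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "checker_test_data.json"
    path.write_text(json.dumps(sample))
    cases = load_checker_data(path)
```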

terriko commented 4 years ago

@Niraj-Kamdar that sounds like a good solution

Niraj-Kamdar commented 4 years ago

For Question 1: even better, we can use a CSV file to store the info. Contributors can open it in Excel, and it would be easy to read and write.

terriko commented 4 years ago

Hm, that's an interesting thought. Previously, we've kind of assumed that people contributing checkers are fairly code-savvy (because they had to be), but with the new setup that's probably not as true. I worry that CSV isn't the greatest solution for multi-line data, though.

That's got me thinking, though: if we're assuming a lower barrier to entry for checker writing, maybe we should start with the checker data in pythonic arrays rather than JSON so that it gets covered by the Black formatting. A lot of our problem in the test cases right now is that a huge number of beginner commits meant we weren't as careful about reviewing the alphabetization. The autoformatter might be especially valuable here. I'm sure there's an equivalent autoformatter for JSON we could use (and probably should, eventually), but maybe we should start with what we have.
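As a sketch of that idea (module and names are mine, not the project's): keeping the data as plain Python literals means Black normalizes the layout, and ordering can be checked mechanically instead of during code review.

```python
# Checker test data as plain Python literals: Black will format this,
# and a trivial test can enforce alphabetization.
VERSION_TEST_CASES = [
    ("curl", "7.34.0"),
    ("expat", "2.0.1"),
    ("libpng", "1.6.26"),
    ("openssl", "1.0.2k"),
]

def is_alphabetized(cases):
    """Return True if the case list is sorted by package name."""
    names = [name for name, _ in cases]
    return names == sorted(names)

assert is_alphabetized(VERSION_TEST_CASES)
```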

terriko commented 4 years ago

Another thought re: Question 1. How do people feel about having doctests?

https://docs.python.org/3/library/doctest.html

I feel like at least for filenames and version checking it might be really helpful. The mapping tests would probably be too unwieldy, since the CVE mapping happens elsewhere.
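For a sense of what that could look like: a hypothetical filename-to-version helper (not an actual cve-bin-tool function) whose examples double as tests via doctest.

```python
import doctest
import re

def guess_version(filename):
    """Extract a dotted version string from a package filename.

    A hypothetical helper; real checkers use per-package patterns.

    >>> guess_version("openssl-1.0.2k.tar.gz")
    '1.0.2k'
    >>> guess_version("no-version-here") is None
    True
    """
    match = re.search(r"-(\d+(?:\.\d+)+[a-z]?)", filename)
    return match.group(1) if match else None

# doctest.testmod() runs every example embedded in the docstrings above.
results = doctest.testmod()
assert results.failed == 0
```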

Niraj-Kamdar commented 4 years ago

I didn't really understand your concerns about CSV, but if you are worried about the are_in/not_in arrays, we can flatten those like the following. I know it won't look great if we edit it as a text file, but in Excel it will look more human-readable, and we can also leverage Excel to sort the CSV file for us.

package, version, are_in, not_in
cups, 1.2.4, CVE-2007-5849, CVE-2005-0206
cups, 1.2.4, CVE-2007-7892, CVE-2005-0990
cups, 1.2.4, , CVE-2004-8272
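For reference, the flattened layout above can be regrouped with Python's csv module (the `unflatten` helper is mine, purely illustrative) — blank cells simply mean no entry in that column for the row:

```python
import csv
import io

# The flattened CSV layout proposed above.
flat = """\
package,version,are_in,not_in
cups,1.2.4,CVE-2007-5849,CVE-2005-0206
cups,1.2.4,CVE-2007-7892,CVE-2005-0990
cups,1.2.4,,CVE-2004-8272
"""

def unflatten(text):
    """Regroup flattened rows into per-(package, version) CVE lists."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(text)):
        entry = grouped.setdefault(
            (row["package"], row["version"]), {"are_in": [], "not_in": []}
        )
        for col in ("are_in", "not_in"):
            if row[col]:  # skip blank cells
                entry[col].append(row[col])
    return grouped

cases = unflatten(flat)
```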
terriko commented 4 years ago

Heh. You clearly haven't been stuck in enterprise America if you haven't seen people screw up spreadsheets. Flattening would help, but... I don't think we actually have any particular need to support anything other than straight Python code, and we'll get a slight performance improvement if we don't have to keep parsing data. Let's just stick with keeping the tests directly in some form of Python code unless there's a compelling reason to do additional parsing.

terriko commented 4 years ago

Update:

terriko commented 4 years ago

Pre-parsing thoughts:

I took a quick look at our signatures, and right now the shortest ones are around 10 characters. strings reports every string longer than 4 characters; if we up that to 8 or 10, we might be able to use it as a first pass to do unintelligent pre-parsing on the existing tests. I don't know if it'll reduce the size meaningfully enough yet; I'm going to run some more tests.
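The threshold idea can be sketched in pure Python as a rough stand-in for `strings -n N` (the blob and helper below are illustrative, not the tool's own strings module):

```python
import re

def extract_strings(data, min_len=10):
    """Rough stand-in for `strings -n min_len`: printable ASCII runs."""
    return [
        m.decode("ascii")
        for m in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    ]

# A tiny fake binary: raising min_len drops short, noisy matches while
# keeping signature-length strings like version banners.
blob = b"\x00abcd\x00GCC: (GNU) 9.3.0\x00OpenSSL 1.0.2k  26 Jan 2017\x00"
short = extract_strings(blob, min_len=4)
long_only = extract_strings(blob, min_len=10)
```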

Niraj-Kamdar commented 4 years ago

I think we should not store parsed strings as a Python list, because it would create a very long list and a package contains many files, so we would lose filename information. I propose we download a package if it isn't already parsed, extract it using the extractor, then parse every file of the extracted package with our strings module, save each result under the same filename, compress the whole directory, and store it in our repo. Here's the UML for it: [UML diagram: long_test_parser_uml]
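A hedged sketch of the parse-and-compress half of that pipeline — function names and layout are mine, not cve-bin-tool's API; the point is that mirroring filenames into the output tree preserves the per-file information:

```python
import tarfile
import tempfile
from pathlib import Path

def parse_package(extracted_dir, out_dir, parse_file):
    """Run parse_file on each file, mirroring filenames into out_dir."""
    extracted_dir, out_dir = Path(extracted_dir), Path(out_dir)
    for path in extracted_dir.rglob("*"):
        if path.is_file():
            dest = out_dir / path.relative_to(extracted_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(parse_file(path.read_bytes()))

def compress_results(out_dir, archive_path):
    """Bundle the parsed directory into a .tar.gz for storage in the repo."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(out_dir, arcname=Path(out_dir).name)

# Demo with a fake "extracted package" and a trivial strings pass.
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "pkg" / "lib"
    src.mkdir(parents=True)
    (src / "libfoo.so").write_bytes(b"\x00version 1.2.3\x00")
    out = Path(d) / "pkg-parsed"
    parse_package(Path(d) / "pkg", out,
                  lambda b: b.decode("latin-1").strip("\x00"))
    compress_results(out, Path(d) / "pkg-parsed.tar.gz")
    demo_ok = (
        (out / "lib" / "libfoo.so").read_text() == "version 1.2.3"
        and (Path(d) / "pkg-parsed.tar.gz").exists()
    )
```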

Advantages of the above system:

terriko commented 4 years ago

Current status:

terriko commented 3 years ago

I believe the remaining issues discussed here were fixed in #1036 . If anything wasn't, please feel free to open a new issue.