My preferences:
- Question 1: Where should the test data be stored?
- Question 2: How do we pre-parse the long tests?
- Question 3: What do we do with the old file tests?
For Question 1: we can put this data in a separate JSON file in the test directory, which we load when we run the tests.
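A minimal sketch of how that could look, assuming a hypothetical `tests/test_data.json` and a hypothetical `get_cves()` helper standing in for however the scanner exposes its results:

```python
import json
from pathlib import Path

import pytest

# Hypothetical file: tests/test_data.json, one entry per checker.
TEST_DATA = json.loads((Path(__file__).parent / "test_data.json").read_text())


@pytest.mark.parametrize("entry", TEST_DATA, ids=lambda e: e["package"])
def test_version_mapping(entry):
    # get_cves() is a hypothetical stand-in for the scanner's lookup.
    cves = get_cves(entry["package"], entry["version"])
    for cve in entry["are_in"]:
        assert cve in cves
    for cve in entry["not_in"]:
        assert cve not in cves
```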
@Niraj-Kamdar that sounds like a good solution
For Question 1: even better, we can use a CSV file to store the info. Contributors can open it in Excel, and it would be easy to read and write.
Hm, that's an interesting thought. Previously, we've kind of assumed that people contributing checkers are fairly code-savvy (because they had to be), but with the new setup that's probably not as true. I worry that CSV isn't the greatest solution for multi-line data, though.
That's got me thinking, though: if we're assuming a lower barrier to entry for checker writing, maybe we should start with the checker data in pythonic arrays rather than JSON, so that it gets covered by the Black formatting. A lot of our problem in the test cases right now is that a huge number of beginner commits meant we weren't as careful about reviewing the alphabetization. The autoformatter might be especially valuable here. I'm sure there's an equivalent autoformatter for JSON we could use (and probably should eventually), but maybe we should start with what we have.
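For example, the mapping data could be plain Python that Black keeps tidy (the `VERSION_MAPPINGS` name is just for illustration; the CVE data is from the cups example later in this thread):

```python
# Black enforces the wrapping and indentation; reviewers only need to
# check that entries stay alphabetized.
VERSION_MAPPINGS = {
    "cups": {
        "1.2.4": {
            "are_in": ["CVE-2007-5849", "CVE-2007-7892"],
            "not_in": ["CVE-2004-8272", "CVE-2005-0206", "CVE-2005-0990"],
        },
    },
}
```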
Another thought re: Question 1. How do people feel about having doctests?
https://docs.python.org/3/library/doctest.html
I feel like at least for filenames and version checking it might be really helpful. The mapping tests would probably be too unwieldy, since the CVE mapping happens elsewhere.
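For instance, a checker's version-extraction helper could carry its own quick checks as doctests (a sketch; `guess_version` and its regex are hypothetical, not the actual checker API):

```python
import re


def guess_version(string):
    """Extract a version from a matched signature string.

    >>> guess_version("This is Best Library 1.1.1d")
    '1.1.1d'
    >>> guess_version("no version here")
    ''
    """
    match = re.search(r"\d+\.\d+\.\d+[a-z]?", string)
    return match.group(0) if match else ""


if __name__ == "__main__":
    import doctest

    doctest.testmod()
```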
I didn't really understand your concerns about CSV, but if you are worried about the in/not_in arrays, we can flatten those like the following. I know it won't look great if we edit it as a text file, but in Excel it will be more human readable, and we can also leverage Excel to sort the CSV file for us.
```csv
package, version, are_in, not_in
cups, 1.2.4, CVE-2007-5849, CVE-2005-0206
cups, 1.2.4, CVE-2007-7892, CVE-2005-0990
cups, 1.2.4, , CVE-2004-8272
```
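If we did go that route, consuming the flattened file would be straightforward (a sketch, assuming the columns above):

```python
import csv


def load_mapping_rows(path):
    """Yield (package, version, are_in, not_in) rows from the flattened CSV."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            # Blank cells (like the empty are_in in the last cups row) are skipped,
            # so each row contributes at most one CVE per column.
            are_in = [row["are_in"]] if row["are_in"] else []
            not_in = [row["not_in"]] if row["not_in"] else []
            yield row["package"], row["version"], are_in, not_in
```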
heh. You clearly have not been stuck in enterprise America if you haven't seen people screw up spreadsheets. Flattening would help, but... I don't think we actually have any particular need to support anything other than straight Python code, and we'll get a slight performance improvement if we don't have to keep parsing data. Let's just stick with keeping the tests directly in some form of Python code unless there's a compelling reason to do additional parsing.
Update:
Pre-parsing thoughts:
I took a quick look at our signatures, and right now the shortest ones are around 10 characters. strings reports every string longer than 4 characters; if we up that minimum to 8 or 10, we might be able to use it as a first pass to do unintelligent pre-parsing on the existing tests. I don't know yet whether it'll reduce the size meaningfully; I'm going to run some more tests.
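To illustrate that first pass (a rough sketch, not our actual strings module; the 8-character minimum is the knob being discussed):

```python
import re
from pathlib import Path

MIN_LEN = 8  # strings' default minimum is 4; our shortest signatures are ~10


def candidate_strings(path, min_len=MIN_LEN):
    """Rough equivalent of `strings -n 8`: printable-ASCII runs >= min_len."""
    data = Path(path).read_bytes()
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]
```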
I think we should not store the parsed strings as one Python list, because it would be very long and a package contains many files, so we would lose the filename information. I propose we download a package if it isn't parsed yet and extract it using the extractor. Then we parse every file of the extracted package with our strings module, save the output under the same name, compress the whole directory, and store it in our repo. Here's the UML for it.
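Roughly, that pipeline in Python (`download()` and `extract()` are hypothetical stand-ins for the existing downloader and Extractor, and `candidate_strings()` is the sketch from earlier in the thread):

```python
import shutil
from pathlib import Path


def preparse_package(url, out_dir):
    """Download, extract, run strings on each file, then compress the result.

    Keeping one output file per input file preserves the filename
    information that a single flat list of strings would lose.
    """
    extracted = extract(download(url))  # hypothetical helpers
    parsed_root = Path(out_dir)
    for path in Path(extracted).rglob("*"):
        if path.is_file():
            dest = parsed_root / path.relative_to(extracted)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text("\n".join(candidate_strings(path)))
    # One compressed artifact per package, stored in the repo.
    shutil.make_archive(str(parsed_root), "gztar", root_dir=parsed_root)
```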
Advantages of the above system:
Current status:
I believe the remaining issues discussed here were fixed in #1036. If anything wasn't, please feel free to open a new issue.
With the new checker setup, we have an opportunity to modernize our tests. This was previously discussed in #638, but this issue is to summarize where we're at now:
Current state:
- test_filename_is in test_checkers.py
- test_files in test_scanner.py
- test_binaries in test_scanner.py
Problems:
- The test_filename_is tests mostly don't exist, since this was reserved as an "easy first commit" for new contributors. (Added by @Niraj-Kamdar when the checkers were updated.)
- The test_files/test_binaries tests are a huge, disorganized parametrize array right now. (Fixed by @SaurabhK122 in #675.)

Solution high-level ideas:
Questions:
Question 1: Where should the test data be stored?
Option 1: Test data lives in the test files.
e.g.

```python
def test_bestlibrary():
    valid_filenames = ["libbest4.3.2.so", "best"]
    valid_strings = ["This is Best Library 1.1.1d"]
    valid_mappings = {
        "version": "1.2.3",
        "should_have": ["CVE-123-1234", "CVE-123-1235"],
        "should_not_have": ["CVE-123-1555"],
    }
    test_filenames("best", valid_filenames)
    test_strings("best", valid_strings)
    test_mappings("best", valid_mappings)
```
Option 2: Test data lives with the checker.
```python
valid_filenames = ["libbest4.3.2.so", "best"]  # trigger filename tests
valid_strings = ["This is Best Library 1.1.1d"]  # trigger mapping tests
```
Option 3: Hybrid. Store some basic stuff (like valid filenames and a single test string) in the checker; leave longer stuff to the test suite.
Question 2: How do we pre-parse the long tests?
Question 3: What do we do with the old file tests?