Coming up with and implementing a proper structure for parsers

XDRAGON2002 commented 2 years ago

While going through #1526 I was thinking about how this could lead to support for multiple parsers, but as the current structure stands, it is a bit too random (not inefficient, just feels scattered). The parsers that are currently supported, all take a similar input (a file format that needs to be parsed) and output similarly as well (product info, triage data) that is then passed to the cve_scanner in cli.py.

The parsers act as black boxes with respect to each other and other components, but these sets of black boxes also have common input format, output format and most importantly the same thought process (implementation could differ a bit). Hence following the Object Orientated Paradigm I feel we can structure this based on overriding the class methods.

This would look something like having a parsers directory and to add a new parser to the tool, one could simply add a new file, inherit from the parent parser class, override the class with appropriate parameters and that's about it! This is very (if not completely) similar to the system currently in use for adding checkers (add file to checkers directory, inherit from parent checker class, override the data members and you're good to go).

It would lead to a general system for adding new parsers and support for new package file formats, all the while abstracting away the implementation details for the black box approach even more and also reduce complexity of code (which would allow even more people to work on these parsers, again similar to checkers), also would allow a better method to write tests for these modules.

We could also look into creating a wrapper around all these parsers after we restructure appropriately (something like a parse() function) that simply takes in the input file, and passes it to the appropriate parser depending on file properties (requirements.txt,*.json,package.json etc) this would improve abstraction further and also cleanup the codebase a bit (currently parsers are being supplied their inputs directly from cli.py).

This thread is for just an overview of the idea, would love to figure out the intricacies of this over time with proper discussion, approval and help from the community.

terriko commented 2 years ago

I like the line of thinking here, but there's huge chasms of bad ideas that happen when you talk about building an abstraction layer. So I'm glad you opened a discussion thread for this!

Some thoughts off the top of my head:

We honestly don't want to be in the business of writing parsers at all if we don't have to. We already try to use external well tested libraries for xml, json, and hopefully will be doing so for sbom as more mature tools become available. I can see writing our own parsers to conform to a bit of an API, but will it be a pain if we have to maintain wrapper layers for external libraries to conform to that same API?
What exactly are you talking about as parsers in this context? We probably need a name that reflects what we're parsing, because "parser" could mean just about anything. Are we talking component list parsers only? Are we talking about things like rpms and gzip files and jar files as well?
We already have abstraction layers for checkers, input and output. You're mostly talking about stuff that falls into the input category, I think, which handles the .csv and triage files. Are those fundamentally different than parsing requirements.txt? I think probably not -- it would be nice for the user if cve-bin-tool -i requirements.txt worked just as well as cve-bin-tool -i dependencies.csv. Should we be glomming the SBOMs in as well so it also works through -i?
Do we think it's likely that people will want to specify multiple component lists and have them in the same report? How should that work? (right now I think if you generate each into a .csv report or somesuch you can then merge them)

XDRAGON2002 commented 2 years ago

By "parsers" here I am referring to reading package data (as referenced in #1526).

I completely agree with your third point, having a simple input flag that handles the file type on its own (especially for SBOMs) would be great for usability, as the tool is designed to be a swift check (not a pure in depth analysis) abstracting the input flag formats would be a great feature addition.

As to how likely it is that people would want to specify multiple components and have them merged? I suppose that would differ from person to person and more importantly from use-case to use-case. If need be, the merge functionality could be used to merge the lists together. A discussion with the community regarding this might be extremely fruitful in coming up with how we could improve report generation (or if changes are even required).

Having a default API for parsing package data would be really helpful in my opinion, as even though cve-bin-tool is a quick CVE scan to be integrated with CI/CD workflows, I can easily see this snowball into a general component listing functionality (which is something that is outside the scope of this thread as of now) if we continue adding support for more package formats/handlers. Regarding how much work it would go into creating endpoints for external libraries, I don't think it should be that difficult, but would definitely be an addition worthy of a one time commitment. Moreover as the project continues to grow and its scale increases, at that point it would be much more tedious to restructure the codebase if need be, hence the suggestion to work on this now and make it future proof (I know things could completely change in the future, but as of now I feel this would be a good decision to work on), hence the idea of this modal structuring. The pros outweigh the cons here in my opinion.

Also as one of the GSOC projects this year deals with extending support to new package managers (again #1526), I believe this restructure could very well go hand in hand with that project idea. I believe this should be doable within the timeframe of a 175hr project as an add-on to the original idea, unless I'm being a bit too ambitious in that regard.

Looking for thoughts and feedback from the community on this.

terriko commented 2 years ago

Multiple component lists: I'd guess this is reasonably common. Even cve-bin-tool itself has both python and javascript package lists. Python folk often also have C components for performance reasons. I've seen a number of hybrid java and C projects as well.

Merging results from multiple component lists: It depends on the type of audit system people use for this kind of thing. The tools I use internally at Intel accept files to save for audit, but I'm not familiar with all the FIPS/Realtime/ISO other certification tools and the like used by other folk who care about keeping evidence of scans. I suspect the US government initiative on security (the same one that's pushed SBOMs to the forefront of many people's minds) may start a rise of new/updated tracking tools. too .

Currently, we support SBOMs and other component lists which could be from combined sources, and we have a merge ability, so I think we cover things ok. But we probably want to have it in the back of our minds as something that might get requested in the future. Which honestly supports your assertion that we should have a common API: if every thing generated the same format of data, merging them would like just be a simple python operation to combine data structures.

terriko commented 2 years ago

As for GSoC: yes, this would be a good addition to #1526. I'd guess it's doable within a 175hr stint for someone who knows the data structures and has a plan before coding starts because there's already a few component list parsing bits to examine for commonalities. I'd prefer to see it in a 350hr project so that you could also work on more new data sources afterwards (and iterate on the API if needed for new sources), but I don't think it's absolutely required.

XDRAGON2002 commented 2 years ago

As for GSoC: yes, this would be a good addition to #1526. I'd guess it's doable within a 175hr stint for someone who knows the data structures and has a plan before coding starts because there's already a few component list parsing bits to examine for commonalities. I'd prefer to see it in a 350hr project so that you could also work on more new data sources afterwards (and iterate on the API if needed for new sources), but I don't think it's absolutely required.

Maybe working on it as a 350hr project might actually be better than my earlier suggestion of a 175hr project, I guess I'll have to look into getting a 350hr project timeline extension from my college (if my proposal even gets considered or accepted), but as of now I am planning to contain it within a 175hr timeframe (as I have a fair bit of experience with python-pip, js-npm, rust-cargo, go-(earlier there was no official package management but now "go mod" is starting to become the default)).

Also what other factors might need to be discussed upon before I try to think of a barebones abstraction for the API that we could iterate over for improvements part by part? Or should I even be trying to think of an implementation as of now (as this is supposed to be a GSOC project)?

Would love to hear opinions from the community regarding this.

XDRAGON2002 commented 2 years ago

I believe that after the refactor of package parsers, #1265 and #1090 would be freshly added and the other package parsers (python, java, js) would have to be updated to conform to the format decided.

My query was pertaining to package lists (existing ones and #1271) and also to SBOMs, should they all be covered under the same refactored API? Or let them be as is? Or group languages together and package lists together? Or maybe even include the input flag (csv,json) under the same API?

Looking for suggestions from other contributors as this distinction will be crucial in planning the implementational structure of the project.

XDRAGON2002 commented 1 year ago

@terriko I guess this can be closed now too?

terriko commented 1 year ago

Yes, thanks for pointing it out!

intel / cve-bin-tool

Coming up with and implementing a proper structure for parsers #1537