nexB / scancode-analyzer

scancode-results-analyzer

Scan analysis #20

Closed pombredanne closed 3 years ago

pombredanne commented 3 years ago

I would like to have an API that behaves this way:

  1. header level: an entry recording that this tool has been run
  2. file level: some data (format still to be designed) that tells whether there is a license detection issue and what it is
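
A minimal sketch of what these two levels could look like on a scancode-style JSON result. The tool entry name `scancode-analyzer`, the per-file attribute `license_detection_issue`, and the issue label `imperfect-match-coverage` are hypothetical placeholders, not an agreed format:

```python
import json

# Hypothetical names throughout: the "scancode-analyzer" header entry and the
# "license_detection_issue" per-file attribute are illustrative assumptions.

def annotate_scan(scan, issues_by_path):
    """Return a copy of `scan` with a header entry and a per-file issue field."""
    result = json.loads(json.dumps(scan))  # cheap deep copy of JSON-like data
    # 1. Header level: record that the analysis tool has been run.
    result.setdefault("headers", []).append({"tool_name": "scancode-analyzer"})
    # 2. File level: say whether there is an issue and, if so, what it is.
    for file_entry in result.get("files", []):
        file_entry["license_detection_issue"] = issues_by_path.get(file_entry["path"])
    return result

scan = {
    "headers": [{"tool_name": "scancode-toolkit"}],
    "files": [{"path": "src/a.c"}, {"path": "src/b.c"}],
}
annotated = annotate_scan(scan, {"src/b.c": "imperfect-match-coverage"})
print(json.dumps(annotated, indent=2))
```

Files without a detected problem get `None` (JSON `null`) here, so every file carries the attribute whether or not it has an issue.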

pombredanne commented 3 years ago

In hindsight I think we would still want to have this as a scancode post-scan plugin, with the caveat that this would be a Linux-only plugin and that it would not be installed by default. This would make the integration in scancode.io much simpler too; see https://github.com/nexB/scancode.io/issues/30

AyanSinhaMahapatra commented 3 years ago

@pombredanne I think it is easier to integrate this with scancode.io directly than as a scancode post-scan plugin, simply because of the number of dependencies this would have. Even with a bare-bones approach without the NLP models, it would be a lot of dependencies. The integration itself would be easier if this were a post-scan plugin, but managing it as a post-scan plugin would be harder, since I assume post-scan plugin dependencies are not handled separately?

AyanSinhaMahapatra commented 3 years ago

Secondly, there are two options for the result format. I'll show them in detail with example JSON file(s) shortly, but there are some issues with this.

The scancode JSON format has a list of files, with one dictionary per file holding the relevant attributes and their corresponding values. Since the detections are mostly correct, most files won't have a problem, so adding the analysis results to each file would mostly be extra information per file (and how repetition is handled as of now would have to change). So I was thinking of a more consolidated approach, like the data from the summary plugins: a separate list with all the problems, one dictionary per unique problem, plus some more summary stats and header additions. These would still be "file-level", but they would not sit alongside the scancode file information; they would be kept separately. IMHO that makes more sense, because there is a JSON->DataFrame conversion at the start, and converting back and adding the scan analysis information to each file would require lookups. Anyway, we should have a discussion once the output format is ready.
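
To make the consolidated option concrete, here is a rough sketch that groups per-file findings into one dictionary per unique problem plus a few summary stats. The key names (`issue_type`, `license_detection_issues`, `issue_summary`) and the issue labels are hypothetical, since the output format has not been decided:

```python
from collections import defaultdict

# Hypothetical layout: "issue_type", "license_detection_issues" and
# "issue_summary" are illustrative names, not an agreed output format.

def consolidate(per_file_issues):
    """Group (path, issue_type) pairs into one dictionary per unique problem."""
    paths_by_issue = defaultdict(list)
    for path, issue_type in per_file_issues:
        paths_by_issue[issue_type].append(path)
    # One dictionary per unique problem, listing the files it affects.
    problems = [
        {"issue_type": issue_type, "files": sorted(paths)}
        for issue_type, paths in sorted(paths_by_issue.items())
    ]
    summary = {
        "unique_issues": len(problems),
        "files_with_issues": len({path for path, _ in per_file_issues}),
    }
    return {"license_detection_issues": problems, "issue_summary": summary}

found = [
    ("src/a.c", "imperfect-match-coverage"),
    ("src/b.c", "imperfect-match-coverage"),
    ("docs/NOTICE", "unknown-license-reference"),
]
report = consolidate(found)
print(report)
```

Only files with a problem appear in this structure, so the scancode `files` list stays untouched and no per-file lookup is needed to attach the results.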