FossID API for matched line needs to be improved.

nnobelis commented 1 year ago

This ticket is following a discussion at https://github.com/oss-review-toolkit/ort/pull/7022#discussion_r1200165856.

When listing snippets through the FossID API, FossID returns each snippet with a payload like the following (in the highlighting property):

{
  "id": "410668b1f35f8b27ff9ce345998448b6",
  "local_coverage": 0.9754,
  "local_highlight": {
    "blocks": [
      {
        "byte_range": {
          "begin": 0,
          "end": 712
        },
        "char_range": {
          "begin": 0,
          "end": 712
        },
        "id": "abdc2b929a1b84f24155c27b752944ab"
      },
      {
        "byte_range": {
          "begin": 1395,
          "end": 28013
        },
        "char_range": {
          "begin": 1395,
          "end": 28013
        },
        "id": "a11cb7ad7af8d915193131a92d514ed7"
      }
    ],
    "encoding": "UTF-8",
    "id": "3673e848c2d349e2f054691c952b3f2f",
    "pfm_format": 2
  },
  "local_size": 475,
  "remote_coverage": 1,
  "remote_highlight": {
    "blocks": [
      {
        "byte_range": {
          "begin": 0,
          "end": 27333
        },
        "char_range": {
          "begin": 0,
          "end": 27333
        },
        "id": "871d76314c0e746c1b33d63e6c05a909"
      }
    ],
    "encoding": "UTF-8",
    "id": "410668b1f35f8b27ff9ce345998448b6",
    "pfm_format": 2
  },
  "remote_size": 475
}

This should make it possible to get the matched lines between the source file and the snippet. Unfortunately, the payload only contains byte and character ranges, not line ranges. To get the matched lines, one has to call files_and_folders/get_matched_lines with the source file name and the snippet id; FossID then returns the equivalent matched lines.
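To illustrate the gap between character ranges and line ranges: if the content of the local file is at hand, a character range could in principle be mapped to line numbers locally. The following is only a minimal sketch, not what FossID or ORT does. It assumes that `end` is exclusive (the payload does not say), the helper names are made up for illustration, and it cannot help for the remote side, whose file content is not part of the payload.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line of `text` starts."""
    starts = [0]
    for index, char in enumerate(text):
        if char == "\n":
            starts.append(index + 1)
    return starts


def char_range_to_lines(text, begin, end):
    """Map a character range to 1-based, inclusive line numbers.

    Assumption: `end` is exclusive; the FossID payload does not document this.
    """
    starts = line_starts(text)
    # The number of line starts at or before an offset is its 1-based line number.
    first = bisect_right(starts, begin)
    last = bisect_right(starts, max(begin, end - 1))
    return first, last


sample = "alpha\nbeta\ngamma\n"
print(char_range_to_lines(sample, 0, 5))   # range covering "alpha"
print(char_range_to_lines(sample, 6, 14))  # range from "beta" into "gamma"
```

There is no guarantee that the result agrees with what files_and_folders/get_matched_lines returns, so this does not replace the API call.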

Indeed, the FossID API is designed in such a way that getting the matched lines of a snippet requires a separate query to the API server.

Therefore the workflow is:

1. For a given scan code, list the pending files: scans/get_pending_files.
2. For each pending file, list the snippets: files_and_folders/get_fossid_results.
3. For each snippet with a partial match, list the matched lines: files_and_folders/get_matched_lines.
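The workflow above can be sketched roughly as follows, to show where the request count comes from. This is not the real FossID client: `StubFossIdClient` is a hypothetical in-memory stand-in whose methods are merely named after the three API functions, and the `partial` field is an assumption, as the actual response shape is not shown here.

```python
class StubFossIdClient:
    """Hypothetical in-memory stand-in for the FossID API that counts requests."""

    def __init__(self, snippets_by_file):
        self.snippets_by_file = snippets_by_file
        self.request_count = 0

    def get_pending_files(self, scan_code):
        self.request_count += 1
        return list(self.snippets_by_file)

    def get_fossid_results(self, scan_code, path):
        self.request_count += 1
        return self.snippets_by_file[path]

    def get_matched_lines(self, scan_code, path, snippet_id):
        self.request_count += 1
        return {"path": path, "snippet_id": snippet_id}


def fetch_snippet_data(client, scan_code):
    """One request for the file list, one per file, one per partial snippet."""
    matched = []
    for path in client.get_pending_files(scan_code):
        for snippet in client.get_fossid_results(scan_code, path):
            if snippet["partial"]:  # assumed flag for a partial match
                matched.append(
                    client.get_matched_lines(scan_code, path, snippet["id"])
                )
    return matched


client = StubFossIdClient({
    "a.c": [{"id": "s1", "partial": True}, {"id": "s2", "partial": False}],
    "b.c": [{"id": "s3", "partial": True}],
})
result = fetch_snippet_data(client, "demo-scan")
print(len(result), client.request_count)
```

Even in this toy case with two files and two partial matches, five requests are needed; the count grows with every pending file and every partially matched snippet.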

This is far too many requests, as we have scans with 2000 pending files! For such scans, we need more than 10000 requests to fetch all snippet data (snippets plus matching lines).

FossID should provide an API to batch these operations. For instance:

List all snippets for all pending files or for a list of files.
List all matched lines for all snippets with partial match, or for a list of snippet ids.
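As a back-of-the-envelope comparison, the sketch below counts requests for the current per-item workflow and for a batched variant. The batched endpoints do not exist today, the batch size of 500 is an arbitrary assumption, and 8000 partially matched snippets is an illustrative figure chosen only so that the total exceeds the 10000 requests mentioned above.

```python
import math


def per_item_requests(pending_files, partial_snippets):
    """Current workflow: 1 file listing + 1 per file + 1 per partial snippet."""
    return 1 + pending_files + partial_snippets


def batched_requests(pending_files, partial_snippets, batch_size):
    """Hypothetical batched workflow: files and snippet ids are sent in chunks."""
    return (
        1
        + math.ceil(pending_files / batch_size)
        + math.ceil(partial_snippets / batch_size)
    )


# Illustrative numbers only; 8000 and 500 are assumptions, not measured values.
print(per_item_requests(2000, 8000))
print(batched_requests(2000, 8000, batch_size=500))
```

Under these assumptions, the per-item workflow needs 10001 requests while the batched one needs 21.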

Note: the proprietary fossid-cli tool seems to perform better here, with the --sensitivity option. Is there an unofficial API that does what we want?

For what it's worth, the ticket FOSSIDSC-3099 has been opened at FossID support (access requires account).

nnobelis commented 1 year ago

We received an answer from FossID:

We are grateful for this input, it has become a roadmap candidate for our roadmap planning.

sschuberth commented 2 weeks ago

@nnobelis are you aware of any updates on the FOSSID side?

nnobelis commented 2 weeks ago

No, none at all :(

FossID API for matched line needs to be improved. #7028