Snippet Model cannot be mapped from FossID multiline range matching

nnobelis commented 1 year ago

Currently, for performance reasons (see https://github.com/oss-review-toolkit/ort/issues/7028), the matched lines are not fetched from FossID.

However it seems the current snippet model is not capable of representing certain FossID responses.

For instance, for a given snippet match in FossID, I receive the following answer for the listMatchedLines function (truncated for clarity):

"data": {
        "local_file": {
            "1": "1",
            "2": "2",
            "3": "3",
            "4": "4",
            "5": "5",
            "6": "6",
            "7": "7",
[...]
            "19": "19",
            "20": "20",
            "21": "21",
            "22": "22",
            "23": "23",
            "24": "24",
            "45": "45",
            "46": "46",
            "47": "47",
            "48": "48",
            "49": "49",
            "50": "50",
            "51": "51",
[...]
            "86": "86",
            "87": "87",
            "88": "88",
            "89": "89",
            "90": "90",
            "91": "91",
[...]
            "673": "673",
            "674": "674",
            "675": "675"

local_file means this is the lines of the source file being matched. If one "compress" these lines into line ranges, the result is: 1-24 and 45-675.

Now this information is supposed to go in SnippetFinding.sourceLocation : https://github.com/oss-review-toolkit/ort/blob/6f6e91759730ec20dc172b60612d5d76ab35c232/model/src/main/kotlin/SnippetFinding.kt#L31

Unfortunately, TextLocation in ORT can carry only two integers for the line information.

So what can be done ?

Split the snippet match in two SnippetFinding, to represent the two ranges ? This could get messy for more complicated matches.
Replace TextLocation for the snippet finding by TextRangeLocation, a new class that can specify multiple range of lines ?
Only map the first range to the TextLocation and ignore the rest (or store it in another property)?

As a side note, ScanOSS seems to deliver a single range:

    "lines": "1-710",
    "oss_lines": "1-710",

but this is only an assumption, as I couldn't reproduce more complex cases of matching.

sschuberth commented 1 year ago

Theoretically, another option would be to make sourceLocation a List<TextLocation> (where by convention the paths all must be the same; but I don't like this potential source of inconsistencies too much myself).

Split the snippet match in two SnippetFinding, to represent the two ranges ? This could get messy for more complicated matches.

This would have been my favorite. Can you explain how it could get messy? Isn't it pretty much straight forward to create one SnippetFinding for each consecutive range of lines?

sschuberth commented 1 year ago

As a semi-related question, do you know what they keys and values in local_file are? They seem to be lines numbers, but of what? Local vs matched file? Also, if these are always line numbers, i.e. integers, can we adapt our MatchedLines model accordingly?

nnobelis commented 1 year ago

This would have been my favorite. Can you explain how it could get messy? Isn't it pretty much straight forward to create one SnippetFinding for each consecutive range of lines?

You remember, we agreed to do the same in case multiple purls are delivered by the scanner. Therefore, if you get multiple purls AND multiple line ranges, you would have to do some kind of scalar product and create n*m snippet findings. However, thinking about it, I think I was just overly cautious: multiple purls are only in ScanOSS results while multiple line ranges seems to be only in FossID. I think I will go with this solution. This is the job of the Reporter anyway to aggregate these multiple snippets in a single consolidated view anyway.

As a semi-related question, do you know what they keys and values in local_file are? They seem to be lines numbers, but of what? Local vs matched file? Also, if these are always line numbers, i.e. integers, can we adapt our MatchedLines model accordingly?

Yes keys and values are always the same, this is a ugliness that FossID had in other responses too. They represent the matched lines in the local file and in the mirror file.

I am going to see if I can simplify MatchedLines to :

data class MatchedLines(
    val localFile: List<Int>,
    val mirrorFile: List<Int>
)

nnobelis commented 1 year ago

I will prepare a PR with the MatchedLines changes plus the listing of matched lines in the scanner (flag-controlled because of https://github.com/oss-review-toolkit/ort/issues/7028) and with the duplication of the snippets in case of multiple line ranges as we discussed. Looks good to you ?

sschuberth commented 1 year ago

Yes keys and values are always the same, this is a ugliness that FossID had in other responses too. They represent the matched lines in the local file and in the mirror file.

Well, if the numbers are local vs matched lines, when they probably are very often the same (and may have been the same in all responses you saw so far), but IIUC they could be different if e.g. the local file got a header stripped after being copied from the matched file. Then there would be a constant offset for all the key / value numbers in the map.

I am going to see if I can simplify MatchedLines to :

As described above, I don't believe we can to that. However, we probably can simplify to val localFile: Map<Int, Int>, plus add docs about the keys / values mean in this map.

And maybe we can do the same for mirrorFile. Any idea what that property is about?

nnobelis commented 1 year ago

There was a misunderstanding: localFile contains the local matched lines (=in your source file). It's a map with identical key / values e.g.:

"local_file": {
            "1": "1",
            "2": "2",
            "3": "3",

In mirrorFile, you get the matched lines in the snippet. They too have identical key/values.

sschuberth commented 1 year ago

In mirrorFile, you get the matched lines in the snippet.

How do you then know which line number in the localFile matches which line number in mirrorFile?

nnobelis commented 1 year ago

I made a small vid of the UI.

https://github.com/oss-review-toolkit/ort/assets/66427221/7e2d5e97-6c6c-43ac-b5a7-a29dd3f4f2d8

Notice how the highlighted sections don't "stick" with each other like in normal diff tools ?

If I had to guess, I would say FossID just highlight every matched line in the left and right panel, and that's it.

sschuberth commented 1 year ago

Hmm, so you saying that FossID does not know that line 25 on the left corresponds to line 276 on the right?

How does the view look like initially, before any scrolling? If the initial view is like at the start of your video, it does seem like FossID align these hunks, and that it does know that these lines correspond to each other.

nnobelis commented 1 year ago

Hmm, so you saying that FossID does not know that line 25 on the left corresponds to line 276 on the right?

Exactly.

How does the view look like initially, before any scrolling?

Both panels are scrolled to the first matched (=highlighted) hunk.

If the initial view is like at the start of your video, it does seem like FossID align these hunks, and that it does know that these lines correspond to each other.

Unless the UI has some basic logic to always scroll the panels to the first matched (=highlighted) hunk. The user gets the illusion that the two hunks go together (and they most likely do since they are the firsts in the source file and in the snippet).

In all case, through the API there is no way to know which hunk match which one. Therefore we have a problem here: We could split multiple line ranges in the source file matched lines (=local_file) into multiple snippets, as we discussed. But what if they are also multiple line ranges in the snippet matched lines (=remote_file)? We cannot split it again in multiple snippet because we don't know which line range in the source file match which line range in the snippet 😞 And again snippet matched lines (SnippetFinding->snippet->location) supports only a single line range.

sschuberth commented 1 year ago

Ok, here's a thought: Do we even care about the local location of the snippet in our / the scanned code? Or is it enough to know about the the remote / "mirror file" snippet location in "upstream" code? I believe our current data model is only designed to capture the latter, and that's also how SnippetFinding is documented: "A snippet finding is a code snippet from another origin".

Given that, we'd only need to look at the line ranges in mirrorFile, and disregard localFile completely.

If we'd want to capture localFile, too, then I believe the data model needs to be changed.

nnobelis commented 1 year ago

Ok, here's a thought: Do we even care about the local location of the snippet in our / the scanned code? Or is it enough to know about the the remote / "mirror file" snippet location in "upstream" code? I believe our current data model is only designed to capture the latter, and that's also how SnippetFinding is documented: "A snippet finding is a code snippet from another origin".

You forgot the part after the comma matching the code being scanned 😃 Yes we care about the local location. The user expects to see which lines in his/her code has been matched against the snippet.

If we'd want to capture localFile, too, then I believe the data model needs to be changed.

In which way ? Shall we include an optional part in TextLocation to carry additional ranges or introduce a new TextRangeLocation and use it instead in the SnippetFinding ? For the latter, we could introduce an interface to get the file (=path) out of both TextLocation and TextRangeLocation.

sschuberth commented 1 year ago

If we'd want to capture localFile, too, then I believe the data model needs to be changed.

In which way ?

I was mislead, I forgot that Snippet also has a TextLocation. So ideally, FossID.localFile maps to SnippetFinding.sourceLocation, and FossID.mirrorFile maps to Snippet.location.

But that whole model assumes that we know exactly which source location maps to which upstream location.

I currently have no idea how to handle FossID's overly generic data model here...

nnobelis commented 1 year ago

I was mislead, I forgot that Snippet also has a TextLocation. So ideally, FossID.localFile maps to SnippetFinding.sourceLocation, and FossID.mirrorFile maps to Snippet.location.

FossID.mirrorFile maps to SnippetFinding.snippet.location to be precise.

But that whole model assumes that we know exactly which source location maps to which upstream location.

I currently have no idea how to handle FossID's overly generic data model here...

Unless we change, as suggested in my previous comment, the snippet model to not carry a single location from localFile and mirrorFile matching together, but putting all locations from localFileand mirrorFile in a single SnippetFinding. In this case, we would not have to care which one match with which one.

nnobelis commented 1 year ago

An idea:

Let's say local_file lines are [1-10, 20-30, 40-50] and mirror_file lines are [1-10, 30-40].

We could do the scalar product (6 snippet findings):

SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=30-40

or even try to at least cover all the values (3 snippet findings):

SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10

or by convention iterate all the values with the first one of the other file (5 snippet findings):

SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=1-10 // first one of local_file
SnippetFinding.snippet.location=1-10 
SnippetFinding.sourceLocation=1-10 // first one of local_file
SnippetFinding.snippet.location=30-40

In the end this is not important. This is the job of the Reporter, or whatever is consuming the results, to build the information and collates everything back to [1-10, 20-30, 40-50] and [1-10, 30-40]. Building the intervals with all possible values. Which is the only information that makes sense. A single snippet finding needs to be evaluated in context from the other snippet findings. This is a bit dirty be but this won't require a model change.

Personally, I would still prefer a model change with a new TextRangeLocation: there would be less snippets and they would all make sense individually. But I can live with both solutions.

nnobelis commented 1 year ago

@sschuberth : As a temporary fix, I thought about storing the whole multi line range information in SnippetFinding.snippet.additionalData under two keys: matchedLinesSource and matchedLinesSnippet.

Then, in SnippetFinding.sourceLocation and SnippetFinding.snippet.location we can store the first line range in case of multi line ranges.

In a way, this is fitting because FossID multiline ranges is scanner-specific data. What do you think about it, I stress it, as a temporary solution ?

sschuberth commented 5 days ago

I lost a bit track of this issue, but wanted to note that by now we have https://github.com/oss-review-toolkit/ort/pull/9024. Does that somehow help to resolve this issue, @nnobelis?

nnobelis commented 4 days ago

The "temporary" fix is still active, but only for the snippet matched lines (notice the firstOrNull call):

https://github.com/oss-review-toolkit/ort/blob/a28a50396b1d4d07d120e5b42d155d754188d396/plugins/scanners/fossid/src/main/kotlin/FossIdScanResults.kt#L271-L293

In ScanOSS you addressed that by multiplying the SnippetFindings (some kind of scalar product). So yes, we still need a better solution.

oss-review-toolkit / ort

Snippet Model cannot be mapped from FossID multiline range matching #7045