Open nnobelis opened 1 year ago
Theoretically, another option would be to make sourceLocation
a List<TextLocation>
(where by convention the path
s all must be the same; but I don't like this potential source of inconsistencies too much myself).
Split the snippet match in two
SnippetFinding
, to represent the two ranges ? This could get messy for more complicated matches.
This would have been my favorite. Can you explain how it could get messy? Isn't it pretty much straight forward to create one SnippetFinding
for each consecutive range of lines?
As a semi-related question, do you know what they keys and values in local_file
are? They seem to be lines numbers, but of what? Local vs matched file? Also, if these are always line numbers, i.e. integers, can we adapt our MatchedLines
model accordingly?
This would have been my favorite. Can you explain how it could get messy? Isn't it pretty much straight forward to create one
SnippetFinding
for each consecutive range of lines?
You remember, we agreed to do the same in case multiple purls are delivered by the scanner. Therefore, if you get multiple purls AND multiple line ranges, you would have to do some kind of scalar product and create n*m
snippet findings.
However, thinking about it, I think I was just overly cautious: multiple purls are only in ScanOSS results while multiple line ranges seems to be only in FossID.
I think I will go with this solution. This is the job of the Reporter anyway to aggregate these multiple snippets in a single consolidated view anyway.
As a semi-related question, do you know what they keys and values in local_file are? They seem to be lines numbers, but of what? Local vs matched file? Also, if these are always line numbers, i.e. integers, can we adapt our MatchedLines model accordingly?
Yes keys and values are always the same, this is a ugliness that FossID had in other responses too. They represent the matched lines in the local file and in the mirror file.
I am going to see if I can simplify MatchedLines
to :
data class MatchedLines(
val localFile: List<Int>,
val mirrorFile: List<Int>
)
I will prepare a PR with the MatchedLines
changes plus the listing of matched lines in the scanner (flag-controlled because of https://github.com/oss-review-toolkit/ort/issues/7028) and with the duplication of the snippets in case of multiple line ranges as we discussed.
Looks good to you ?
Yes keys and values are always the same, this is a ugliness that FossID had in other responses too. They represent the matched lines in the local file and in the mirror file.
Well, if the numbers are local vs matched lines, when they probably are very often the same (and may have been the same in all responses you saw so far), but IIUC they could be different if e.g. the local file got a header stripped after being copied from the matched file. Then there would be a constant offset for all the key / value numbers in the map.
I am going to see if I can simplify
MatchedLines
to :
As described above, I don't believe we can to that. However, we probably can simplify to val localFile: Map<Int, Int>
, plus add docs about the keys / values mean in this map.
And maybe we can do the same for mirrorFile
. Any idea what that property is about?
There was a misunderstanding:
localFile
contains the local matched lines (=in your source file). It's a map with identical key / values e.g.:
"local_file": {
"1": "1",
"2": "2",
"3": "3",
In mirrorFile
, you get the matched lines in the snippet. They too have identical key/values.
In
mirrorFile
, you get the matched lines in the snippet.
How do you then know which line number in the localFile
matches which line number in mirrorFile
?
I made a small vid of the UI.
https://github.com/oss-review-toolkit/ort/assets/66427221/7e2d5e97-6c6c-43ac-b5a7-a29dd3f4f2d8
Notice how the highlighted sections don't "stick" with each other like in normal diff tools ?
If I had to guess, I would say FossID just highlight every matched line in the left and right panel, and that's it.
Hmm, so you saying that FossID does not know that line 25 on the left corresponds to line 276 on the right?
How does the view look like initially, before any scrolling? If the initial view is like at the start of your video, it does seem like FossID align these hunks, and that it does know that these lines correspond to each other.
Hmm, so you saying that FossID does not know that line 25 on the left corresponds to line 276 on the right?
Exactly.
How does the view look like initially, before any scrolling?
Both panels are scrolled to the first matched (=highlighted) hunk.
If the initial view is like at the start of your video, it does seem like FossID align these hunks, and that it does know that these lines correspond to each other.
Unless the UI has some basic logic to always scroll the panels to the first matched (=highlighted) hunk. The user gets the illusion that the two hunks go together (and they most likely do since they are the firsts in the source file and in the snippet).
In all case, through the API there is no way to know which hunk match which one. Therefore we have a problem here:
We could split multiple line ranges in the source file matched lines (=local_file
) into multiple snippets, as we discussed.
But what if they are also multiple line ranges in the snippet matched lines (=remote_file
)?
We cannot split it again in multiple snippet because we don't know which line range in the source file match which line range in the snippet 😞
And again snippet matched lines (SnippetFinding->snippet->location
) supports only a single line range.
Ok, here's a thought: Do we even care about the local location of the snippet in our / the scanned code? Or is it enough to know about the the remote / "mirror file" snippet location in "upstream" code? I believe our current data model is only designed to capture the latter, and that's also how SnippetFinding
is documented: "A snippet finding is a code snippet from another origin".
Given that, we'd only need to look at the line ranges in mirrorFile
, and disregard localFile
completely.
If we'd want to capture localFile
, too, then I believe the data model needs to be changed.
Ok, here's a thought: Do we even care about the local location of the snippet in our / the scanned code? Or is it enough to know about the the remote / "mirror file" snippet location in "upstream" code? I believe our current data model is only designed to capture the latter, and that's also how
SnippetFinding
is documented: "A snippet finding is a code snippet from another origin".
You forgot the part after the comma matching the code being scanned
😃
Yes we care about the local location. The user expects to see which lines in his/her code has been matched against the snippet.
If we'd want to capture
localFile
, too, then I believe the data model needs to be changed.
In which way ?
Shall we include an optional part in TextLocation
to carry additional ranges or introduce a new TextRangeLocation
and use it instead in the SnippetFinding
?
For the latter, we could introduce an interface to get the file (=path
) out of both TextLocation
and TextRangeLocation
.
If we'd want to capture
localFile
, too, then I believe the data model needs to be changed.In which way ?
I was mislead, I forgot that Snippet
also has a TextLocation
. So ideally, FossID.localFile
maps to SnippetFinding.sourceLocation
, and FossID.mirrorFile
maps to Snippet.location
.
But that whole model assumes that we know exactly which source location maps to which upstream location.
I currently have no idea how to handle FossID's overly generic data model here...
I was mislead, I forgot that
Snippet
also has aTextLocation
. So ideally,FossID.localFile
maps toSnippetFinding.sourceLocation
, andFossID.mirrorFile
maps toSnippet.location
.
FossID.mirrorFile
maps to SnippetFinding.snippet.location
to be precise.
But that whole model assumes that we know exactly which source location maps to which upstream location.
I currently have no idea how to handle FossID's overly generic data model here...
Unless we change, as suggested in my previous comment, the snippet model to not carry a single location from localFile
and mirrorFile
matching together, but putting all locations from localFile
and mirrorFile
in a single SnippetFinding
.
In this case, we would not have to care which one match with which one.
An idea:
Let's say local_file
lines are [1-10, 20-30, 40-50] and mirror_file
lines are [1-10, 30-40].
We could do the scalar product (6 snippet findings):
SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=30-40
or even try to at least cover all the values (3 snippet findings):
SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=30-40
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10
or by convention iterate all the values with the first one of the other file (5 snippet findings):
SnippetFinding.sourceLocation=1-10
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=20-30
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=40-50
SnippetFinding.snippet.location=1-10 // first one of mirror_file
SnippetFinding.sourceLocation=1-10 // first one of local_file
SnippetFinding.snippet.location=1-10
SnippetFinding.sourceLocation=1-10 // first one of local_file
SnippetFinding.snippet.location=30-40
In the end this is not important.
This is the job of the Reporter
, or whatever is consuming the results, to build the information and collates everything back to [1-10, 20-30, 40-50] and [1-10, 30-40]. Building the intervals with all possible values.
Which is the only information that makes sense. A single snippet finding needs to be evaluated in context from the other snippet findings.
This is a bit dirty be but this won't require a model change.
Personally, I would still prefer a model change with a new TextRangeLocation
: there would be less snippets and they would all make sense individually.
But I can live with both solutions.
@sschuberth : As a temporary fix, I thought about storing the whole multi line range information in SnippetFinding.snippet.additionalData
under two keys: matchedLinesSource
and matchedLinesSnippet
.
Then, in SnippetFinding.sourceLocation
and SnippetFinding.snippet.location
we can store the first line range in case of multi line ranges.
In a way, this is fitting because FossID multiline ranges is scanner-specific data. What do you think about it, I stress it, as a temporary solution ?
I lost a bit track of this issue, but wanted to note that by now we have https://github.com/oss-review-toolkit/ort/pull/9024. Does that somehow help to resolve this issue, @nnobelis?
The "temporary" fix is still active, but only for the snippet matched lines (notice the firstOrNull
call):
In ScanOSS you addressed that by multiplying the SnippetFinding
s (some kind of scalar product).
So yes, we still need a better solution.
Currently, for performance reasons (see https://github.com/oss-review-toolkit/ort/issues/7028), the matched lines are not fetched from FossID.
However it seems the current snippet model is not capable of representing certain FossID responses.
For instance, for a given snippet match in FossID, I receive the following answer for the listMatchedLines function (truncated for clarity):
local_file
means this is the lines of the source file being matched. If one "compress" these lines into line ranges, the result is:1-24
and45-675
.Now this information is supposed to go in
SnippetFinding.sourceLocation
: https://github.com/oss-review-toolkit/ort/blob/6f6e91759730ec20dc172b60612d5d76ab35c232/model/src/main/kotlin/SnippetFinding.kt#L31Unfortunately,
TextLocation
in ORT can carry only two integers for the line information.So what can be done ?
SnippetFinding
, to represent the two ranges ? This could get messy for more complicated matches.TextLocation
for the snippet finding byTextRangeLocation
, a new class that can specify multiple range of lines ?TextLocation
and ignore the rest (or store it in another property)?As a side note, ScanOSS seems to deliver a single range:
but this is only an assumption, as I couldn't reproduce more complex cases of matching.