Open sschuberth opened 4 months ago
What we have missed to discuss so far, but I believe is important to clarify is: Project.vcs
/ Project.vcsProcessed
.
Previously, one could always set it. The feature implemented here should probably ensure we can still always set that property. So, its datatype would need to change?!
Good point about a Project
's provenance info. If the analysis is not performed on a repository, it also does not make sense to store VCS-related information for the Project
s, and these should probably be migrated to KnownProvenance
as well.
This relates a bit to @mnonnenmacher's questions about how follow-up ORT tools, that might run on a different machine / node, should get access to the provenance's / project's source code.
Would you like me to incorporate those changes to Project
into #8764 as well before we proceed?
Would you like me to incorporate those changes to
Project
into #8764 as well before we proceed?
I'm not sure. Your PR is already quite involving, so maybe we should not make it yet more complex, but roll out the changes in stages, if possible.
Yes, I see the problem: either we have breaking changes after every small pull request, or we have a large pull request, which is impossible to review.
Maybe we could create a feature branch for this topic and merge any PRs onto there first and only once it is finished you merge it into main? This way there would only be breaking changes in one release and you also had small PRs to review.
FYI: I have taken a look at the changes required for Project
and they appear to be very similar to what I already did in Repository
, so I believe adding those on top would be straight forward. It would just takes a bit more work.
@sschuberth Is there any news from the core developer meeting?
@sschuberth Is there any news from the core developer meeting?
No, because we had to skip it for this week due to appointment conflicts, sorry.
Today, we had a meeting with @fviernau and discussed potential issues regarding the current refactoring approach.
VcsInfo
sourceWhile the vcs
from Project
is said to be defined in the project's "metadata" ¹,
https://github.com/oss-review-toolkit/ort/blob/e074893cd6f0c711e206c49aa8cf31aa05f88285/model/src/main/kotlin/Project.kt#L79
but the vcs
from Repository
originates from the analyzer root in the working tree.
https://github.com/oss-review-toolkit/ort/blob/f3123a769ff0242b7f6e4131f1c60ceade55333b/model/src/main/kotlin/Repository.kt#L32
Frank pointed out this ambiguity in the kdoc strings, which may have lead to inconsistent implementation regarding the source of the VcsInfo
objects.
Further, any LocalFileProvenance
would lack metadata and would not fit the given description. Frank suggests, the semantics of the VcsInfo
s source should be clarified, before further refactoring measures are taken.
KnownProvenance
sub-classesFrank suggested to devide the KnownProvenance
class into two distinct sub-classes RemoteProvenance
and LocalProvenance
. This way a feature would be able to support RepositoryProvenance
and ArtifactProvenance
without the requirement to support all KnownProvenance
s including the new LocalProvance
s.
I believe that approach to be sensible. For one it separates the Provenance
s logically. It also allows granular control of which provenances are supported when, allowing distinction if necessary.
A scheme for such an inheritance could look like this:
VersionControlSystem
PluginWe also briefly talked about the alternative solution, @sschuberth already suggested before. Although it was not Frank's area of expertise, he assumes such an implementation would be less complex, as it only requires implementation of a single class with little to no interfaces to the rest of the code base.
1. Ambiguous
VcsInfo
source
I'm not sure I follow here about what the "ambiguity" is.
Let me start by saying that I believe the properties to be documented correctly regarding their intent. That is, IMO the Project
class basically is a common model for the data in definition files from all supported package managers.
So for example, the vcs
property is supposed to contain the data from a Maven project pom.yml
's <scm>
tag. However, data is not taken 1:1, but package-manager-specific syntax is resolved, like Maven's scm:git:...
syntax of NPM's shortcut syntax.
On top of that, the vcsProcessed
property is supposed to contain further normalized data, e.g. by mapping scp
-type Git URLs consistently to full ssh
URLs.
That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have processProjectVcs()
which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.
The current Repository
class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.
2.
KnownProvenance
sub-classes
The proposed structure makes sense to me.
That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have
processProjectVcs()
which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.
Would that be something that needs to be fixed before our provenance implementation?
The current
Repository
class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.
Sounds like the two are only indirectly related.
Would that be something that needs to be fixed before our provenance implementation?
IMO, no.
Sounds like the two are only indirectly related.
IMO, yes.
Summarizing today's meeting between @sschuberth (ORT), @mnonnenmacher (ORT), @jens-erdmann (HELLA), and myself (HELLA).
TL:DR: We are still looking to contribute on the LocalProvenance
topic and the maintainers are still interested in seeing the contribution to fruition.
VcsInfo
sourceWith the absence of @fviernau, we could not conclude this topic. For now it should not be a blocker though and we hope we can resolve it once we get there.
KnownProvenance
sub-classesThe proposed hierarchy seems like a good approach, but the refactoring overhead still has to be investigated. Specifically it is unclear, if conditional statements differentiating between RemoteProvenance
s and LocalProvance
s would outweigh the conditional statements, which would be required to limit interfaces to VcsProvenance
and ArtifactProvenance
, which would be required without the Provenance Hierarchy Refactoring.
Additional reason for such a refactoring however would be the later support for other possible Provenance
s, such as ZipProvenance
or DockerProvenance
. In light of that possibility, we updated the name of Local
FileProvenance
to DirectoryProvenance
. The previous comment was amended.
We further surmised that if this task should be realized, it should be the first incremental contribution of the Local Provenance Refactoring. @sschuberth offered to support on this and perhaps implement this feature. He and I will stay in contact on this and I will contribute as he sees fit.
VersionControlSystem
PluginI have now implemented a "Dummy" VCS plugin, which we will probably use as a workaround for the use case of scanning non-vcs software projects for the time being.
This allows us to take our time with the potenial Provenance Hierarchy Refactoring and the Local Provenance Reafactoing as a whole.
We also discussed the possibility to release said plugin as a vcs plugin example, similar to ort-package-manager-plugin.
Once the Provenance Hierarchy is clear, I can begin to rebase existing refactoring branches and we can continue the implementation. Existing branches are:
VcsInfo
inside Repository
in a RepositoryProvenance
Project
class in same wayArtifactProvenance
within the Repository
class
As ORT takes transparency and reproducibility of results serious, currently only local directories that are under version control can serve as the input to the analyzer. This is because for such "working trees" it is machine-readable where they originate from, and also the state of the source code is encoded into the VCS revision. If a remote VCS repository is to be analyzed, it either needs to be manually cloned with the respective VCS tool or the ORT Downloader first.
However, there are use-cases where the source code to analyze never was checked into version control, and committing it to a "fake" repository just to be able to analyze it defeats the purpose of capturing genuine provenance information.
Thus the proposal is to relax / extend the valid input types for the analyzer to the following:
Open questions:
Related issues / PRs: