oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0
1.61k stars 313 forks source link

Allow "non-repository" input to the analyzer #8803

Open sschuberth opened 4 months ago

sschuberth commented 4 months ago

As ORT takes transparency and reproducibility of results serious, currently only local directories that are under version control can serve as the input to the analyzer. This is because for such "working trees" it is machine-readable where they originate from, and also the state of the source code is encoded into the VCS revision. If a remote VCS repository is to be analyzed, it either needs to be manually cloned with the respective VCS tool or the ORT Downloader first.

However, there are use-cases where the source code to analyze never was checked into version control, and committing it to a "fake" repository just to be able to analyze it defeats the purpose of capturing genuine provenance information.

Thus the proposal is to relax / extend the valid input types for the analyzer to the following:

Open questions:

Related issues / PRs:

fviernau commented 4 months ago

What we have missed to discuss so far, but I believe is important to clarify is: Project.vcs / Project.vcsProcessed. Previously, one could always set it. The feature implemented here should probably ensure we can still always set that property. So, its datatype would need to change?!

sschuberth commented 4 months ago

Good point about a Project's provenance info. If the analysis is not performed on a repository, it also does not make sense to store VCS-related information for the Projects, and these should probably be migrated to KnownProvenance as well.

This relates a bit to @mnonnenmacher's questions about how follow-up ORT tools, that might run on a different machine / node, should get access to the provenance's / project's source code.

pepper-jk commented 4 months ago

Would you like me to incorporate those changes to Project into #8764 as well before we proceed?

sschuberth commented 4 months ago

Would you like me to incorporate those changes to Project into #8764 as well before we proceed?

I'm not sure. Your PR is already quite involving, so maybe we should not make it yet more complex, but roll out the changes in stages, if possible.

pepper-jk commented 4 months ago

Yes, I see the problem: either we have breaking changes after every small pull request, or we have a large pull request, which is impossible to review.

Maybe we could create a feature branch for this topic and merge any PRs onto there first and only once it is finished you merge it into main? This way there would only be breaking changes in one release and you also had small PRs to review.

FYI: I have taken a look at the changes required for Project and they appear to be very similar to what I already did in Repository, so I believe adding those on top would be straight forward. It would just takes a bit more work.

pepper-jk commented 4 months ago

@sschuberth Is there any news from the core developer meeting?

sschuberth commented 4 months ago

@sschuberth Is there any news from the core developer meeting?

No, because we had to skip it for this week due to appointment conflicts, sorry.

pepper-jk commented 4 months ago

Today, we had a meeting with @fviernau and discussed potential issues regarding the current refactoring approach.

1. Ambiguous VcsInfo source

While the vcs from Project is said to be defined in the project's "metadata" ¹, https://github.com/oss-review-toolkit/ort/blob/e074893cd6f0c711e206c49aa8cf31aa05f88285/model/src/main/kotlin/Project.kt#L79 but the vcs from Repository originates from the analyzer root in the working tree. https://github.com/oss-review-toolkit/ort/blob/f3123a769ff0242b7f6e4131f1c60ceade55333b/model/src/main/kotlin/Repository.kt#L32

Frank pointed out this ambiguity in the kdoc strings, which may have lead to inconsistent implementation regarding the source of the VcsInfo objects. Further, any LocalFileProvenance would lack metadata and would not fit the given description. Frank suggests, the semantics of the VcsInfos source should be clarified, before further refactoring measures are taken.

2. KnownProvenance sub-classes

Frank suggested to devide the KnownProvenance class into two distinct sub-classes RemoteProvenance and LocalProvenance. This way a feature would be able to support RepositoryProvenance and ArtifactProvenance without the requirement to support all KnownProvenances including the new LocalProvances.

I believe that approach to be sensible. For one it separates the Provenances logically. It also allows granular control of which provenances are supported when, allowing distinction if necessary.

A scheme for such an inheritance could look like this:

3. "Dummy" VersionControlSystem Plugin

We also briefly talked about the alternative solution, @sschuberth already suggested before. Although it was not Frank's area of expertise, he assumes such an implementation would be less complex, as it only requires implementation of a single class with little to no interfaces to the rest of the code base.

sschuberth commented 3 months ago

1. Ambiguous VcsInfo source

I'm not sure I follow here about what the "ambiguity" is.

Let me start by saying that I believe the properties to be documented correctly regarding their intent. That is, IMO the Project class basically is a common model for the data in definition files from all supported package managers.

So for example, the vcs property is supposed to contain the data from a Maven project pom.yml's <scm> tag. However, data is not taken 1:1, but package-manager-specific syntax is resolved, like Maven's scm:git:... syntax of NPM's shortcut syntax.

On top of that, the vcsProcessed property is supposed to contain further normalized data, e.g. by mapping scp-type Git URLs consistently to full ssh URLs.

That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have processProjectVcs() which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.

The current Repository class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.

2. KnownProvenance sub-classes

The proposed structure makes sense to me.

pepper-jk commented 3 months ago

That said, I believe the implementation of the above is partly wrong in some package managers. For example, we have processProjectVcs() which enriches "VCS information determined from the project's directory with VCS information determined from the project's metadata", mixing two potentially unrelated sources of VCS information.

Would that be something that needs to be fixed before our provenance implementation?

The current Repository class on the other hand is really only supposed to record the (repository) provenance of the input directory the analyzer was run on. It would also be present if no projects were found at all.

Sounds like the two are only indirectly related.

sschuberth commented 3 months ago

Would that be something that needs to be fixed before our provenance implementation?

IMO, no.

Sounds like the two are only indirectly related.

IMO, yes.

pepper-jk commented 2 months ago

Summarizing today's meeting between @sschuberth (ORT), @mnonnenmacher (ORT), @jens-erdmann (HELLA), and myself (HELLA).

TL:DR: We are still looking to contribute on the LocalProvenance topic and the maintainers are still interested in seeing the contribution to fruition.

1. Ambiguous VcsInfo source

With the absence of @fviernau, we could not conclude this topic. For now it should not be a blocker though and we hope we can resolve it once we get there.

2. KnownProvenance sub-classes

The proposed hierarchy seems like a good approach, but the refactoring overhead still has to be investigated. Specifically it is unclear, if conditional statements differentiating between RemoteProvenances and LocalProvances would outweigh the conditional statements, which would be required to limit interfaces to VcsProvenance and ArtifactProvenance, which would be required without the Provenance Hierarchy Refactoring.

Additional reason for such a refactoring however would be the later support for other possible Provenances, such as ZipProvenance or DockerProvenance. In light of that possibility, we updated the name of Local FileProvenance to DirectoryProvenance. The previous comment was amended.

We further surmised that if this task should be realized, it should be the first incremental contribution of the Local Provenance Refactoring. @sschuberth offered to support on this and perhaps implement this feature. He and I will stay in contact on this and I will contribute as he sees fit.

3. "Dummy" VersionControlSystem Plugin

I have now implemented a "Dummy" VCS plugin, which we will probably use as a workaround for the use case of scanning non-vcs software projects for the time being.

This allows us to take our time with the potenial Provenance Hierarchy Refactoring and the Local Provenance Reafactoing as a whole.

We also discussed the possibility to release said plugin as a vcs plugin example, similar to ort-package-manager-plugin.

4. Further Local Provenance Refactoring

Once the Provenance Hierarchy is clear, I can begin to rebase existing refactoring branches and we can continue the implementation. Existing branches are: