Sorry, I didn't intend to open this here just now.
@mnonnenmacher you might still want to check this list to see whether there are overlaps with the "performance story" you plan to be working on for ORT at HERE.
In that case, I added some more details to make it easier to understand what I'd like to do :)
We have these two items on our list as well, but have not yet planned when to work on them. I would suggest we notify each other before starting work on one of those:
This we have planned to start working on within the next two weeks:
I'd be very interested in what your idea is to implement this, because I want to have something similar for an experimental web frontend I plan to publish soon:
It also has a variant of:
And I'm using Exposed with a connection pool there and plan to port this to ORT as well which would cover:
That sounds great, especially since a lot of similar things are on the roadmap! Storage and repeated scanning is something I recently noticed, but I wasn't yet sure that's what I actually saw. Smaller Docker images also seem like a really good idea.
As for more agents, I don't have completely fleshed-out ideas yet, but the preliminary step is indeed to be able to split the analyzer/scanner into several runs.
I currently believe that it will be hard to have a catch-all solution for different architectures (Jenkins, Azure Pipelines, etc.), so one solution would be:
Then one could have a scheduler which, for instance, spawns Kubernetes jobs in an autoscaling cluster, maybe using something like https://www.nomadproject.io/ or https://keda.sh/ under the hood. Or a scheduler that sends workloads to a set of preconfigured VMs. Or a scheduler which orchestrates schedulers to allow a hybrid solution with VMs and a container runtime, etc.
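To make the first option a bit more concrete, a scheduler that spawns one Kubernetes Job per scanner run could look roughly like the sketch below. This is only an illustration, not a proposal for the actual implementation: it assumes a recent fabric8 Kubernetes client, and the `ort` namespace, the `ort-scanner:latest` image and the CLI arguments are placeholders.

```kotlin
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder
import io.fabric8.kubernetes.client.KubernetesClientBuilder

// Sketch only: spawn a one-off scanner run as a Kubernetes Job.
// Namespace, image name and arguments are placeholders.
fun spawnScannerJob(analyzerResultPath: String) {
    KubernetesClientBuilder().build().use { client ->
        val job = JobBuilder()
            .withNewMetadata()
                .withGenerateName("ort-scanner-")
                .withNamespace("ort")
            .endMetadata()
            .withNewSpec()
                .withBackoffLimit(0)
                .withNewTemplate()
                    .withNewSpec()
                        .withRestartPolicy("Never")
                        .addNewContainer()
                            .withName("scanner")
                            .withImage("ort-scanner:latest")
                            .withArgs("scan", "--ort-file", analyzerResultPath, "--output-dir", "/workspace")
                        .endContainer()
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build()

        // Create the Job; an autoscaling cluster (or something like KEDA) would then provide the capacity.
        client.batch().v1().jobs().inNamespace("ort").resource(job).create()
    }
}
```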
I hardly know the codebase right now, so I have absolutely no idea how hard it would be to implement this.
Just misc. comments / some ideas:
Caching of analyzer results
Maybe we could use the UploadResultToPostgresCommand as a basis for storing analyzer results, and then reusing them.
It seems that sometimes scan results are not stored and libraries are scanned repeatedly.
Maybe @oheger-bosch has already proposed a solution for this at https://github.com/oss-review-toolkit/ort/issues/3328.
And I'm using Exposed with a connection pool there and plan to port this to ORT as well which would cover:
* Check and reopen database connection
Also @oheger-bosch was briefly looking at Exposed for using a data source / connection pooling, but came across https://github.com/JetBrains/Exposed/issues/127. How does Exposed work for you without JSON support?
@Martin-Idel-SI: In the web frontend I mentioned earlier, the concept is that there will be a central server that manages the scans, and separate instances of analyzer and scanner services which get jobs from the server. I haven't made any decision regarding the technology so far, but for example JMS queues could be used for that. It's very different to the "serverless" approach we use in the Jenkins pipelines though.
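Just to illustrate the "workers pulling jobs from the server" part, and with the caveat that the technology is completely undecided, a scanner worker consuming from a JMS queue could be as small as the following sketch; the queue name and the plain-text message format are made up:

```kotlin
import javax.jms.ConnectionFactory
import javax.jms.Session
import javax.jms.TextMessage

// Sketch of a scanner worker pulling jobs from a JMS queue.
// Queue name and message format are placeholders.
fun runWorker(connectionFactory: ConnectionFactory) {
    val connection = connectionFactory.createConnection()
    connection.start()

    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val queue = session.createQueue("ort.scanner.jobs")
    val consumer = session.createConsumer(queue)

    while (true) {
        // Block until the central server publishes the next job.
        val message = consumer.receive() as? TextMessage ?: continue
        val analyzerResultUrl = message.text

        // Here the worker would fetch the analyzer result, run the scanner
        // and report the scan result back to the central server.
        println("Scanning $analyzerResultUrl")
    }
}
```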
Caching of analyzer results
Maybe we could use the UploadResultToPostgresCommand as a basis for storing analyzer results, and then reusing them.
Yes, I thought that as well, but we might have different ideas about when cached results are used, so we should align on that. But we have no clear proposal so far, just the basic idea of making a hash of the definition files to check if a new run is required.
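The hashing part itself could be quite small; here is a rough sketch (the function names are made up, and what exactly goes into the hash is precisely what we would need to align on):

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical helper: compute one stable hash over all definition files of a project,
// so a cached analyzer result can be reused as long as the hash is unchanged.
fun hashDefinitionFiles(definitionFiles: Collection<File>): String {
    val digest = MessageDigest.getInstance("SHA-256")

    // Sort by path so the hash does not depend on traversal order.
    definitionFiles.sortedBy { it.invariantSeparatorsPath }.forEach { file ->
        digest.update(file.invariantSeparatorsPath.toByteArray())
        digest.update(file.readBytes())
    }

    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Sketch of the cache lookup: only run the analyzer again if the hash is new.
fun <T> getOrAnalyze(cache: MutableMap<String, T>, definitionFiles: Collection<File>, analyze: () -> T): T {
    val key = hashDefinitionFiles(definitionFiles)
    return cache.getOrPut(key) { analyze() }
}
```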
It seems that sometimes scan results are not stored and libraries are scanned repeatedly.
Maybe @oheger-bosch has already proposed a solution for this at #3328.
It's not related to the connection; for some reason the storage sometimes complains that the raw result is null, even though there are detected licenses. I need to investigate that further before I can say more about the underlying issue.
Also @oheger-bosch was briefly looking at Exposed for using a data source / connection pooling, but came across JetBrains/Exposed#127. How does Exposed work for you without JSON support?
I have made a custom column type similar to what's discussed in the issue. It does not support any of the JSON specific operators, but that was not an issue so far.
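For reference, such a custom column type could look roughly like this. The exact `ColumnType` API differs a bit between Exposed versions, so take it as a sketch rather than the actual code:

```kotlin
import org.jetbrains.exposed.sql.Column
import org.jetbrains.exposed.sql.ColumnType
import org.jetbrains.exposed.sql.Table
import org.postgresql.util.PGobject

// Sketch of a custom column type that maps a JSON string to a PostgreSQL "jsonb" column.
// No JSON-specific operators are supported; the value is treated as plain text.
class JsonbColumnType : ColumnType() {
    override fun sqlType() = "jsonb"

    // Wrap the serialized JSON in a PGobject so the JDBC driver uses the correct type.
    override fun notNullValueToDB(value: Any): Any {
        val obj = PGobject()
        obj.type = "jsonb"
        obj.value = value as String
        return obj
    }

    // Read the column back as a plain JSON string.
    override fun valueFromDB(value: Any): Any = if (value is PGobject) value.value.orEmpty() else value
}

// Convenience function to declare such a column in a table definition.
fun Table.jsonb(name: String): Column<String> = registerColumn(name, JsonbColumnType())
```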
Sounds good, as using Exposed would also solve https://github.com/oss-review-toolkit/ort/issues/3328, IIUC what @oheger-bosch found out (Exposed already seems to use a data source).
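For completeness, wiring Exposed to a pooled data source could look like the sketch below, e.g. with HikariCP; the connection details are just placeholders:

```kotlin
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.jetbrains.exposed.sql.Database

// Sketch: hand Exposed a pooled DataSource instead of a single JDBC connection.
// URL, credentials and pool size are placeholders.
fun connectWithPool(): Database {
    val config = HikariConfig().apply {
        jdbcUrl = "jdbc:postgresql://localhost:5432/ort"
        username = "ort"
        password = "secret"
        maximumPoolSize = 5
    }

    return Database.connect(HikariDataSource(config))
}
```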
@mnonnenmacher I'm fine with a central server (definitely if it simplifies implementation). That may produce some overhead in terms of how many agents we need, but if all other components scale it shouldn't be a problem.
As for the technology, it would be nice to make it swappable. JMS queues, for example, may be simple to use in one environment, but from what I have seen they are very challenging and too heavy-weight for most cloud environments (serverless or not). So if possible I'd like to make the implementation of the scheduling component swappable.
I guess we'll have to prototype and see what works and how well it works.
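To keep the scheduling technology swappable, a thin abstraction might already be enough. The interface and names below are purely hypothetical:

```kotlin
// Hypothetical abstraction for the scheduling component, so the queueing / orchestration
// technology (JMS, Kubernetes Jobs, preconfigured VMs, ...) stays swappable.
interface JobScheduler {
    // Submit an analyzer or scanner job and return an identifier for tracking it.
    fun schedule(job: OrtJob): String
}

// Minimal job description; real jobs would carry more context (repository, config, ...).
data class OrtJob(val type: JobType, val input: String)

enum class JobType { ANALYZER, SCANNER }

// One implementation per environment, selected via configuration, e.g.:
// class KubernetesJobScheduler(...) : JobScheduler { ... }
// class JmsQueueScheduler(...) : JobScheduler { ... }
```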
I'm fine with a central server (definitely if it simplifies implementation).
Also see https://github.com/oss-review-toolkit/ort/issues/4688.
Possible improvements on the infrastructure side of things
By now, these topics are a better fit for https://github.com/eclipse-apoapsis/ort-server.
As the other topics have mostly been addressed, I'm closing this issue after the discussion with @oss-review-toolkit/core-devs.