Sorry, I didn't intend to open this here just now.
@mnonnenmacher you might still want to check this list to see whether there are overlaps with the "performance story" you plan to be working on for ORT at HERE.
In that case, I added some more details to make it easier to understand what I'd like to do :)
We have these two items on our list as well, but have not yet planned when to work on them. I would suggest we notify each other before starting work on one of those:
This we have planned to start working on within the next two weeks:
I'd be very interested in what your idea is to implement this, because I want to have something similar for an experimental web frontend I plan to publish soon:
It also has a variant of:
And I'm using Exposed with a connection pool there and plan to port this to ORT as well which would cover:
That sounds great, especially since a lot of similar things are on the roadmap! Storage and repeated scanning is something I recently noticed, but I wasn't yet sure that's what I actually saw. Smaller Docker images also seem like a really good idea.
As for more agents, I don't have completely fleshed-out ideas yet, but the preliminary step is indeed to be able to split the analyzer/scanner into several runs.
I currently believe that it will be hard to have a catch-all solution for different architectures (Jenkins, Azure Pipelines, etc.), so one solution would be:
Then one could have a scheduler which, for instance, spawns Kubernetes jobs in an autoscaling cluster, maybe using something like https://www.nomadproject.io/ or https://keda.sh/ under the hood. Or a scheduler that sends workloads to a set of preconfigured VMs. Or a scheduler which orchestrates schedulers to allow a hybrid solution with VMs and a container runtime, etc.
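To make the first option a bit more concrete, a scheduler that spawns one Kubernetes Job per scanner run could look roughly like the sketch below. This is only an illustration, not a proposal for the actual implementation: it assumes a recent fabric8 Kubernetes client, and the `ort` namespace, the `ort-scanner:latest` image and the CLI arguments are placeholders.

```kotlin
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder
import io.fabric8.kubernetes.client.KubernetesClientBuilder

// Sketch only: spawn a one-off scanner run as a Kubernetes Job.
// Namespace, image name and arguments are placeholders.
fun spawnScannerJob(analyzerResultPath: String) {
    KubernetesClientBuilder().build().use { client ->
        val job = JobBuilder()
            .withNewMetadata()
                .withGenerateName("ort-scanner-")
                .withNamespace("ort")
            .endMetadata()
            .withNewSpec()
                .withBackoffLimit(0)
                .withNewTemplate()
                    .withNewSpec()
                        .withRestartPolicy("Never")
                        .addNewContainer()
                            .withName("scanner")
                            .withImage("ort-scanner:latest")
                            .withArgs("scan", "--ort-file", analyzerResultPath, "--output-dir", "/workspace")
                        .endContainer()
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build()

        // Create the Job; an autoscaling cluster (or something like KEDA) would then provide the capacity.
        client.batch().v1().jobs().inNamespace("ort").resource(job).create()
    }
}
```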
I hardly know the codebase right now, so I have absolutely no idea how hard it would be to implement this.
Just misc. comments / some ideas:
Caching of analyzer results
Maybe we could use the UploadResultToPostgresCommand as a basis for storing analyzer results, and then reusing them.
It seems that sometimes scan results are not stored and libraries are scanned repeatedly.
Maybe @oheger-bosch has already proposed a solution for this at https://github.com/oss-review-toolkit/ort/issues/3328.
And I'm using Exposed with a connection pool there and plan to port this to ORT as well which would cover:
* Check and reopen database connection
Also @oheger-bosch was briefly looking at Exposed for using a data source / connection pooling, but came across https://github.com/JetBrains/Exposed/issues/127. How does Exposed work for you without JSON support?
@Martin-Idel-SI: In the web frontend I mentioned earlier, the concept is that there will be a central server that manages the scans, and separate instances of analyzer and scanner services which get jobs from the server. I haven't made any decision regarding the technology so far, but for example JMS queues could be used for that. It's very different to the "serverless" approach we use in the Jenkins pipelines though.
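Just to illustrate the "workers pulling jobs from the server" part, and with the caveat that the technology is completely undecided, a scanner worker consuming from a JMS queue could be as small as the following sketch; the queue name and the plain-text message format are made up:

```kotlin
import javax.jms.ConnectionFactory
import javax.jms.Session
import javax.jms.TextMessage

// Sketch of a scanner worker pulling jobs from a JMS queue.
// Queue name and message format are placeholders.
fun runWorker(connectionFactory: ConnectionFactory) {
    val connection = connectionFactory.createConnection()
    connection.start()

    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val queue = session.createQueue("ort.scanner.jobs")
    val consumer = session.createConsumer(queue)

    while (true) {
        // Block until the central server publishes the next job.
        val message = consumer.receive() as? TextMessage ?: continue
        val analyzerResultUrl = message.text

        // Here the worker would fetch the analyzer result, run the scanner
        // and report the scan result back to the central server.
        println("Scanning $analyzerResultUrl")
    }
}
```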
Caching of analyzer results
Maybe we could use the UploadResultToPostgresCommand as a basis for storing analyzer results, and then reusing them.
Yes, I thought that as well, but we might have different ideas about when cached results are used, so we should align on that. But we have no clear proposal so far, just the basic idea of making a hash of the definition files to check if a new run is required.
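The hashing part itself could be quite small; here is a rough sketch (the function names are made up, and what exactly goes into the hash is precisely what we would need to align on):

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical helper: compute one stable hash over all definition files of a project,
// so a cached analyzer result can be reused as long as the hash is unchanged.
fun hashDefinitionFiles(definitionFiles: Collection<File>): String {
    val digest = MessageDigest.getInstance("SHA-256")

    // Sort by path so the hash does not depend on traversal order.
    definitionFiles.sortedBy { it.invariantSeparatorsPath }.forEach { file ->
        digest.update(file.invariantSeparatorsPath.toByteArray())
        digest.update(file.readBytes())
    }

    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Sketch of the cache lookup: only run the analyzer again if the hash is new.
fun <T> getOrAnalyze(cache: MutableMap<String, T>, definitionFiles: Collection<File>, analyze: () -> T): T {
    val key = hashDefinitionFiles(definitionFiles)
    return cache.getOrPut(key) { analyze() }
}
```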
It seems that sometimes scan results are not stored and libraries are scanned repeatedly.
Maybe @oheger-bosch has already proposed a solution for this at #3328.
It's not related to the connection; for some reason the storage sometimes complains that the raw result is null, even though there are detected licenses. I need to investigate that further before I can say more about the underlying issue.
Also @oheger-bosch was briefly looking at Exposed for using a data source / connection pooling, but came across JetBrains/Exposed#127. How does Exposed work for you without JSON support?
I have made a custom column type similar to what's discussed in the issue. It does not support any of the JSON specific operators, but that was not an issue so far.
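For reference, such a custom column type could look roughly like this. The exact `ColumnType` API differs a bit between Exposed versions, so take it as a sketch rather than the actual code:

```kotlin
import org.jetbrains.exposed.sql.Column
import org.jetbrains.exposed.sql.ColumnType
import org.jetbrains.exposed.sql.Table
import org.postgresql.util.PGobject

// Sketch of a custom column type that maps a JSON string to a PostgreSQL "jsonb" column.
// No JSON-specific operators are supported; the value is treated as plain text.
class JsonbColumnType : ColumnType() {
    override fun sqlType() = "jsonb"

    // Wrap the serialized JSON in a PGobject so the JDBC driver uses the correct type.
    override fun notNullValueToDB(value: Any): Any {
        val obj = PGobject()
        obj.type = "jsonb"
        obj.value = value as String
        return obj
    }

    // Read the column back as a plain JSON string.
    override fun valueFromDB(value: Any): Any = if (value is PGobject) value.value.orEmpty() else value
}

// Convenience function to declare such a column in a table definition.
fun Table.jsonb(name: String): Column<String> = registerColumn(name, JsonbColumnType())
```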
Sounds good, as using Exposed would also solve https://github.com/oss-review-toolkit/ort/issues/3328, IIUC what @oheger-bosch found out (Exposed already seems to use a data source).
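For completeness, wiring Exposed to a pooled data source could look like the sketch below, e.g. with HikariCP; the connection details are just placeholders:

```kotlin
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.jetbrains.exposed.sql.Database

// Sketch: hand Exposed a pooled DataSource instead of a single JDBC connection.
// URL, credentials and pool size are placeholders.
fun connectWithPool(): Database {
    val config = HikariConfig().apply {
        jdbcUrl = "jdbc:postgresql://localhost:5432/ort"
        username = "ort"
        password = "secret"
        maximumPoolSize = 5
    }

    return Database.connect(HikariDataSource(config))
}
```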
@mnonnenmacher I'm fine with a central server (definitely if it simplifies implementation). That may produce some overhead in terms of how many agents we need, but if all other components scale it shouldn't be a problem.
As for the technology, it would be nice to make it swappable. JMS queues, for example, may be simple to use in one environment, but from what I have seen they are very challenging and too heavy-weight for most cloud environments (serverless or not). So if possible I'd like to make the implementation of the scheduling component swappable.
I guess we'll have to prototype and see what works and how well it works.
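To keep the scheduling technology swappable, a thin abstraction might already be enough. The interface and names below are purely hypothetical:

```kotlin
// Hypothetical abstraction for the scheduling component, so the queueing / orchestration
// technology (JMS, Kubernetes Jobs, preconfigured VMs, ...) stays swappable.
interface JobScheduler {
    // Submit an analyzer or scanner job and return an identifier for tracking it.
    fun schedule(job: OrtJob): String
}

// Minimal job description; real jobs would carry more context (repository, config, ...).
data class OrtJob(val type: JobType, val input: String)

enum class JobType { ANALYZER, SCANNER }

// One implementation per environment, selected via configuration, e.g.:
// class KubernetesJobScheduler(...) : JobScheduler { ... }
// class JmsQueueScheduler(...) : JobScheduler { ... }
```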
I'm fine with a central server (definitely if it simplifies implementation).
Also see https://github.com/oss-review-toolkit/ort/issues/4688.
Possible improvements on the infrastructure side of things
By now, these topics are a better fit for https://github.com/eclipse-apoapsis/ort-server.
As the other topics have mostly been addressed, I'm closing this issue after the discussion with @oss-review-toolkit/core-devs.