go-enry / go-license-detector

Reliable project licenses detector.

Huge performance degradation with large license candidate files due to a bug #31

Open SamiHiltunen opened 1 year ago

SamiHiltunen commented 1 year ago

The library builds a regex from the normalized first lines of the license files and later uses that regex to split candidate files.

The problem is that the first line of the App-s2p.txt license normalizes to an empty string. The resulting empty alternative makes the regex match at the beginning and end of every line, which is easy to confirm in a regex tester. The bug is visible in the regex as ||, the spot where that license's first line would go.
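To illustrate the effect in isolation, here is a minimal sketch. It assumes a simplified multiline alternation of the form `(?m)^(a|b|...)`, which is not necessarily the exact pattern the library builds; `buildSplitter` and the sample first lines are hypothetical names made up for this demo.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildSplitter joins normalized first lines into one alternation.
// Hypothetical stand-in: the pattern shape is an assumption for this demo,
// not the library's actual regex.
func buildSplitter(firstLines []string) *regexp.Regexp {
	quoted := make([]string, len(firstLines))
	for i, line := range firstLines {
		quoted[i] = regexp.QuoteMeta(line)
	}
	return regexp.MustCompile("(?m)^(" + strings.Join(quoted, "|") + ")")
}

func main() {
	text := "some header\nmit license\nbody line\nmore body"

	// No empty first line: only the real license header matches.
	clean := buildSplitter([]string{"mit license", "apache license"})
	// One first line normalized to "": the alternation now contains "||".
	broken := buildSplitter([]string{"mit license", "", "apache license"})

	fmt.Println(len(clean.FindAllStringIndex(text, -1)))  // 1
	fmt.Println(len(broken.FindAllStringIndex(text, -1))) // 4, one per line
}
```

With the empty alternative present, the number of matches grows with the number of lines in the file, which is consistent with the slowdown on large candidate files.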

This causes a huge performance degradation in repositories with large files that match the license filename pattern. One example of such a repository is https://gitlab.com/tikiwiki/tiki, which contains a large file called copyright.txt. Detecting a license for that repository took 22s; with the patch below, it takes 260ms:

diff --git a/licensedb/internal/db.go b/licensedb/internal/db.go
index a7254fd..d69118e 100644
--- a/licensedb/internal/db.go
+++ b/licensedb/internal/db.go
@@ -176,6 +176,11 @@ func loadLicenses() *database {
        if len(header.Name) <= 6 {
            continue
        }
+
+       if header.Name == "./App-s2p.txt" {
+           continue
+       }
+
        key := header.Name[2 : len(header.Name)-4]
        text := make([]byte, header.Size)
        readSize, readErr := archive.Read(text)

What would be the appropriate fix here?