github / vscode-codeql

An extension for Visual Studio Code that adds rich language support for CodeQL
https://marketplace.visualstudio.com/items?itemName=GitHub.vscode-codeql
MIT License

Importing certain databases from zip is extremely slow #776

Open aeisenberg opened 3 years ago

aeisenberg commented 3 years ago

I can't share the database because it is private, but it is less than 1GB zipped and about 3.4GB unzipped. Importing it took over 1.5 hours to complete. I think the problem is that there are over 100,000 source files. First, these files need to be unzipped and placed in a directory. Then we need to re-zip them into a src.zip file. Our current unzipper library uses streaming, but it still reads each file into memory separately.

It's possible that the slowness is exacerbated by the fix here: https://github.com/github/vscode-codeql/issues/622. Rather than read the zip in a single pass, we now read the zip's central directory and then grab each file based on what we find there.

I implemented this fix because some archives do not have correct file headers, which can happen when a zip file is updated after it is created. The central directory is the source of truth, and it sits at the end of the file.

Most of the time the file headers are correct, and reading them in a single pass is likely faster than the central directory approach, especially when there are lots of small files. So one possible solution would be to try reading via the file headers first and, if that fails, fall back to the central directory.
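To make the proposed fallback concrete, here is a minimal sketch in Python rather than the extension's own TypeScript. It is not the extension's code: `read_via_local_headers`, `read_archive`, and `LocalHeaderError` are hypothetical names, and the whole archive is held in memory for brevity. The fast path walks the local file headers in one forward pass. If anything looks wrong, it falls back to Python's `zipfile` module, which reads the central directory.

```python
import io
import struct
import zipfile
import zlib

LOCAL_SIG = b"PK\x03\x04"                      # local file header signature
END_SIGS = (b"PK\x01\x02", b"PK\x05\x06")      # central directory / end record
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # fixed 30-byte local header


class LocalHeaderError(Exception):
    """Hypothetical error: the single-pass read cannot be trusted."""


def read_via_local_headers(data: bytes) -> dict[str, bytes]:
    """Fast path: one forward pass that trusts the local file headers."""
    files: dict[str, bytes] = {}
    pos = 0
    while data[pos:pos + 4] == LOCAL_SIG:
        try:
            (_sig, _ver, flags, method, _time, _date, crc,
             csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        except struct.error as exc:
            raise LocalHeaderError("truncated local header") from exc
        if flags & 0x08:
            # Sizes live in a trailing data descriptor, so they are unknown here.
            raise LocalHeaderError("entry uses a data descriptor")
        if csize == 0xFFFFFFFF:
            raise LocalHeaderError("zip64 entries are not handled in this sketch")
        name_start = pos + LOCAL_HEADER.size
        body_start = name_start + name_len + extra_len
        encoding = "utf-8" if flags & 0x800 else "cp437"
        name = data[name_start:name_start + name_len].decode(encoding)
        raw = data[body_start:body_start + csize]
        if len(raw) != csize:
            raise LocalHeaderError("entry runs past the end of the archive")
        if method == 0:        # stored
            content = raw
        elif method == 8:      # deflate (raw stream, hence negative wbits)
            try:
                content = zlib.decompress(raw, -15)
            except zlib.error as exc:
                raise LocalHeaderError("bad deflate stream") from exc
        else:
            raise LocalHeaderError(f"unsupported compression method {method}")
        if zlib.crc32(content) != crc:
            raise LocalHeaderError(f"CRC mismatch for {name}")
        files[name] = content
        pos = body_start + csize
    if data[pos:pos + 4] not in END_SIGS:
        raise LocalHeaderError("local headers did not lead to the central directory")
    return files


def read_archive(data: bytes) -> dict[str, bytes]:
    """Try the single pass first; fall back to the central directory."""
    try:
        return read_via_local_headers(data)
    except LocalHeaderError:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
```

One limitation of this sketch: it only falls back when the fast path detects a problem. A stale entry left behind by an in-place update could still parse cleanly, so a real implementation would probably also want to compare the entries found against the central directory listing.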

adityasharad commented 3 years ago

This sounds reasonable. One way to keep stress-testing this would be to generate a test database with a large number of small JS files, and make sure the import function can handle them.
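A minimal sketch of how such a stress-test archive could be generated, in Python for brevity. `make_many_small_files_zip` is a hypothetical helper, and the result is only a zip of many tiny JS files; it does not reproduce a real CodeQL database layout, so it would still need to be wrapped in whatever structure the import function expects.

```python
import zipfile
from pathlib import Path


def make_many_small_files_zip(zip_path: Path, file_count: int) -> None:
    """Write an archive containing `file_count` tiny JS source files."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(file_count):
            # Spread the files over 100 directories, roughly like a source tree.
            name = f"src/dir{i % 100:03d}/file{i:06d}.js"
            zf.writestr(name, f"export const value{i} = {i};\n")
```

Calling it with a `file_count` above 100,000 would approximate the file count reported in this issue, though not the 3.4GB unzipped size.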

jbj commented 3 years ago

The problematic database is created by odasa rather than codeql. Therefore(?) it contains the sources directly instead of containing a src.zip. Do our current tools create databases without src.zip? If not, I don't see a strong reason to fix this issue.

aeisenberg commented 3 years ago

No, they don't. IIRC, about half of the import process here was taken up by re-zipping. Even with that, 45 minutes to unzip a database is too long, especially when it is faster on the CLI.

jbj commented 3 years ago

In any case, I haven't seen these performance problems with similarly-sized databases that have a src.zip.