probonopd / linuxdeployqt

Makes Linux applications self-contained by copying in the libraries and plugins that the application uses, and optionally generates an AppImage. Can be used for Qt and other applications
Other
2.22k stars 414 forks source link

Improve copyCopyrightFile #231

Open TheAssassin opened 6 years ago

TheAssassin commented 6 years ago

While reviewing linuxdeployqt, I could confirm that copyCopyrightFile() is one of the performance bottlenecks (takes 4-5 seconds, measuring the time of a few calls while watching linuxdeployqt's output with -verbose=3).

Now, to understand my idea, I'd like to quickly summarize how copyCopyrightFile() works. The function is called on every file copying process, so it should be as efficient as possible, given the amount of files that need to be copied into an AppDir.

The function first performs some, let's say, "platform checks", i.e., whether dpkg and dpkg-query are available on the system. These checks are performed on every call. I am not sure how much time those calls consume, but I'd say we can safely assume that if they're unavailable on the first call, this wouldn't change during the rest of the runtime, and vice versa. Therefore, I'd only perform these checks on the first call, and would then skip them (pretty much like "lazy initialization", I'd add a global static variable to keep this state).

Next, the app queries the package that contains the file we're looking for a copyright file for, and then lists that package's contents (a list of paths). It then filters this lists with a list of given patterns (which I don't know about how it was compiled, @probonopd please add a comment about it, as far as I can see, it's been either you looking into a bunch of packets and where they stored this kind of data, or there's a Debian maintainer standard or document somewhere on the internet specifying those locations). These files are then copied into the AppDir.

I didn't get to profiling this with a sufficient granularity to see which of these calls is causing the performance bottleneck on my computer, but I can definitely blame copyCopyrightFile(). Since RAM is cheap nowadays, I want to suggest a method that optimizes subsequent calls to copyCopyrightFile() when files have been packaged in the same packages. I would like to save all paths of packages whose copyright files have been copied already in a suitable data structure (i.e., the files returned by the second query to dpkg). Whenever a file is contained in this data structure already, we can safely assume that its copyright files have been copied already.

Of course, that will increase the memory footprint a bit, but I could imagine that this saves quite some amount of calls to dpkg(-query). For the beginning, a simple sorted list combined with a binary search should do, allowing for quick lookups. We could further improve this by sorting the data e.g., into a tree, simulating the file system hierarchy, although I could imagine that the files we copy are in similar directories, hence requiring a binary search in the leafs, too, so I don't think that such a tree would have any positive impact on the lookup durations.

If you want some numbers comparing before and after results, I can start with some benchmarking, to e able to see whether there's any impact on the overall performance.

And no, I own an SSD, it's probably not an IO bottleneck on my end causing dpkg(-query) to be this slow.

probonopd commented 6 years ago

You are right, the code is not optimized for speed in any way. What is interesting is that you seem to experience major slowdowns while I didn't notice these. But anyway, you are right this could be improved. As could the overall performance of linuxdeployqt, since we are processing the same files over and over again. I think this is where the major performance bottleneck is. Thing is, I don't have the time to optimize this, since the end result is working at the moment, I can live with it.

If someone wants to really improve performance, I'd suggest to open a new branch and essentially do a major refactoring/rewrite, ensuring that each file that is deployed into the AppDir is only operated on once. But please don't attempt to do this in the main branch, which is following the macdeployqt program logic.

As for the regex, I think I found it by trial and error.

TheAssassin commented 6 years ago

I could imagine a system like routing algorithms use them. Basically, you maintain two lists of "nodes" in a map. One contains the nodes that need to be regarded, the other one the which have been "visited" already (and hence don't need to be worked on any more). This is pretty efficient in my experience, and I could imagine implementing such a system here, too. However, we really need to refactor the code for this. Maybe I can work on something in one the the next sprints?

TheAssassin commented 6 years ago

As for the regex, I think I found it by trial and error.

What regex are you talking about?

probonopd commented 6 years ago

Regex: Must have had the wrong ticket in mind; nevermind.

denis-shienkov commented 6 years ago

Hi guys. This spent ~15 minuts for me for one application(please see #248). But my project contains more than one application... So, common time will be increased up to 1-2 hour, as minimum.. It is very very bad.. ((

denis-shienkov commented 6 years ago

Is macdeployqt soo slow too?

TheAssassin commented 6 years ago

@denis-shienkov the tools can't really be compared. All I can tell is macdeployqt doesn't use this copyright copying stuff. Also, it's a Debian/Ubuntu specific feature.

denis-shienkov commented 6 years ago

))))))

denis-shienkov commented 6 years ago

oops, double shot )))

denis-shienkov commented 6 years ago

Many thanks, this new '-no-copy-copyright-files' option helped, it decreases time from ~15 minutes up to ~15 seconds (in my case).

probonopd commented 6 years ago

Wow. I wonder why the impact of this can go from virtually non-existent to huuuuge. Are you running your builds on a system that has been set up long time ago, and has seen many apt-get operations? (I am trying to find out why the impact is so large on your system but virtually negligible on Travis CI.)

TheAssassin commented 6 years ago

CC @azubieta. I know you maintain your own patched linuxdeployqt because of these issues. Please verify whether you can use the upstream linuxdeployqt with this flag instead.

TheAssassin commented 6 years ago

@probonopd I already told you it's dpkg-query being super slow...

probonopd commented 6 years ago

...and be sure to fulfill all requirements of licenses such as the GPL...

probonopd commented 6 years ago

I already told you it's dpkg-query being super slow...

...on some systems but apparently not on others (=the one I happen to use)

TheAssassin commented 6 years ago

The one you use is a live ISO without any customizations, no PPAs, and running from RAM.

We need to fix the architectural issues first in order to fix issues like these. According to you, linuxdeployqt touches the same file more than once, etc...

probonopd commented 6 years ago

Correct