thkukuk / rpm2docserv

Extract manual pages and HTML documentation from RPMs and build static web pages from them
Apache License 2.0
10 stars 4 forks

Optimize speed when gathering RPMs #7

Open thkukuk opened 1 year ago

thkukuk commented 1 year ago

Going through all RPMs and looking for manual pages takes a very long time; we should try to optimize that.

hellcp commented 1 year ago

You can parse the filelists.xml.gz file that's published with every repository under /repodata (for example in the Tumbleweed OSS repo). You may want to parse the repomd.xml file first to get the location of the current version of the filelists. That should be way faster than downloading RPMs when you aren't sure they contain any manpages.
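
A minimal Go sketch of the repomd.xml lookup described above. It assumes the usual rpm-md layout, where each `<data>` element carries a `type` attribute and a `<location href="...">` child. The type and function names (`repomd`, `filelistsHref`) and the sample document with its shortened file name are made up for illustration and are not part of rpm2docserv.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// repomd mirrors only the parts of repomd.xml needed to locate a metadata file.
type repomd struct {
	Data []struct {
		Type     string `xml:"type,attr"`
		Location struct {
			Href string `xml:"href,attr"`
		} `xml:"location"`
	} `xml:"data"`
}

// filelistsHref returns the repository-relative path of the current
// filelists metadata, as advertised by repomd.xml.
func filelistsHref(repomdXML []byte) (string, error) {
	var md repomd
	if err := xml.Unmarshal(repomdXML, &md); err != nil {
		return "", fmt.Errorf("parsing repomd.xml: %w", err)
	}
	for _, d := range md.Data {
		if d.Type == "filelists" {
			return d.Location.Href, nil
		}
	}
	return "", fmt.Errorf("repomd.xml has no filelists entry")
}

// sampleRepomd is a shortened, made-up repomd.xml; real files carry
// checksums, sizes and timestamps, and use a full checksum in the file name.
const sampleRepomd = `<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/0123abcd-primary.xml.gz"/>
  </data>
  <data type="filelists">
    <location href="repodata/0123abcd-filelists.xml.gz"/>
  </data>
</repomd>`

func main() {
	href, err := filelistsHref([]byte(sampleRepomd))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(href)
}
```

The returned path is relative to the repository root, so it would be joined with the repository URL or local directory before opening the file.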

thkukuk commented 1 year ago

Parsing repo data only works if you can use a full product repo, as with openSUSE Tumbleweed. It does not work if you want to build, for example, a MicroOS container with the manual pages, because there zypper is needed to resolve the dependencies and fetch the correct RPMs. In that case the only input is a directory of RPMs without matching metadata.

We can handle this as two different cases: use the repo data if it exists, otherwise fall back to building the file lists on our own. That still means that somebody has to implement a repomd parser in golang. I couldn't find anything in this direction (the idea of parsing the repomd data is not new).
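
A rough Go sketch of this two-case handling, under some assumptions: a directory counts as a full repository if it contains `repodata/repomd.xml`, the fallback runs `rpm -qlp` once per RPM file, and a man page is any path below `/usr/share/man/`. The helper names (`hasRepodata`, `isManPage`, `listFilesFallback`) are hypothetical and not taken from rpm2docserv.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// hasRepodata reports whether dir contains rpm-md metadata.
func hasRepodata(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, "repodata", "repomd.xml"))
	return err == nil
}

// isManPage is a simple heuristic (an assumption of this sketch):
// everything below /usr/share/man/ is treated as a manual page.
func isManPage(path string) bool {
	return strings.HasPrefix(path, "/usr/share/man/")
}

// listFilesFallback runs "rpm -qlp" on a single RPM file and returns
// the listed paths, one per output line.
func listFilesFallback(rpmPath string) ([]string, error) {
	out, err := exec.Command("rpm", "-qlp", rpmPath).Output()
	if err != nil {
		return nil, fmt.Errorf("rpm -qlp %s: %w", rpmPath, err)
	}
	var files []string
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		files = append(files, sc.Text())
	}
	return files, sc.Err()
}

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	if hasRepodata(dir) {
		fmt.Println("repodata found: read the file lists from the metadata")
		return
	}
	fmt.Println("no repodata: falling back to rpm -qlp per package")
	rpms, err := filepath.Glob(filepath.Join(dir, "*.rpm"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range rpms {
		files, err := listFilesFallback(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		for _, f := range files {
			if isManPage(f) {
				fmt.Println(p, "ships manual pages")
				break
			}
		}
	}
}
```

Only the fallback branch pays the cost of one `rpm` process per package; the metadata branch would read a single file instead.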

By the way, in cases like openSUSE Tumbleweed we don't download RPMs; we have local access to a full repository. But being able to use the repomd data would still save us a huge number of "rpm -qlp ..." calls.

hellcp commented 1 year ago

You could use the resolver to spit out the package list, then parse the filelist to filter out the packages you certainly do not need, I guess. No idea if downloading is that much of a bottleneck here.