repology / repology-updater

Repology backend service to update repository and package data
https://repology.org
GNU General Public License v3.0
502 stars 177 forks source link

openbsd updater needs adapting for model changes in sqlports ? #1349

Closed landryb closed 1 year ago

landryb commented 1 year ago

some recent changes made it to sqlports (cf https://github.com/openbsd/ports/commits/master/databases/sqlports), which might have broken the openbsd updater in repology according to @ajacoutot. @marcespie might know better, i'm only the messenger and havent looked at the actual changes nor potential breakage.

right now sqlports doesnt have a "stable api" but https://marc.info/?l=openbsd-ports&m=169419062712772&w=2 hints that there might be one someday.

marcespie commented 1 year ago

I've had a quick look at the parser, the DISTFILES and MASTER_SITES* fields you are using are indeed gone.

We've changed the layout in bsd.port.mk, so that we can have arbitrary DISTFILES.sufx variables, and SITES.sufx that match (also taken the opportunity to shorten the name). Accordingly, there's no longer a fixed list of MASTER_SITES names, thus it's gone from the main table.

I don't know what you use DISTFILES and MASTER_SITES for, but at a guess, you might want to use the "simpler" ROACH_URL and ROACH_SITES variables instead.

On complicated ports with multiple origins, these now should contain the main url that we fetch for the package being built, and the corresponding originating sites (ROACH_URL has all the "fun" parts of OpenBSD-style MASTER_SITES already parsed).

That should actually make the data you report more accurate, hopefully.

If you really need to, you can instead find the full data in subsidiary tables: _sites contain all the sites for each package path, N is actually the suffix used. _distfiles contains all the various files that get fetched (but the path{url}sufx transformation was not applied), where type is 0/1/2 depending whether it's a DISTFILES, PATCHFILES, or SUPDISTFILES.

AMDmi3 commented 1 year ago

Thanks for the heads up. I'll try to update the parser this week, until then I have to disable openbsd repo.

didickman commented 1 year ago

I use repology for openbsd very regularly so losing it on repology isn’t ideal. Instead of taking down the entire thing, would it be possible to just show the main master site as a quick stop gap for now? I think that will show the same as before for about 93% of ports as most ports don’t need more than one master site. Example diff here: repology_diff

AMDmi3 commented 1 year ago

I'm working on this. For now I came up with this query to return all site+distfile pairs for all ports, which can then be converted to real URLs. Does it look right? I'm particularly puzzled with _distfiles' SUFX vs N but it seems to work with ports using either.

SELECT
    _distfiles.FullPkgPath AS FullPkgPath,
    _sites.Value AS Sites,
    _fetchfiles.Value AS Files
FROM _distfiles
    JOIN _fetchfiles
        ON KeyRef=_Distfiles.Value
    JOIN _sites
        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
didickman commented 1 year ago

What does repology need exactly? The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page

The database has a table called Meta which might be useful to check for freshness and schema versioning.

sqlite> select * from Meta;
SchemaVersion  Hash                                          CreationDate
-------------  --------------------------------------------  ------------
7.48           e9fy133wnJkyctyUUC3P8MtkwdSShLQHLX+Plp6DZqo=  2023-09-17

Once that's known, the raw tables are the ones starting with underscores. So for example I see repology is using the raw tables today. Is that what you want to do?

Then there are some convenience views which merge the data in the raw tables. Those convenience views do not start with underscores.

The "Ports" view may be very useful to use. But since it's a view it's slow, so the results from the dynamic view are also stored as static data in "PortsQ" which breaks the pattern and is actually a table.

I'm not an expert on sqlports, but if you can share how you need the data for repology I can try to work it out with you. @marcespie is the real expert here. Maybe the PortsQ table is what Marc was suggesting for you to use when he mentioned using the ROACH_SITES and ROACH_URL variables? Here is an example:

sqlite> select fullpkgpath, roach_sites, ROACH_URL from portsq limit 1;
FullPkgPath        ROACH_SITES                                                  ROACH_URL
-----------------  -----------------------------------------------------------  --------------------
archivers/ancient  https://github.com/temisu/ancient/archive/refs/tags/v2.1.1/  ancient-2.1.1.tar.gz

Nevertheless if you can describe what repology needs it might be helpful for those of us on the OpenBSD side to try to help.

marcespie commented 1 year ago

On Mon, Sep 18, 2023 at 07:19:43PM -0700, Dmitry Marakasov wrote:

I'm working on this. For now I came up with this query to return all site+distfile pairs for all ports, which can then be converted to real URLs. Does it look right? I'm particularly puzzled with _distfiles' SUFX vs N but it seems to work with ports using either.

   SELECT
        _distfiles.FullPkgPath AS FullPkgPath,
        _sites.Value AS Sites,
        _fetchfiles.Value AS Files
    FROM _distfiles
        JOIN _fetchfiles
            ON KeyRef=_Distfiles.Value
        JOIN _sites
            ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)

N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.

As I mentioned MUCH EARLIER in the thread, unless you really want the full list of all files which are grabbed by the port, in the new scheme of things, THE MAIN DISTFILE for a given port will be located at ROACH_URL, grabbable from ROACH_SITE.

This is your best guess at figuring out whether a port is up-to-date (more or less), unless you get into stuff which releases as main files along with mammoth patches. Stuff like vim, for instance.

Note that this is also the data that other tools like portroach are supposed to use.

By just grabbing this info, instead of painting the shed red, you will get OpenBSD to >99% functionality again.

AMDmi3 commented 1 year ago

What does repology need exactly?

https://repology.org/docs/requirements TLDR: identifiers/names (portname, pkgname, origin), version, summary, maintainer, categories, licenses, cpe info, homepage and download urls. In the future, depends.

The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page

Thank you, but it contains only the basic info which is already clear from the database structure. It doesn't answer my question regarding SUFX vs N, neither does basic description you've given.

AMDmi3 commented 1 year ago

N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.

As I understand, that is _distfiles.Type, not N.

Back to N and SUFX, if my guess is right, this is an example of distfiles-sites linkage via SUFX:

sqlite> select * from _distfiles where fullpkgpath=4448;
FullPkgPath|Value|N|SUFX|Type
4448|18545||.github|0
4448|18546||.github|0
sqlite> select * from _sites where fullpkgpath=4448;
FullPkgPath|N|Value
4448|.github|https://github.com/
4448|.gitlab|https://gitlab.com/
4448|.srht|https://git.sr.ht/

And this is and example of linkage via N:

sqlite> select * from _distfiles where fullpkgpath=13146;
FullPkgPath|Value|N|SUFX|Type
13146|131315|||0
13146|131316|0||0
13146|131317|1||0
13146|131318|2||0
sqlite> select * from _sites where fullpkgpath=13146;
FullPkgPath|N|Value
13146||https://github.com/stumpwm/stumpwm/archive/
13146|0|https://github.com/edicl/cl-ppcre/archive/
13146|1|https://beta.quicklisp.org/archive/clx/2022-07-07/
13146|2|https://beta.quicklisp.org/archive/alexandria/2022-07-07/

coalesce as in the query above should handle both cases, am I right?

As I mentioned MUCH EARLIER in the thread, unless you really want the full list of all files which are grabbed by the port, in the new scheme of things, THE MAIN DISTFILE for a given port will be located at ROACH_URL, grabbable from ROACH_SITE.

I would have already implemented ROACH_* support much earlier if that was suitable, wouldn't I? Yes, I need the full list of all files.

marcespie commented 1 year ago

On Tue, Sep 19, 2023 at 05:13:55AM -0700, Dmitry Marakasov wrote:

N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.

As I understand, that is _distfiles.Type, not N.

Back to N and SUFX, if my guess is right, this is an example of distfiles-sites linkage via SUFX:

sqlite> select * from _distfiles where fullpkgpath=4448;
FullPkgPath|Value|N|SUFX|Type
4448|18545||.github|0
4448|18546||.github|0
sqlite> select * from _sites where fullpkgpath=4448;
FullPkgPath|N|Value
4448|.github|https://github.com/
4448|.gitlab|https://gitlab.com/
4448|.srht|https://git.sr.ht/

And this is and example of linkage via N:

sqlite> select * from _distfiles where fullpkgpath=13146;
FullPkgPath|Value|N|SUFX|Type
13146|131315|||0
13146|131316|0||0
13146|131317|1||0
13146|131318|2||0
sqlite> select * from _sites where fullpkgpath=13146;
FullPkgPath|N|Value
13146||https://github.com/stumpwm/stumpwm/archive/
13146|0|https://github.com/edicl/cl-ppcre/archive/
13146|1|https://beta.quicklisp.org/archive/clx/2022-07-07/
13146|2|https://beta.quicklisp.org/archive/alexandria/2022-07-07/

hence, coalesce in the query above should, which should handle both cases. So am I right?

Ah right, I had entirely forgotten about that part. The "N" part is transient.

It's there so that we know about stuff where we haven't converted from the old way to the new way (it still records the old DISTFILES = somefile:0 -> MASTER_SITES0 as opposed to the new DISTFILES.sufx = somefile -> SITES.sufx.

Don't spend too much time on that field, it will go away as soon as I find time to finish a full sweep of the ports tree :)

In fact, even if you "miss" out on a few files, I would advise not even trying to support it. That part of the schema will go away entirely as soon as I find the time to finish the sweep.

marcespie commented 1 year ago

As forecasted, I just finished removing MASTER_SITES0..9

So you don't need to worry about N, it's gone in sqlports 7.49

didickman commented 1 year ago

Fantastic! Thank you so much for the quick adaptation!

Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.

diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index 076ad7ac..5962b880 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -95,7 +95,6 @@ def _iter_sqlports(path: str) -> Iterator[Port]:
         yield Port(**row_dict, distfiles_cursor=distfiles_cursor)

-# TODO: drop _distfiles.N for sqlports >= 7.49
 _DISTFILES_QUERY = """
 SELECT
     _sites.Value AS sites,
@@ -104,7 +103,7 @@ FROM _distfiles
     JOIN _fetchfiles
         ON KeyRef=_Distfiles.Value
     LEFT JOIN _sites
-        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
+        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (_distfiles.SUFX is not distinct from _sites.N)
 WHERE
     _distfiles.FullPkgPath = ?
 """
marcespie commented 1 year ago

Following-up (somewhat): in the more generic case of all package distributions, we all have different rules about splitting up things and how "unique" things get.

I think I might get some students (!!!!) looking at it this year.

For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).

Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.

I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.

AMDmi3 commented 1 year ago

Fantastic! Thank you so much for the quick adaptation!

I won't count that as quick, but you're welcome)

Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.

Yesterday sqlports was still at 7.48 so couldn't drop it right away.

marcespie commented 1 year ago

On Tue, Sep 26, 2023 at 07:46:21AM -0700, Dmitry Marakasov wrote:

Fantastic! Thank you so much for the quick adaptation!

I won't count that as quick, but you're welcome)

Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.

Yesterday sqlports was still at 7.48 so couldn't drop it right away.

Depending on our snaps, you'll soon get 7.49. after all, it still takes a day to run a full bulk...

AMDmi3 commented 1 year ago

For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).

Repology primarily counts projects, which is specifically "number of supported software", so apart from a few exceptions subpackages and splits does not count. For instance, despite libxml2 is packaged

CentOS 6:

  pname  | version 
---------+---------
 libxml2 | 2.7.6

ConanCenter:

   pname  | version 
---------+---------
  libxml2 | 2.10.3
  libxml2 | 2.10.4
  libxml2 | 2.11.3
  libxml2 | 2.11.4
  libxml2 | 2.9.10
  libxml2 | 2.9.12
  libxml2 | 2.9.13
  libxml2 | 2.9.14
  libxml2 | 2.9.9

Cygwin:

          pname          | version 
------------------------+---------
  libxml2                | 2.10.4
  libxml2-debuginfo      | 2.10.4
  libxml2-devel          | 2.10.4
  libxml2-doc            | 2.10.4
  python27-libxml2       | 2.10.4
  python36-libxml2       | 2.10.4
  python37-libxml2       | 2.10.4
  python38-libxml2       | 2.10.4
  python39-libxml2       | 2.10.4
  mingw64-i686-libxml2   | 2.9.4
  mingw64-x86_64-libxml2 | 2.9.4

it's counted once for each repo. Of course there are discrepancies, and you can estimate the error rate by percentage of "unique" projects (not matched with any other repository), which is usually below 15%, or by percentage of unique projects which have "related" ones (with matches by url, which mean they can potentialy be merged with something) which is usually below 10%.

Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.

I don't have a problem with it. There's minor problem with a lot of flavors which Repology cannot deduplicate yet (thus showing a bunch of similar entries on e.g. package versions page), but that does not affect statistics and statuses and will be fixed on Repology side someday.

I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.

As said, I don't feel much need to. From what I can see, most unmatched projects in OpenBSD are truly unique or have ambiguous names and have unmatched counterparts in other repos, so there's no OpenBSD specific problems here. Still, it can be improved somewhat by submitting rules to repology or renaming packages in OpenBSD where that makes sense. For instance, I see some potentially misnamed python modules, like devel/py-test-expect may be better called devel/py-pytest-expect to match PyPI.

Depending on our snaps, you'll soon get 7.49. after all, it still takes a day to run a full bulk...

Yes, it's already there and I've deployed updated parser.

marcespie commented 1 year ago

Renaming packages is a bitch. We got a mechanism for that, but it is fairly expensive on the updater-side.

AMDmi3 commented 1 year ago

Well, just suggesting, it's not fatal as it is.

didickman commented 1 year ago

Looking at some false positives on OpenBSD, I noticed repology says security/ghidra is out of date.

However that port is marked BROKEN on all archs right now. So it is not built.

Therefore does the below change make sense? It filters out ports that are broken on ALL archs.

This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD. (in fact the security issues were noted as part of the reason it was disabled. see: marc.info

This diff purposely does not filter out ports that are only broken on SOME archs.

diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index e88fcc33..6357d297 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -58,6 +58,8 @@ FROM _Ports
         ON Categories_ordered.FullPkgpath=_Ports.FullPkgpath
     JOIN _Email
         ON _Email.KeyRef=MAINTAINER
+WHERE
+    _Ports.FullPkgPath NOT IN (SELECT DISTINCT FullPkgPath FROM _Broken where Arch IS NULL)
 """
didickman commented 1 year ago

Another possible false positive.

I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability.

Blindly guessing I see that we have CVE-2023-40217 which states that some affected versions include "[3.10.0, 3.10.13)". To my knowledge a square bracket on the right would mean version <= 3.10.13, while a parenthesis means version < 3.10.13. i.e. strictly less than version 3.10.13.

Is it possible repology isn't marking the affected versions correctly? Or did I miss a vulnerability that's in Python 3.10.13 on my side?

AMDmi3 commented 1 year ago

This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD

This makes sense, but I don't feel good about hiding information. Repology's target audience is not only maintainers but also upstream, and visibility of broken packages is especially important for these. Also there are maintainers from other repositories who may find even disabled package useful. Showing it as outdated is also a valid call for action. The point that its vulnerability does not in fact affect users because the package is not available is valid though, and it would be fair to at least exclude it from count of vulnerable packages. This requires a bit of development though (#1352).

I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability

You can click an exclamation mark to get a page which highlights CVEs matched with a given version: https://repology.org/project/python/cves?version=3.10.13

marcespie commented 1 year ago

On Wed, Sep 27, 2023 at 02:53:34PM -0700, Dmitry Marakasov wrote:

This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD

This makes sense, but I don't feel good about hiding information. Repology's target audience is not only maintainers but also upstream, and visibility of broken packages is especially important for these. Also there are maintainers from other repositories who may find even disabled package useful. Showing it as outdated is also a valid call for action. The point that its vulnerability does not in fact affect users because the package is not available is valid though, and it would be fair to at least exclude it from count of vulnerable packages. This requires a bit of development though (#1352).

Our ghidra is totally out-of-date. There are few OpenBSD developers who care about updating java software.

I have zero idea about the status of our patches to make ghidra run, whether they were committed to upstream (apparently there's at least an issue).

From what I can see, it looks like somewhat poorly architectured software that actually wants lists and lists of OSes instead of doing somewhat OO stuff... well, not that I'm surprised considering the usual verbosity of java.

Anyhow, if someone is interested enough in running a recent ghidra on OpenBSD, they're welcome to try and update the port and submit patches.