Closed landryb closed 1 year ago
I've had a quick look at the parser, the DISTFILES and MASTER_SITES* fields you are using are indeed gone.
We've changed the layout in bsd.port.mk, so that we can have arbitrary DISTFILES.sufx variables, and SITES.sufx that match (also taken the opportunity to shorten the name). Accordingly, there's no longer a fixed list of MASTER_SITES names, thus it's gone from the main table.
I don't know what you use DISTFILES and MASTER_SITES for, but at a guess, you might want to use the "simpler" ROACH_URL and ROACH_SITES variables instead.
On complicated ports with multiple origins, these now should contain the main url that we fetch for the package being built, and the corresponding originating sites (ROACH_URL has all the "fun" parts of OpenBSD-style MASTER_SITES already parsed).
That should actually make the data you report more accurate, hopefully.
If you really need to, you can instead find the full data in subsidiary tables: _sites contain all the sites for each package path, N is actually the suffix used. _distfiles contains all the various files that get fetched (but the path{url}sufx transformation was not applied), where type is 0/1/2 depending whether it's a DISTFILES, PATCHFILES, or SUPDISTFILES.
Thanks for the heads up. I'll try to update the parser this week, until then I have to disable openbsd repo.
I use repology for openbsd very regularly so losing it on repology isn’t ideal. Instead of taking down the entire thing, would it be possible to just show the main master site as a quick stop gap for now? I think that will show the same as before for about 93% of ports as most ports don’t need more than one master site. Example diff here: repology_diff
I'm working on this. For now I came up with this query to return all site+distfile pairs for all ports, which can then be converted to real URLs. Does it look right? I'm particularly puzzled with _distfiles' SUFX
vs N
but it seems to work with ports using either.
SELECT
_distfiles.FullPkgPath AS FullPkgPath,
_sites.Value AS Sites,
_fetchfiles.Value AS Files
FROM _distfiles
JOIN _fetchfiles
ON KeyRef=_Distfiles.Value
JOIN _sites
ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
What does repology need exactly? The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page
The database has a table called Meta which might be useful to check for freshness and schema versioning.
sqlite> select * from Meta;
SchemaVersion Hash CreationDate
------------- -------------------------------------------- ------------
7.48 e9fy133wnJkyctyUUC3P8MtkwdSShLQHLX+Plp6DZqo= 2023-09-17
Once that's known, the raw tables are the ones starting with underscores. So for example I see repology is using the raw tables today. Is that what you want to do?
Then there are some convenience views which merge the data in the raw tables. Those convenience views do not start with underscores.
The "Ports" view may be very useful to use. But since it's a view it's slow, so the results from the dynamic view are also stored as static data in "PortsQ" which breaks the pattern and is actually a table.
I'm not an expert on sqlports, but if you can share how you need the data for repology I can try to work it out with you. @marcespie is the real expert here. Maybe the PortsQ table is what Marc was suggesting for you to use when he mentioned using the ROACH_SITES and ROACH_URL variables? Here is an example:
sqlite> select fullpkgpath, roach_sites, ROACH_URL from portsq limit 1;
FullPkgPath ROACH_SITES ROACH_URL
----------------- ----------------------------------------------------------- --------------------
archivers/ancient https://github.com/temisu/ancient/archive/refs/tags/v2.1.1/ ancient-2.1.1.tar.gz
Nevertheless if you can describe what repology needs it might be helpful for those of us on the OpenBSD side to try to help.
On Mon, Sep 18, 2023 at 07:19:43PM -0700, Dmitry Marakasov wrote:
I'm working on this. For now I came up with this query to return all site+distfile pairs for all ports, which can then be converted to real URLs. Does it look right? I'm particularly puzzled with _distfiles'
SUFX
vsN
but it seems to work with ports using either.SELECT _distfiles.FullPkgPath AS FullPkgPath, _sites.Value AS Sites, _fetchfiles.Value AS Files FROM _distfiles JOIN _fetchfiles ON KeyRef=_Distfiles.Value JOIN _sites ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.
As I mentioned MUCH EARLIER in the thread, unless you really want the full list of all files which are grabbed by the port, in the new scheme of things, THE MAIN DISTFILE for a given port will be located at ROACH_URL, grabbable from ROACH_SITE.
This is your best guess at figuring out whether a port is up-to-date (more or less), unless you get into stuff which releases as main files along with mammoth patches. Stuff like vim, for instance.
Note that this is also the data that other tools like portroach are supposed to use.
By just grabbing this info, instead of painting the shed red, you will get OpenBSD to >99% functionality again.
What does repology need exactly?
https://repology.org/docs/requirements TLDR: identifiers/names (portname, pkgname, origin), version, summary, maintainer, categories, licenses, cpe info, homepage and download urls. In the future, depends.
The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page
Thank you, but it contains only the basic info which is already clear from the database structure. It doesn't answer my question regarding SUFX
vs N
, neither does basic description you've given.
N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.
As I understand, that is _distfiles.Type
, not N
.
Back to N
and SUFX
, if my guess is right, this is an example of distfiles-sites linkage via SUFX
:
sqlite> select * from _distfiles where fullpkgpath=4448;
FullPkgPath|Value|N|SUFX|Type
4448|18545||.github|0
4448|18546||.github|0
sqlite> select * from _sites where fullpkgpath=4448;
FullPkgPath|N|Value
4448|.github|https://github.com/
4448|.gitlab|https://gitlab.com/
4448|.srht|https://git.sr.ht/
And this is and example of linkage via N
:
sqlite> select * from _distfiles where fullpkgpath=13146;
FullPkgPath|Value|N|SUFX|Type
13146|131315|||0
13146|131316|0||0
13146|131317|1||0
13146|131318|2||0
sqlite> select * from _sites where fullpkgpath=13146;
FullPkgPath|N|Value
13146||https://github.com/stumpwm/stumpwm/archive/
13146|0|https://github.com/edicl/cl-ppcre/archive/
13146|1|https://beta.quicklisp.org/archive/clx/2022-07-07/
13146|2|https://beta.quicklisp.org/archive/alexandria/2022-07-07/
coalesce
as in the query above should handle both cases, am I right?
As I mentioned MUCH EARLIER in the thread, unless you really want the full list of all files which are grabbed by the port, in the new scheme of things, THE MAIN DISTFILE for a given port will be located at ROACH_URL, grabbable from ROACH_SITE.
I would have already implemented ROACH_*
support much earlier if that was suitable, wouldn't I?
Yes, I need the full list of all files.
On Tue, Sep 19, 2023 at 05:13:55AM -0700, Dmitry Marakasov wrote:
N has to do with the distfiles table containing DISTFILES, SUPDISTFILES and PATCHFILES.
As I understand, that is
_distfiles.Type
, notN
.Back to
N
andSUFX
, if my guess is right, this is an example of distfiles-sites linkage viaSUFX
:sqlite> select * from _distfiles where fullpkgpath=4448; FullPkgPath|Value|N|SUFX|Type 4448|18545||.github|0 4448|18546||.github|0 sqlite> select * from _sites where fullpkgpath=4448; FullPkgPath|N|Value 4448|.github|https://github.com/ 4448|.gitlab|https://gitlab.com/ 4448|.srht|https://git.sr.ht/
And this is and example of linkage via
N
:sqlite> select * from _distfiles where fullpkgpath=13146; FullPkgPath|Value|N|SUFX|Type 13146|131315|||0 13146|131316|0||0 13146|131317|1||0 13146|131318|2||0 sqlite> select * from _sites where fullpkgpath=13146; FullPkgPath|N|Value 13146||https://github.com/stumpwm/stumpwm/archive/ 13146|0|https://github.com/edicl/cl-ppcre/archive/ 13146|1|https://beta.quicklisp.org/archive/clx/2022-07-07/ 13146|2|https://beta.quicklisp.org/archive/alexandria/2022-07-07/
hence,
coalesce
in the query above should, which should handle both cases. So am I right?
Ah right, I had entirely forgotten about that part. The "N" part is transient.
It's there so that we know about stuff where we haven't converted from the old way to the new way (it still records the old DISTFILES = somefile:0 -> MASTER_SITES0 as opposed to the new DISTFILES.sufx = somefile -> SITES.sufx.
Don't spend too much time on that field, it will go away as soon as I find time to finish a full sweep of the ports tree :)
In fact, even if you "miss" out on a few files, I would advise not even trying to support it. That part of the schema will go away entirely as soon as I find the time to finish the sweep.
As forecasted, I just finished removing MASTER_SITES0..9
So you don't need to worry about N, it's gone in sqlports 7.49
Fantastic! Thank you so much for the quick adaptation!
Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.
diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index 076ad7ac..5962b880 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -95,7 +95,6 @@ def _iter_sqlports(path: str) -> Iterator[Port]:
yield Port(**row_dict, distfiles_cursor=distfiles_cursor)
-# TODO: drop _distfiles.N for sqlports >= 7.49
_DISTFILES_QUERY = """
SELECT
_sites.Value AS sites,
@@ -104,7 +103,7 @@ FROM _distfiles
JOIN _fetchfiles
ON KeyRef=_Distfiles.Value
LEFT JOIN _sites
- ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
+ ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (_distfiles.SUFX is not distinct from _sites.N)
WHERE
_distfiles.FullPkgPath = ?
"""
Following-up (somewhat): in the more generic case of all package distributions, we all have different rules about splitting up things and how "unique" things get.
I think I might get some students (!!!!) looking at it this year.
For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).
Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.
I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.
Fantastic! Thank you so much for the quick adaptation!
I won't count that as quick, but you're welcome)
Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.
Yesterday sqlports was still at 7.48 so couldn't drop it right away.
On Tue, Sep 26, 2023 at 07:46:21AM -0700, Dmitry Marakasov wrote:
Fantastic! Thank you so much for the quick adaptation!
I won't count that as quick, but you're welcome)
Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.
Yesterday sqlports was still at 7.48 so couldn't drop it right away.
Depending on our snaps, you'll soon get 7.49. after all, it still takes a day to run a full bulk...
For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).
Repology primarily counts projects, which is specifically "number of supported software", so apart from a few exceptions subpackages and splits does not count. For instance, despite libxml2
is packaged
CentOS 6:
pname | version
---------+---------
libxml2 | 2.7.6
ConanCenter:
pname | version
---------+---------
libxml2 | 2.10.3
libxml2 | 2.10.4
libxml2 | 2.11.3
libxml2 | 2.11.4
libxml2 | 2.9.10
libxml2 | 2.9.12
libxml2 | 2.9.13
libxml2 | 2.9.14
libxml2 | 2.9.9
Cygwin:
pname | version
------------------------+---------
libxml2 | 2.10.4
libxml2-debuginfo | 2.10.4
libxml2-devel | 2.10.4
libxml2-doc | 2.10.4
python27-libxml2 | 2.10.4
python36-libxml2 | 2.10.4
python37-libxml2 | 2.10.4
python38-libxml2 | 2.10.4
python39-libxml2 | 2.10.4
mingw64-i686-libxml2 | 2.9.4
mingw64-x86_64-libxml2 | 2.9.4
it's counted once for each repo. Of course there are discrepancies, and you can estimate the error rate by percentage of "unique" projects (not matched with any other repository), which is usually below 15%, or by percentage of unique projects which have "related" ones (with matches by url, which mean they can potentialy be merged with something) which is usually below 10%.
Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.
I don't have a problem with it. There's minor problem with a lot of flavors which Repology cannot deduplicate yet (thus showing a bunch of similar entries on e.g. package versions page), but that does not affect statistics and statuses and will be fixed on Repology side someday.
I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.
As said, I don't feel much need to. From what I can see, most unmatched projects in OpenBSD are truly unique or have ambiguous names and have unmatched counterparts in other repos, so there's no OpenBSD specific problems here. Still, it can be improved somewhat by submitting rules to repology or renaming packages in OpenBSD where that makes sense. For instance, I see some potentially misnamed python modules, like devel/py-test-expect
may be better called devel/py-pytest-expect
to match PyPI.
Depending on our snaps, you'll soon get 7.49. after all, it still takes a day to run a full bulk...
Yes, it's already there and I've deployed updated parser.
Renaming packages is a bitch. We got a mechanism for that, but it is fairly expensive on the updater-side.
Well, just suggesting, it's not fatal as it is.
Looking at some false positives on OpenBSD, I noticed repology says security/ghidra is out of date.
However that port is marked BROKEN on all archs right now. So it is not built.
Therefore does the below change make sense? It filters out ports that are broken on ALL archs.
This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD. (in fact the security issues were noted as part of the reason it was disabled. see: marc.info
This diff purposely does not filter out ports that are only broken on SOME archs.
diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index e88fcc33..6357d297 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -58,6 +58,8 @@ FROM _Ports
ON Categories_ordered.FullPkgpath=_Ports.FullPkgpath
JOIN _Email
ON _Email.KeyRef=MAINTAINER
+WHERE
+ _Ports.FullPkgPath NOT IN (SELECT DISTINCT FullPkgPath FROM _Broken where Arch IS NULL)
"""
Another possible false positive.
I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability.
Blindly guessing I see that we have CVE-2023-40217 which states that some affected versions include "[3.10.0, 3.10.13)". To my knowledge a square bracket on the right would mean version <= 3.10.13, while a parenthesis means version < 3.10.13. i.e. strictly less than version 3.10.13.
Is it possible repology isn't marking the affected versions correctly? Or did I miss a vulnerability that's in Python 3.10.13 on my side?
This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD
This makes sense, but I don't feel good about hiding information. Repology's target audience is not only maintainers but also upstream, and visibility of broken packages is especially important for these. Also there are maintainers from other repositories who may find even disabled package useful. Showing it as outdated is also a valid call for action. The point that its vulnerability does not in fact affect users because the package is not available is valid though, and it would be fair to at least exclude it from count of vulnerable packages. This requires a bit of development though (#1352).
I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability
You can click an exclamation mark to get a page which highlights CVEs matched with a given version: https://repology.org/project/python/cves?version=3.10.13
On Wed, Sep 27, 2023 at 02:53:34PM -0700, Dmitry Marakasov wrote:
This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD
This makes sense, but I don't feel good about hiding information. Repology's target audience is not only maintainers but also upstream, and visibility of broken packages is especially important for these. Also there are maintainers from other repositories who may find even disabled package useful. Showing it as outdated is also a valid call for action. The point that its vulnerability does not in fact affect users because the package is not available is valid though, and it would be fair to at least exclude it from count of vulnerable packages. This requires a bit of development though (#1352).
Our ghidra is totally out-of-date. There are few OpenBSD developers who care about updating java software.
I have zero idea about the status of our patches to make ghidra run, whether they were committed to upstream (apparently there's at least an issue).
From what I can see, it looks like somewhat poorly architectured software that actually wants lists and lists of OSes instead of doing somewhat OO stuff... well, not that I'm surprised considering the usual verbosity of java.
Anyhow, if someone is interested enough in running a recent ghidra on OpenBSD, they're welcome to try and update the port and submit patches.
some recent changes made it to sqlports (cf https://github.com/openbsd/ports/commits/master/databases/sqlports), which might have broken the openbsd updater in repology according to @ajacoutot. @marcespie might know better, i'm only the messenger and havent looked at the actual changes nor potential breakage.
right now sqlports doesnt have a "stable api" but https://marc.info/?l=openbsd-ports&m=169419062712772&w=2 hints that there might be one someday.