repology / repology-updater

Repology backend service to update repository and package data
https://repology.org
GNU General Public License v3.0

Hackage repository is not being updated #1385

Closed utdemir closed 6 months ago

utdemir commented 7 months ago

I released version 0.4.0 of nix-tree on Hackage a while ago, but it hasn't shown up on Repology: https://repology.org/project/haskell:nix-tree/versions

Looking at the Hackage page on Repology, I see that most of the graphs are flat, which, if I understand correctly, means that the entire repo is not updating.

I am not sure if this repo is the best place to put this, so let me know and I'm happy to move this issue elsewhere.

AMDmi3 commented 7 months ago

Yes, parsing is broken: https://repology.org/log/3820044. Someone needs to write a proper parser for the cabal format.

utdemir commented 7 months ago

Thanks for your reply.

> a proper parser for cabal format.

That sounds challenging :). It's a weird format.

I am seeing a Python implementation here, but I think it would be too much work to replicate the entire format with it. Do you think there's a way we can reuse the existing Cabal implementation on the Haskell side? Something akin to building a cabal-to-json executable with Haskell, and calling it from the Python side?
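
A minimal sketch of what calling such a converter from Python could look like. The `cabal-to-json` executable is hypothetical (no such tool is confirmed in this thread), as are the function name `cabal_to_dict` and the assumption that the tool prints one JSON object to stdout:

```python
import json
import subprocess
from typing import Any, Sequence


def cabal_to_dict(
    path: str,
    command: Sequence[str] = ("cabal-to-json",),  # hypothetical executable
) -> dict[str, Any]:
    """Run an external converter on one .cabal file and decode its JSON output."""
    result = subprocess.run(
        [*command, path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

The cost of this design is a Haskell build dependency and one process launch per file, which may matter for a repository with thousands of packages.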

AMDmi3 commented 7 months ago

> Something akin to building a cabal-to-json executable with Haskell, and calling it from Python side?

I strongly prefer to avoid that.

> That sounds challenging :). It's a weird format.

That shouldn't be too challenging as long as there's a spec/PEG somewhere. I wrote that parser on a whim and assumed that the format is indentation-based, but it turns out not to be, and we're currently failing on that. So is there a spec?

sellout commented 6 months ago

> I wrote that parser on a whim and assumed that format is indentation based, while it turns out not to be and we're currently failing on that. So is there a spec?

I don’t think there is a spec, unfortunately. But the format is indentation-based.

However, I think the specific parsing failures are secondary. The parser seems to mostly work, with some (or at least one) packages hitting edge cases. Failing to parse a single cabal file shouldn’t be a fatal failure of the Hackage parser – it should be able to continue and have a successful run, with some packages left un-updated.

Perhaps a failure threshold (possibly shared across all parsers) would be worthwhile – if a certain percentage of packages are failing to parse, give up because likely some format changed and the entire parser needs a real update.
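
A minimal sketch of the threshold idea, assuming a per-file parse function is available. The names `parse_all`, `parse_one`, `TooManyFailures` and the 1% default are illustrative assumptions, not part of repology-updater:

```python
from typing import Callable, Iterable


class TooManyFailures(Exception):
    """Raised when the share of unparsable files exceeds the threshold."""


def parse_all(
    paths: Iterable[str],
    parse_one: Callable[[str], dict],  # hypothetical per-file parser
    max_failure_ratio: float = 0.01,   # assumed default, purely illustrative
) -> tuple[list[dict], list[tuple[str, Exception]]]:
    """Parse every file, collecting failures instead of stopping at the first."""
    packages: list[dict] = []
    failures: list[tuple[str, Exception]] = []
    total = 0
    for path in paths:
        total += 1
        try:
            packages.append(parse_one(path))
        except Exception as exc:  # deliberately broad: any per-file error is contained
            failures.append((path, exc))
    # Give up only when failures are widespread, which suggests a format change
    if total and len(failures) / total > max_failure_ratio:
        raise TooManyFailures(f"{len(failures)} of {total} files failed to parse")
    return packages, failures
```

The broad `except Exception` is what would keep an unexpected error such as a `KeyError` in one file from aborting the whole run.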

sellout commented 6 months ago

Since the previously posted log has expired (and I’m not sure it failed in the same way), here’s the latest one: https://repology.org/log/4428851. The tail of that file:

2024-04-23 08:10:54   AspectAG/0.7.0.1/AspectAG.cabal: ERROR: link: "www.fing.edu.uy/~jpgarcia/AspectAG" does not look like an URL (schema missing)
2024-04-23 08:10:58   exception-hierarchy/0.1.0.11/exception-hierarchy.cabal: ERROR: link: "yet" does not look like an URL (schema missing)
2024-04-23 08:10:58   hw-string-parse/0.0.0.5/hw-string-parse.cabal: ERROR: parsing failed (fatal): KeyError: 'name'
2024-04-23 08:10:58 ERROR: KeyError: 'name'

There seem to be a number of non-fatal errors, where the Hackage parser continues happily to the next Cabal file, but then the failure of hw-string-parse escapes the normal failure handling and causes the entire Hackage parse to fail. Unfortunately, my Python is quite weak, so it’s not immediately obvious to me where or how to catch the failure that’s escaping.

AMDmi3 commented 6 months ago

It fails on the last mentioned cabal file:

 cabal-version: 2.2

name:                   hw-string-parse
version:                0.0.0.5
x-revision: 2
synopsis:               String parser
description:            Please see README.md
category:               Data, Bit
stability:              Experimental
homepage:               http://github.com/haskell-works/hw-string-parse#readme
bug-reports:            https://github.com/haskell-works/hw-string-parse/issues
author:                 John Ky
maintainer:             newhoggy@gmail.com
copyright:              2016-2021 John Ky
license:                BSD-3-Clause
license-file:           LICENSE
tested-with:            GHC == 9.2.2, GHC == 9.0.2, GHC == 8.10.7, GHC == 8.8.4, GHC == 8.6.5
build-type:             Simple
extra-source-files:     README.md

The parser assumes an indentation-based format, and the indentation in this file is broken.

sellout commented 6 months ago

Yeah, that looks like a bug in the Cabal file. The Cabal docs don’t allow whitespace before cabal-version. I’m not sure what Cabal itself does with that.

  1. Discards the field as part of a missing stanza and parses the file as Cabal 1.1 (or whatever version was before cabal-version was required)?
  2. Parses a bit more liberally than the ABNF allows, and reads the intended cabal-version?

In either case, that doesn’t look like an issue with Repology’s Hackage parser, so then the only issue here is that the failure of that Cabal file isn’t contained, but leads to the failure of the entire Hackage parser.
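
The specific problem in this file (whitespace before the first field) is easy to detect up front. A minimal sketch, where `first_field_is_indented` is a hypothetical helper and not something Cabal or Repology provides:

```python
def first_field_is_indented(cabal_text: str) -> bool:
    """Return True if the first non-blank, non-comment line starts with whitespace."""
    for line in cabal_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            continue  # skip blank lines and Cabal "--" comments
        return line[0] in " \t"
    return False  # the file has no field lines at all
```

A check like this could be used to log a targeted warning for the offending package before the real parser runs.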

AMDmi3 commented 6 months ago

> issue here is that the failure of that Cabal file isn’t contained, but leads to the failure of the entire Hackage parser.

This behavior is intentional.

sellout commented 6 months ago

> This behavior is intentional.

Why is it intentional? It seems like allowing a single broken package to prevent the other 17k packages from being updated is a bit unbalanced.

I understand that Repology would want to be made aware of failures and attempt to correct them, but the logs already contain that data, whether or not the parser actually fails. I see that packages can be “ignored” in some way – would that be the right way to avoid this? (Just asking out of curiosity – I opened an issue against hw-string-parse (linked above), so I’m hoping this particular case can be resolved quickly enough.)

As an aside – I have subscribed to the atom feeds for the stuff I maintain, I wonder if there’s an atom feed for a particular repo’s failures (and warnings). I would happily subscribe to the one for Hackage so I can be proactive about PRs to fix issues.

I know maintaining something like this (and dealing with the tickets) is a lot of work. I’m curious about this one in particular because I would be inclined to submit a PR myself, but clearly the solution I had in mind is not one that would be accepted.

sellout commented 6 months ago

With the fix of haskell-works/hw-string-parse#42, the Hackage parser is happy again. But I’m still interested in making it less fragile.

AMDmi3 commented 6 months ago

> I understand that Repology would want to be made aware of failures and attempt to correct them, but the logs already contain that data, whether or not the parser actually fails.

Repology wants to provide consistent data, and skipping some packages ruins consistency. Also, you cannot expect anyone to go looking through logs for a missing package, since the absence itself would never be noticed.
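
The all-or-nothing policy described here can be sketched as follows. This is an illustration under assumptions, not Repology's actual code; `refresh_repository` and `parse_one` are hypothetical names:

```python
from typing import Callable, Iterable


def refresh_repository(
    previous: list[dict],
    paths: Iterable[str],
    parse_one: Callable[[str], dict],  # hypothetical per-file parser
) -> tuple[list[dict], bool]:
    """Return (snapshot, updated): the new snapshot only if every file parsed."""
    fresh: list[dict] = []
    for path in paths:
        try:
            fresh.append(parse_one(path))
        except Exception:
            # One failure invalidates the whole run; keep the last good snapshot
            return previous, False
    return fresh, True
```

The trade-off is the one debated in this thread: the published data never silently loses a package, but one broken file leaves the whole repository stale until it is fixed.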

> I see that packages can be “ignored” in some way – would that be the right way to avoid this?

That has nothing to do with parsing; it affects version comparison.

> As an aside – I have subscribed to the atom feeds for the stuff I maintain, I wonder if there’s an atom feed for a particular repo’s failures (and warnings). I would happily subscribe to the one for Hackage so I can be proactive about PRs to fix issues.

There are parsing logs and problems (there is a per-maintainer problems view as well), which are somewhat similar but not integrated; neither provides feeds.

> With the fix of https://github.com/haskell-works/hw-string-parse/issues/42, the Hackage parser is happy again

Great work, thank you!