haskell / pvp

Haskell Package Version Policy (PVP)
http://pvp.haskell.org/

Statistics ideas for master's thesis #63

Open jonkri opened 1 month ago

jonkri commented 1 month ago

I'm about to do a master's thesis in Software Engineering. I would like to apply (Bayesian) statistics and, ideally, conduct some kind of experiment. I posted a message on Haskell-Cafe about it yesterday. I have also asked the Hackage administrator to see if I could have access to the Hackage metadata.

I was wondering if you have any suggestions for statistical questions that I could look into that would be of interest from a PVP point of view, for example some kind of analysis related to dependencies or breakages.

Thanks!

hasufell commented 1 month ago

I think it would be interesting to know:

All the things I proposed kind of require also having an understanding of the package's API, not just the metadata.

I'm not sure that's within your scope. But it can be done statically.

jonkri commented 1 month ago

Very interesting! Thank you, @hasufell!

I wonder what would be a good way of determining the API of packages. 🤔 Could GHCi's :browse/:browse! command suffice, perhaps? Or would I need to dig deeper, perhaps getting into parsing .hi files?

ulysses4ever commented 1 month ago

@jonkri at the Cabal project, we are looking into API checking to ensure PVP compliance, based on https://github.com/Kleidukos/print-api. This package is in early development, so be warned :-) There are downsides to it (but they're probably inherent to any tool based on the GHC API), which you can read about here: https://github.com/haskell/cabal/pull/10259

jonkri commented 1 week ago

@ulysses4ever: Thanks for letting me know about print-api!

Since I'm interested in analyzing API changes over time, I wonder how far back print-api and Cabal could go.

For example, do you think it would be possible to use modern Cabal to build old packages such as OpenGL-2.1 from 2006 (assuming C headers could be provided for FFI)?

$ cabal update hackage.haskell.org,2006-11-02T14:21:52Z
$ cabal get OpenGL-2.1
$ cd OpenGL-2.1
$ cabal install
Warning: Requested index-state 2006-11-02T14:21:52Z is newer than
'hackage.haskell.org'! Falling back to older state (2006-11-02T14:21:40Z).
Error: cabal: Could not resolve dependencies:
[__0] trying: OpenGL-2.1 (user goal)
[__1] next goal: OpenGL:setup.Cabal (dependency of OpenGL)
[__1] rejecting: OpenGL:setup.Cabal-3.10.3.0/installed-3.10.3.0 (conflict:
OpenGL => OpenGL:setup.Cabal>=1.0 && <1.25)
[__1] fail (backjumping, conflict set: OpenGL, OpenGL:setup.Cabal)
After searching the rest of the dependency tree exhaustively, these were the
goals I've had most trouble fulfilling: OpenGL, OpenGL:setup.Cabal

I'm not sure what “OpenGL:setup.Cabal>=1.0 && <1.25” means. Is that a constraint on the version of Cabal?

fgaz commented 1 week ago

Yes, that's a constraint on the version of Cabal (the library) used to build the setup script (Setup.hs). It isn't specified in the OpenGL .cabal file, so it defaults to <1.25.

You can use a newer cabal (the tool) to build packages whose setup script requires an older Cabal, up to a point (depending on GHC). If you use a newer index-state (probably because of the OpenGL revision, but I'm not sure how exactly), you get a better error that includes this line:

constraint from minimum version of Cabal used by Setup.hs requires >=3.12

So that package is past the cabal-install+GHC → Cabal compatibility window.

The compatibility window is determined by a common lower bound of 1.20 plus a bound based on the GHC version you are using.

  -- GHC 8.2   needs  Cabal >= 2.0
  -- GHC 8.0   needs  Cabal >= 1.24

So you might be able to build the package by using GHC<8.2.

...or you could try to allow a newer Cabal with --allow-newer=OpenGL.setup:Cabal.
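The compatibility window described above can be sketched as a small check. This is an illustrative Python sketch, not cabal-install's actual logic: the table holds only the two GHC rows quoted above, the names `GHC_MIN_CABAL` and `setup_cabal_window` are made up for this example, and versions are modelled as tuples of integers.

```python
# Illustrative only: mirrors the two table rows quoted in this thread,
# not the full table used by cabal-install.
COMMON_LOWER_BOUND = (1, 20)
GHC_MIN_CABAL = {
    (8, 2): (2, 0),   # GHC 8.2 needs Cabal >= 2.0
    (8, 0): (1, 24),  # GHC 8.0 needs Cabal >= 1.24
}
# Implicit setup constraint (Cabal < 1.25) for packages that do not
# specify one, as in the OpenGL-2.1 case above.
DEFAULT_SETUP_UPPER = (1, 25)


def setup_cabal_window(ghc, upper=DEFAULT_SETUP_UPPER):
    """Return (lower, upper) for usable Cabal versions, or None if empty.

    The lower bound is inclusive and the upper bound is exclusive.
    """
    lower = max(COMMON_LOWER_BOUND, GHC_MIN_CABAL[ghc])
    return (lower, upper) if lower < upper else None


print(setup_cabal_window((8, 0)))  # ((1, 24), (1, 25))
print(setup_cabal_window((8, 2)))  # None
```

Under these assumptions, GHC 8.0 leaves a narrow window (Cabal 1.24.x), while GHC 8.2 leaves none, which matches the suggestion to try GHC<8.2.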

gbaz commented 1 week ago

There is some prior art on some statistics in this old IFL paper. The field is sometimes called "empirical software engineering" or more specifically "mining software repositories" -- would be nice to have continued work in this regard:

https://ifl2014.github.io/submissions/ifl2014_submission_14.pdf

There is also a paper on stackage which is interesting as well: https://arxiv.org/abs/2310.10887

ulysses4ever commented 1 week ago

@jonkri software archeology is hard in general, and Haskell doesn't make it much easier. For one particular package, you could probably put in some effort and build it from way back: you'll probably have to set up older GHCs, as noted above, and for some of those you'll need an older GLIBC, and, for that, perhaps, an older OS altogether. Having native dependencies (like with OpenGL) makes it harder. Finally, doing it at scale (e.g. tens or hundreds or thousands of packages): I doubt it's feasible. Again, this is a hard problem in general. Depending on your goals, you may be better off picking a more syntactic approach that wouldn't require you to compile code before you can analyze it.

jonkri commented 1 week ago

Thanks, @gbaz and @ulysses4ever!

In order to identify API changes, do you think it would make sense to crawl the Haddock pages on Hackage, or Hoogle databases, and identify the package APIs from there? 🤔

I'm currently running the following naive script to collect print-api types for all LTS 22.40 packages (including previous versions) that build successfully with GHC 9.6.6 from a haskell:9.6.6 Docker image. $1 is the package and $2 is the version.

#!/bin/bash
# $1 is the package name and $2 is the version.
set -e

echo "Fetching upload time for $1-$2..."
curl -s "https://hackage.haskell.org/package/$1-$2/upload-time" > "/out/$1-$2-upload-time"
# Make sure the file ends with a newline.
sed -e '$a\' -i "/out/$1-$2-upload-time"

upload_time=$(cat "/out/$1-$2-upload-time")

echo "Downloading Cabal index for $upload_time..."
cabal update "hackage.haskell.org,$upload_time"

echo "Extracting $1-$2..."
# package.tar.gz is expected to already be in the working directory.
mkdir /package
tar -C /package -f package.tar.gz --strip-components=1 -xz
cd /package

echo "Building $1-$2..."
cabal build --write-ghc-environment-files=always

echo "Extracting API from $1-$2..."
print-api -p "$1" > "/out/$1-$2-api"

So far I've basically gotten through all packages starting with an upper case letter and gotten around 300 print-api specifications.

Edit: Mentioned Hoogle databases.
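Once two API dumps of the same package exist, a naive line-based comparison could give a first signal of what kind of PVP bump a change calls for. The Python sketch below is only that, a sketch: it assumes each non-empty line of a dump describes one exported entity, which may not match print-api's real output format, and the function name `classify_api_change` and the sample signatures are made up for illustration.

```python
def classify_api_change(old_api, new_api):
    """Compare two API dumps line by line (assumed: one entity per line).

    Returns "major" if anything was removed or changed, "minor" if
    entities were only added, and "none" if the dumps are equivalent.
    """
    old_lines = {line.strip() for line in old_api.splitlines() if line.strip()}
    new_lines = {line.strip() for line in new_api.splitlines() if line.strip()}
    if old_lines - new_lines:
        # A changed signature shows up as one removal plus one addition.
        return "major"
    if new_lines - old_lines:
        return "minor"
    return "none"


# Hypothetical dumps, not real print-api output.
old = "foo :: Int -> Int\nbar :: String"
print(classify_api_change(old, old + "\nbaz :: Bool"))              # minor
print(classify_api_change(old, "foo :: Integer -> Int\nbar :: String"))  # major
```

A textual comparison like this ignores details such as instances and re-exports, so it would at best be a starting point.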

jonkri commented 6 days ago

Looking at the Hackage package index now (in parallel to what's being discussed above). Is there anything from the Cabal metadata that you would be interested in knowing? For example, would it be interesting to know to what extent (upper or lower) bounds are used in build-depends?
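As a rough illustration of the bounds question, the Python sketch below classifies a single version range string by whether it contains a lower and/or an upper bound. It is a simplification under stated assumptions: it only looks at which comparison operators occur, so ranges combined with `||` or other richer Cabal syntax may be misclassified, and the name `classify_bounds` is made up for this example.

```python
import re

# Operators recognised by this simplified sketch. "^>=" and "==" are
# treated as implying both a lower and an upper bound.
_OPERATOR = re.compile(r"\^>=|>=|<=|==|>|<")


def classify_bounds(constraint):
    """Return (has_lower, has_upper) for one build-depends version range."""
    ops = _OPERATOR.findall(constraint)
    has_lower = any(op in (">=", ">", "==", "^>=") for op in ops)
    has_upper = any(op in ("<", "<=", "==", "^>=") for op in ops)
    return has_lower, has_upper


print(classify_bounds(">=1.0 && <1.25"))  # (True, True)
print(classify_bounds(">=4.7"))           # (True, False)
print(classify_bounds(""))                # (False, False)
```

Counting these pairs over all dependencies in the index would give the share of dependencies with no bounds, only lower bounds, and so on.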

jonkri commented 6 days ago

@gbaz: Thanks again for the papers, they were an interesting read! Regarding the first paper, did you have anything specific in mind when you said it would be nice to have continued work in this regard?

jonkri commented 5 days ago

@hasufell:

come up with some vague estimations about man-hours spent on updating one's package for one dependency (major bump) and then calculate the total amount of man-hours wasted in the entire ecosystem per, say, year (bonus points if you include GHC)

Perhaps this isn't what you meant, but I'm wondering if work related to breakages really should be seen as waste. I'm thinking that packages breaking to some extent can be seen as a natural evolution of a healthy and innovative package ecosystem, and that it's a balance. Either way, I guess it could be useful to have an estimate of the time it takes to fix a breakage. I guess the most straightforward way to measure this would be to measure the time until a package version which fixes the breakage is released.
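The measurement suggested above could look roughly like this in Python. It is a sketch under assumptions: upload times are taken to be UTC strings in the same format as the index-state timestamps shown earlier in this thread, the function names are made up, and the example timestamps are invented rather than taken from Hackage.

```python
from datetime import datetime, timezone


def parse_upload_time(text):
    """Parse a timestamp such as '2006-11-02T14:21:52Z' as UTC."""
    parsed = datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def adaptation_delay(dependency_release, dependent_release):
    """Time between a dependency's release and the dependent's follow-up."""
    return parse_upload_time(dependent_release) - parse_upload_time(dependency_release)


# Invented timestamps, for illustration only.
delay = adaptation_delay("2024-01-01T00:00:00Z", "2024-01-15T12:00:00Z")
print(delay)  # 14 days, 12:00:00
```

The hard part is deciding which later release counts as the one that adapts to the dependency; the sketch leaves that pairing to the caller.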

jonkri commented 5 days ago

Here are the research questions I've considered so far:

Edit: “Broken” replaced with “outdated” and “fix” replaced with “adapt to”.

phadej commented 5 days ago

How long do packages remain broken when there is a breaking change in their dependencies? How much work is required to fix breaking changes in dependencies?

I don't like the use of "broken" in this context (and "fix"). That implies that the downstream developer made some mistake. They didn't; they cannot predict the future.

Use "outdated" and "update"/"adapt":

How long do packages remain outdated when there is a breaking change in their dependencies? How much work is required to adapt to a breaking change in dependencies?

tomjaguarpaw commented 5 days ago

I don't think it implies that the downstream developer made some mistake. If I'm hit by a car whose driver violated the speed limit and my leg is broken as a result, that doesn't imply I made a mistake. It implies that something damaged me indeed, and the task may lie with me to improve the situation, but not as a result of my fault.

jonkri commented 5 days ago

Thanks, @phadej! I updated the questions.

Here's another question:

For example, I'm thinking it could be useful if package maintainers could be notified when something breaks.

phadej commented 5 days ago

I don't think it implies that the downstream developer made some mistake. If I'm hit by a car whose driver violated the speed limit and my leg is broken as a result, that doesn't imply I made a mistake. It implies that something damaged me indeed, and the task may lie with me to improve the situation, but not as a result of my fault.

That's an interesting example.

If I'm hit by a car

That's a related language issue. Some consider it blame removal (the car driver was just sitting in the car), removing agency from the person behind the steering wheel. Google the topic.

While I agree that "Pedestrian was killed by a car" is acceptable in an informal discussion, it's a bad news headline. Similarly, if someone is writing a thesis, they should take care to use as good language as they can.

So, please don't imply any extra blame on downstream maintainers who have bounds according to the agreed version policy. If someone opens a ticket on a library saying "Your library is broken: there are restrictive upper bounds, and there is a compilation error if the bounds are relaxed", my knee-jerk reaction may be "It's not broken. I didn't say that it supports GHC-9.10; in fact I said that it doesn't (or at least that I don't know), that's why the bounds are there", and to close the ticket as invalid.

IMHO a better approach is to open a ticket with positive language, "Add support for GHC-9.10" etc. That's a feature request, not a bug report.

jonkri commented 5 days ago

  • what do people use the 4th and 5th etc. version components for

@hasufell: What's the 5th version component? Do you mean version tags, perhaps? If so, they are not supported anymore.

hasufell commented 5 days ago

What's the 5th version component?

I don't know. That's the question. PVP does not specify it.

See the spec:

A package version number SHOULD have the form A.B.C, and MAY optionally have any number of additional components

It is not limited to 4 components.
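To look at how additional components are used in practice, version strings could be parsed and summarised as in the Python sketch below. It reads the spec's A.B.C form with A.B as the major version and C as the minor version, and treats any change beyond that as patch-level. The function names are made up, the sample versions are invented, and versions with fewer than three components are compared as written rather than padded.

```python
from collections import Counter


def parse_version(text):
    """Turn '1.2.3.4' into (1, 2, 3, 4)."""
    return tuple(int(part) for part in text.split("."))


def classify_bump(old, new):
    """Classify a version change as 'major', 'minor', or 'patch' (A.B.C reading)."""
    o, n = parse_version(old), parse_version(new)
    if o[:2] != n[:2]:
        return "major"   # A.B changed
    if o[2:3] != n[2:3]:
        return "minor"   # C changed
    return "patch"       # only later components changed


def component_counts(versions):
    """Count how many versions have 1, 2, 3, ... components."""
    return Counter(len(parse_version(v)) for v in versions)


print(classify_bump("1.2.3", "1.3.0"))      # major
print(classify_bump("1.2.3.0", "1.2.3.1"))  # patch
# Invented sample versions, for illustration only.
print(component_counts(["1.2.3", "0.5.1.0", "2.0.0.1.1"]))
```

Run over the whole index, `component_counts` would show how common versions with five or more components actually are.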