Fix issue of buildInputs paths mismatching with outputPath and add propagatedBuildInputs

zz1874 commented 1 year ago

Problem

As mentioned in #20, many packages are referenced in buildInputs by a path with a suffix such as -dev, but the outputPath we extract for the same package has no suffix, so the two paths never match.

For example, the buildInputs of firefox is ['/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev'], but the outputPath of gtk+3-3.24.34 is '/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34'. When we process the firefox node, we cannot find a package whose outputPath matches this gtk+3 path, so the edge between firefox and gtk+3 is never added. Across all packages, these missing edges leave the graph much smaller than it should be.

Solution

When extracting data from nix, in addition to outputPath we now extract, for each package, the paths with all suffixes and store them as a list called outputPathAll, with their names in a list called outputNameAll.

When adding an edge to the graph in the ETL process, instead of searching for the buildInputs path in outputPath, we first get the name of the build input and then look in the outputPathAll of the packages with the same name to check whether any path matches.

Getting the name first lets us use df.query("name == @package.name") to narrow the dataframe down, so we don't need to traverse the whole dataframe to find the matching path.
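The name-first lookup described above can be sketched in plain Python. This is only an illustration, not the code in this PR: the record layout, the helper names (store_path_name, build_name_index, find_package), the dict-based index standing in for df.query, and the assumption that the -dev path quoted in the Problem section is listed in gtk+3's outputPathAll are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Optional

# Hypothetical records shaped like the extracted data. Listing the "-dev"
# path under gtk+3's outputPathAll is an assumption made for illustration.
packages = [
    {
        "name": "gtk+3-3.24.34",
        "outputPath": "/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34",
        "outputPathAll": [
            "/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34",
            "/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev",
        ],
    },
]

# A store path looks like /nix/store/<32-character hash>-<name>.
STORE_PREFIX = re.compile(r"^/nix/store/[0-9a-z]{32}-")


def store_path_name(path: str) -> str:
    """Strip the /nix/store/<hash>- prefix, keeping the name part."""
    return STORE_PREFIX.sub("", path)


def build_name_index(records: list) -> dict:
    """Group records by name so a lookup only touches same-named packages."""
    index = defaultdict(list)
    for record in records:
        index[record["name"]].append(record)
    return index


def find_package(build_input: str, index: dict) -> Optional[dict]:
    """Return the record whose outputPathAll contains build_input, if any."""
    candidate = store_path_name(build_input)
    while candidate:
        for record in index.get(candidate, []):
            if build_input in record["outputPathAll"]:
                return record
        # No match: drop the last "-segment" (e.g. "-dev") and try again.
        head, sep, _ = candidate.rpartition("-")
        if not sep:
            break
        candidate = head
    return None


index = build_name_index(packages)
firefox_build_input = "/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev"
match = find_package(firefox_build_input, index)
```

Here match is the gtk+3-3.24.34 record, so the firefox -> gtk+3 edge can be added even though the build input path differs from outputPath.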

We also added propagatedBuildInputs as another property in the database and applied the same trick when adding the edges labeled propagatedBuildInputs. (See #19)

This Gremlin query shows the dependency graph of firefox:

g.V()
.filter{it.get().value('pname').matches('firefox')}
.repeat(outE().otherV().simplePath())
.times(2)
.path()
.by('pname')
.by('label')
.limit(10)

[Image: Screenshot from 2023-02-13 18-04-17@2x]

dorranh commented 1 year ago

Thanks for preparing this! Will need to defer to @GuillaumeDesforges and/or @jeicher for the review until I'm back from leave.

zz1874 commented 1 year ago

We need to define which formatter to use. I would prefer black.

All Python symbols should use snake_case convention. Ref: https://peps.python.org/pep-0008/#descriptive-naming-styles

I think this code base could use some comments and docstrings now that it's getting bigger.

Let's use type hinting as much as possible. Tools like pyright are then very helpful.

The Python alternative to Haskell's Maybe (Scala's Option, Rust's Option) is the type union T | None, which has historically been written Optional[T]. Ref: https://docs.python.org/3/library/typing.html#typing.Optional
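As a small illustration of the type hinting and naming points above, here is a hedged sketch. The function find_output_path and its parameters are made up for this example and are not part of the code base.

```python
from typing import Optional


def find_output_path(name: str, output_paths: dict[str, str]) -> Optional[str]:
    """Return the store path recorded for `name`, or None if it is unknown.

    `Optional[str]` means the same as `str | None`; the latter spelling
    needs Python 3.10 or newer when used in annotations like this one.
    """
    return output_paths.get(name)
```

With a return type like this, a checker such as pyright can flag callers that use the result without first handling the None case.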

Nice catch!! I'll apply a formatter and make the other changes. Thanks for the tips!

zz1874 commented 1 year ago

Hey @dorranh, the PR is ready for review :rocket:

tweag / nixpkgs-graph-explorer

Fix issue of buildInputs paths mismatching with outputPath and add propagatedBuildInputs #25