Closed zz1874 closed 1 year ago
Thanks for preparing this! Will need to defer to @GuillaumeDesforges and/or @jeicher for the review until I'm back from leave.
We need to define which formatter to use. I would prefer black.
All Python symbols should use snake_case convention. Ref: https://peps.python.org/pep-0008/#descriptive-naming-styles
I think this code base could use some comments and docstrings now that it's getting bigger.
Let's use type hinting as much as possible. Tools like pyright are then very helpful.
The Python alternative to Haskell's `Maybe` (Scala's `Option`, Rust's `Option`) is a type union `T | None`, which has historically been written `Optional[T]`. Ref: https://docs.python.org/3/library/typing.html#typing.Optional
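To make the two spellings concrete, here is a minimal sketch. The function names and the `versions` mapping are hypothetical and not taken from this code base; the `__future__` import is only there so the `|` syntax in annotations also works on Python versions older than 3.10.

```python
from __future__ import annotations

from typing import Optional


def find_version(name: str, versions: dict[str, str]) -> str | None:
    """Return the version recorded for `name`, or None if it is unknown."""
    # dict.get returns None when the key is missing, matching `str | None`.
    return versions.get(name)


def find_version_legacy(name: str, versions: dict[str, str]) -> Optional[str]:
    """Same contract as find_version, written with the older Optional[T] form."""
    return versions.get(name)


# Hypothetical data, for illustration only.
versions = {"black": "23.1.0"}
found = find_version("black", versions)
missing = find_version_legacy("pyright", versions)
```

A checker such as pyright treats both annotations the same way and will flag callers that use the result without first handling the `None` case.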
Nice catch!! I'll apply a formatter and do other modifications. Thanks for the tips!
Hey @dorranh, the PR is ready for review :rocket:
Problem
As mentioned in #20, there are a large number of packages whose `buildInputs` has a suffix `-dev`, but the `outputPath` we get for this package doesn't have any suffix, which causes the mismatching issue.

For example, the `buildInputs` of `firefox` is `['/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev']`, but the `outputPath` of `gtk+3-3.24.34` is `'/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34'`. When we have our node `firefox`, we cannot find the corresponding `outputPath` of `gtk+3`, therefore the edge between `firefox` and `gtk+3` will not be added, which as a result leads to a relatively small graph.

Solution
When extracting data from Nix, apart from the `outputPath`, for each package we also extract the paths with all suffixes and store them as a list called `outputPathAll`, and their names as a list called `outputNameAll`.

When adding an edge to the graph in the ETL process, instead of searching for the path of the `buildInputs` in `outputPath`, we first get the `name` of the `buildInputs` and look into the `outputPathAll` of the packages with the same `name` to see whether there is a match.

Here, we get the `name` of the `buildInputs` first and use `df.query("name == @package.name")` to narrow down the dataframe, so that we don't need to traverse the whole dataframe to find the matching path.

We also added `propagatedBuildInputs` as another property in the database and apply the same trick when adding the edges labeled `propagatedBuildInputs`. (See #19)

This is a screenshot of the Gremlin query to see the dependency graph of `firefox`.
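For reviewers, the lookup described under Solution can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: it uses plain Python objects instead of the dataframe, the `Package` class and `find_provider` function are hypothetical, the field names are snake_case versions of `outputPath` and `outputPathAll`, and it assumes the `name` of the build input is already known, since how that name is derived is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Package:
    # Hypothetical stand-in for one row of the extracted data.
    name: str
    output_path: str
    output_path_all: list[str] = field(default_factory=list)


def find_provider(
    input_name: str, input_path: str, packages: list[Package]
) -> Package | None:
    """Return the package that produces `input_path`, or None if none matches."""
    # Narrow by name first (the role df.query plays in the description above).
    candidates = [p for p in packages if p.name == input_name]
    # Then compare against every output path, not only the default one.
    for candidate in candidates:
        if input_path in candidate.output_path_all:
            return candidate
    return None


gtk = Package(
    name="gtk+3-3.24.34",
    output_path="/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34",
    output_path_all=[
        "/nix/store/2j2znigd8ak37rlwh9khz0ry3clqlw1l-gtk+3-3.24.34",
        "/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev",
    ],
)
firefox_input = "/nix/store/7h5psr5dn8lmypz2n5r3y1sq0pj2n3bh-gtk+3-3.24.34-dev"

# Old lookup: the -dev path never equals gtk's default output path, so no edge.
old_match = firefox_input == gtk.output_path
# New lookup: the -dev path is found among all of gtk's output paths.
new_match = find_provider("gtk+3-3.24.34", firefox_input, [gtk])
```

With the paths from the example above, the old comparison fails while the new lookup returns the `gtk+3` package, which is what lets the `firefox` to `gtk+3` edge be added.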