Open ivanperez-keera opened 1 year ago
Wise man said: Never change a running system. So, I am a bit skeptic.
Since this code has been subjected to 30 years of testing in practice, it is likely bug-free. So, what incentive is there to change it?
If one could demonstrate a performance gain using some other implementation, via realistic benchmarks, this would be an argument to change things.
Concerning Data.Graph
, does it have an efficient transitive-closure algorithm?
The one you suggest looks like it would not have the best performance (it is not the Warshall algorithm).
[ (v1, v2) | v1 <- vertices g, v2 <- reachable g v1 ]
Concerning removal of unused functions: Wouldn't this make the little graph library DFS
less complete?
Unused functions will get removed by the linker automatically, wouldn't they?
Since this code has been subjected to 30 years of testing in practice, it is likely bug-free. So, what incentive is there to change it?
Reducing the lines of code in a system is always a good incentive. And since we are relying on a standard Haskell library, we can make similar assertions about the quality of the code we are relying on based on usage (this module was there in containers-0.1). Note that both modules are based on the same idea (both cite the Launchbury/King paper).
Concerning Data.Graph, does it have an efficient transitive-closure algorithm?
The function reachable is advertised as π(π+πΈ), and traversing over vertices is π(π), so the overall function should be doable in π(π+πΈ) * π(π) with a concatMap
. Floyd-Warshall is π(π^3) .
Note that if we cannot implement this in comparable or better time than Sort
's for similarly sparse/dense graphs, it suggests that this implementation might be faster than Data.Graph
and so containers
should incorporate it.
Wouldn't this make the little graph library DFS less complete?
But will anyone ever use that? Otherwise, alex contains the perfect DFS library... that nobody will use.
Also worth noting is that only the functions that have effectively been used from the Sort
module have been subjected to such long-term "tests". The ones that have not been used have been subjected to (near) zero tests. For all we know, they could be 100% wrong. Keeping them there would incorrectly signal that this code has been subjected to 30 years of testing in the wild, when maybe only 2 functions have.
And why stop there? Why not add more potentially useful (but actually unused) functions, if that makes it somehow "better"?
There's very clear criteria when it comes to the code that should be added in the system: is it needed? If not, then it's out. (And we may even be able to tell whether any of it was ever used, demonstrating even more the "practical uselessness" of keeping it around.)
Unused functions will get removed by the linker automatically, wouldn't they?
They should (that also takes compilation time, btw, even if little). But the executable size is not the biggest concern here.
A bigger concern is that many maintenance and QA actions are directly and strongly related to the number of lines of code in a system. More lines, more code, more cost. More to clean. More to test. More to maintain. More to adjust and adapt. More to understand when you are new to a system. More to document. More to read to figure out if / where / how it is used and why it's there if it's not used.
There is an action that reduces that cost dramatically: removing unused code.
We can help alex, happy and other systems "live" for another 30 years by making the jobs of current and future maintainers easier.
The module
DFS
implements graph algorithms. From that module, only the functionsout
andt_close
are used externally (in NFA.hs). The functionslist_tree
,mk_graph
,edges
,rev_edges
,reverse_graph
,scc
,top_sort
anddff
, and the typeEdge
, are not used.However, there's something else that strikes me from that module, and points to a potential cleaning opportunity:
The top comment reads "This module is a portable version of [...] [an] encoding of [...] linear graph algorithms. This module uses balanced binary trees instead of mutable arrays to implement the depth-first search so the complexity of the algorithms is n.log(n) instead of linear." However, isn't linear better than n times log(n)??? Why go for the latter if you can opt for the former?
There is an implementation of graphs in
Data.Graph
in containers (and containers is already a dependency of alex), although I haven't checked if that was there already in the earliest version of GHC supported by alex, which I think is GHC 7 (?). The two functions used externally,out
andt_close
, could be defined, if I understand their purpose correctly, as:List comprehensions are generally slower than map so there's probably a faster way of building
t_close
if performance is an issue here.My questions are:
Data.Graph
?