Is module DFS really needed?

ivanperez-keera commented 1 year ago

The module DFS implements graph algorithms. From that module, only the functions out and t_close are used externally (in NFA.hs). The functions list_tree, mk_graph, edges, rev_edges, reverse_graph, scc, top_sort and dff, and the type Edge, are not used.

However, there's something else that strikes me from that module, and points to a potential cleaning opportunity:

The top comment reads "This module is a portable version of [...] [an] encoding of [...] linear graph algorithms. This module uses balanced binary trees instead of mutable arrays to implement the depth-first search so the complexity of the algorithms is n.log(n) instead of linear." However, isn't linear better than n times log(n)??? Why go for the latter if you can opt for the former?
There is an implementation of graphs in Data.Graph in containers (and containers is already a dependency of alex), although I haven't checked if that was there already in the earliest version of GHC supported by alex, which I think is GHC 7 (?). The two functions used externally, out and t_close, could be defined, if I understand their purpose correctly, as:

out :: Graph -> Int -> [Int]
out = Data.Array.(!)

t_close :: Graph -> Graph
t_close g = buildG (bounds g) [ (v1, v2) | v1 <- vertices g, v2 <- reachable g v1 ]

List comprehensions are generally slower than map so there's probably a faster way of building t_close if performance is an issue here.

My questions are:

Should this module be replaced by a use of Data.Graph?
If not, should the unused functions mentioned above be removed?

andreasabel commented 1 year ago

Wise man said: Never change a running system. So, I am a bit skeptic.

Since this code has been subjected to 30 years of testing in practice, it is likely bug-free. So, what incentive is there to change it?
If one could demonstrate a performance gain using some other implementation, via realistic benchmarks, this would be an argument to change things.

Concerning Data.Graph, does it have an efficient transitive-closure algorithm? The one you suggest looks like it would not have the best performance (it is not the Warshall algorithm).

[ (v1, v2) | v1 <- vertices g, v2 <- reachable g v1 ]

Concerning removal of unused functions: Wouldn't this make the little graph library DFS less complete? Unused functions will get removed by the linker automatically, wouldn't they?

ivanperez-keera commented 1 year ago

Since this code has been subjected to 30 years of testing in practice, it is likely bug-free. So, what incentive is there to change it?

Reducing the lines of code in a system is always a good incentive. And since we are relying on a standard Haskell library, we can make similar assertions about the quality of the code we are relying on based on usage (this module was there in containers-0.1). Note that both modules are based on the same idea (both cite the Launchbury/King paper).

Concerning Data.Graph, does it have an efficient transitive-closure algorithm?

The function reachable is advertised as 𝑂(𝑉+𝐸), and traversing over vertices is 𝑂(𝑉), so the overall function should be doable in 𝑂(𝑉+𝐸) * 𝑂(𝑉) with a concatMap. Floyd-Warshall is 𝑂(𝑉^3) .

Note that if we cannot implement this in comparable or better time than Sort's for similarly sparse/dense graphs, it suggests that this implementation might be faster than Data.Graph and so containers should incorporate it.

Wouldn't this make the little graph library DFS less complete?

But will anyone ever use that? Otherwise, alex contains the perfect DFS library... that nobody will use.

Also worth noting is that only the functions that have effectively been used from the Sort module have been subjected to such long-term "tests". The ones that have not been used have been subjected to (near) zero tests. For all we know, they could be 100% wrong. Keeping them there would incorrectly signal that this code has been subjected to 30 years of testing in the wild, when maybe only 2 functions have.

And why stop there? Why not add more potentially useful (but actually unused) functions, if that makes it somehow "better"?

There's very clear criteria when it comes to the code that should be added in the system: is it needed? If not, then it's out. (And we may even be able to tell whether any of it was ever used, demonstrating even more the "practical uselessness" of keeping it around.)

Unused functions will get removed by the linker automatically, wouldn't they?

They should (that also takes compilation time, btw, even if little). But the executable size is not the biggest concern here.

A bigger concern is that many maintenance and QA actions are directly and strongly related to the number of lines of code in a system. More lines, more code, more cost. More to clean. More to test. More to maintain. More to adjust and adapt. More to understand when you are new to a system. More to document. More to read to figure out if / where / how it is used and why it's there if it's not used.

There is an action that reduces that cost dramatically: removing unused code.

We can help alex, happy and other systems "live" for another 30 years by making the jobs of current and future maintainers easier.

haskell / alex

Is module DFS really needed? #232