kawu / dawg

Directed acyclic word graphs
BSD 2-Clause "Simplified" License
11 stars 1 forks source link

Optimization #22

Open kawu opened 11 years ago

kawu commented 11 years ago

Structures

There are functions and structures which should be optimized, if possible. Profiling shows that the most time consuming functions are related to the HashMap implementation:


COST CENTRE MODULE %time %alloc

insertUnsafe Data.DAWG.HashMap 7.8 10.0 deleteUnsafe Data.DAWG.HashMap 7.5 8.0 lookup Data.DAWG.HashMap 3.7 0.2 lookupUnsafe Data.DAWG.HashMap 3.7 0.3


Surprisingly, lookup and lookupUnsafe (which are evaluated almost the same number of times) need the same amount of time and memory. Nevertheless, if we substitute lookup for lookupUnsafe in the library code, number of (==) evaluations with respect to HashMap keys (automaton states) is greatly increased, so it seems that workload of HashMap functions is primarily related to the HashMap structure itself.

Other functions, which summarily consume a lot of resources, are from the Graph module:


decIngo Data.DAWG.Graph 5.9 11.3 remNode.nodeMap' Data.DAWG.Graph 4.8 6.0 remNode.ingoMap' Data.DAWG.Graph 4.0 6.0 newNode.nodeMap' Data.DAWG.Graph 3.6 6.9 newNode.ingoMap' Data.DAWG.Graph 3.2 6.9 newNode.(...) Data.DAWG.Graph 3.0 2.7 decIngo.k Data.DAWG.Graph 2.7 0.5 nodeBy Data.DAWG.Graph 2.6 0.3 newNode Data.DAWG.Graph 2.5 3.0 ...


Algorithms

Another optimization which comes to mind is to change the algorithm of (path, value) insertion. We can take the number of ingoing edges into account, i.e. there's no need to clone states with only one in-transition (root state, for example), we can just modify them. But if we modify the set of transitions outgoing from the particular state, hash of the state will also change (unless we fundamentally change the method of hash computation), so perhaps its better to leave this idea for another time.

apskii commented 11 years ago

It would be very cool to have faster construction (fromList/fromLang). Currently, it takes ~50 seconds to construct DAWG from 25 MB dictionary.