haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License
297 stars 82 forks source link

renew groupEquivStates #176

Closed damhiya closed 3 years ago

damhiya commented 3 years ago

Hi This pr introduces some optimizations

  1. use IntSet instead of [SNum] while constructing bigmap.
  2. use IM.restrictKeys for calculating xs.
  3. filter out null sets from xs. It seems most performance improvements came from this.
  4. use R and Q instead of P and Q. following pseudo code describes what it means.
    
    -- Hopcroft's Algorithm for DFA minimization (cut/pasted from Wikipedia):
    P := {{all accepting states}, {all nonaccepting states}};
    Q := {{all accepting states}};
    while (Q is not empty) do
     choose and remove a set A from Q
     for each c in ∑ do
          let X be the set of states for which a transition on c leads to a state in A
          for each set Y in P for which X ∩ Y is nonempty and Y \ X is nonempty do
               replace Y in P by the two sets X ∩ Y and Y \ X
               if Y is in Q
                    replace Y in Q by the same two sets
               else
                    add the smaller of the two sets to Q
          end;
     end;
    end;

-- observation : Q is always subset of P -- let R = P \ Q. then following algorithm is the equivalent of the Hopcroft's Algorithm R := {{all nonaccepting states}}; Q := {{all accepting states}}; while (Q is not empty) do choose a set A from Q remove A from Q and add it to R for each c in ∑ do let X be the set of states for which a transition on c leads to a state in A for each set Y in R for which X ∩ Y is nonempty and Y \ X is nonempty do replace Y in R by the greater of the two sets X ∩ Y and X \ Y add the smaller of the two sets to Q end; for each set Y in Q for which X ∩ Y is nonempty and Y \ X is nonempty do replace Y in Q by the two sets X ∩ Y and Y \ X end; end; end;


(You can also see this at comments in DFAMin.hs)
Due to this change, we don't need to iterate over `Q` to check whether `Y ∈ Q` or not.
And we can implement this efficiently using only lists.

note1 : I fixed missing condition in pseudo code at _line28_
note2 : I used 2 space indent because I couldn't catch any unified rule for indenting from source code
damhiya commented 3 years ago

I just removed restrictKeys from the code as its use causes a dependency problem while the performance gain is not that much.

Time complexity of restrictKeys is O(n+m) while time complexity of repeated use of findWithDefault is O(min(n,W)*m). In Hopcroft's algorithm, n is the number of all states and m is the number of states in each equivalence class.

Ericson2314 commented 3 years ago

Hi @damhiya. I am sorry I missed this until now. I am hoping the docs can be further clarified (a fault of the situation before, to be clear), and maybe a few other things, and then this should be ready to go.

Ericson2314 commented 3 years ago

BTW checkout https://github.com/haskell/containers/pull/616 which I need finish at some point. This filtering the empty sets indeed comes up a lot, and I hope to eventually be able to use that here.

damhiya commented 3 years ago

Notice1. While the original Hopcroft's Algorithm has only one kinds of accepting states, alex's DFA allows multiple kinds of accepting states since alex is a lexer. However, there is no documentation about this difference in the alex repository. So I didn't really get about correctness of Hopcroft's algorithm that has multiple kinds of accepting states.

Notice2. I expected to dividing P into Q and R will make significant performance improvement, but I could not see any difference in test. Indeed, the changes excepting filtering null cases from xs has almost no effect to performance. I only tested GHC's lexer, so it might require other test cases to observe the effect of new implementation.

treeowl commented 3 years ago

The documented time complexity bound is almost certainly not tight. I would bet dollars to doughnuts that restrictKeys is not worse for big O.

On Sat, Jan 30, 2021, 9:45 PM damhiya notifications@github.com wrote:

@damhiya commented on this pull request.

In src/DFAMin.hs https://github.com/simonmar/alex/pull/176#discussion_r567354641:

  • xs = filter (not . IS.null)
  • . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a])
  • $ IM.elems bigmap

Let me describe why I discouraged to use restrictKeys

Time complexity of restrictKeys is O(n+m) while time complexity of repeated use of findWithDefault is O(min(n,W)*m). In Hopcroft's algorithm, n is the number of all states and m is the number of states in each equivalence class.

n is number of all states that is pretty big. m is the number of states in each equivalence class. That means, in the end of the algorithm, m might be 1 or small number if the input DFA was already almost optimized. Then the effectual time complexity would be O(n) for restrictKeys and O(W*m) for findWithDefault. So, restrictKeys might be slower then the repeated findWithDefault.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/pull/176#discussion_r567354641, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7N24MFHMU3MLZXU7JDS4S75LANCNFSM4V3N6BRA .

damhiya commented 3 years ago

IM.restrictKeys x (IS.singleton y) have to work in O(min(n,W)) to beats competing implementation in every possible inputs. I'm not sure that's the case.

The documented time complexity bound is almost certainly not tight. I would bet dollars to doughnuts that restrictKeys is not worse for big O. On Sat, Jan 30, 2021, 9:45 PM damhiya @.> wrote: @*.** commented on this pull request. ------------------------------ In src/DFAMin.hs <#176 (comment)>: > + xs = filter (not . IS.null) + . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a]) + $ IM.elems bigmap Let me describe why I discouraged to use restrictKeys Time complexity of restrictKeys is O(n+m) while time complexity of repeated use of findWithDefault is O(min(n,W)m). In Hopcroft's algorithm, n is the number of all states and m is the number of states in each equivalence class. n is number of all states that is pretty big. m is the number of states in each equivalence class. That means, in the end of the algorithm, m might be 1 or small number if the input DFA was already almost optimized. Then the effectual time complexity would be O(n) for restrictKeys and O(Wm) for findWithDefault. So, restrictKeys might be slower then the repeated findWithDefault. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#176 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7N24MFHMU3MLZXU7JDS4S75LANCNFSM4V3N6BRA .

treeowl commented 3 years ago

I'll check. If it doesn't, that can and will be fixed.

On Sat, Jan 30, 2021, 10:10 PM damhiya notifications@github.com wrote:

IM.restrictKeys x (IS.singleton y) have to work in O(min(n,W)) to beats competing implementation in every possible inputs. I'm not sure that's the case.

The documented time complexity bound is almost certainly not tight. I would bet dollars to doughnuts that restrictKeys is not worse for big O. … <#m-2272180050457183973> On Sat, Jan 30, 2021, 9:45 PM damhiya @.*> wrote: @.* commented on this pull request. ------------------------------ In src/DFAMin.hs <#176 (comment) https://github.com/simonmar/alex/pull/176#discussion_r567354641>: > + xs = filter (not . IS.null) + . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a]) + $ IM.elems bigmap Let me describe why I discouraged to use restrictKeys Time complexity of restrictKeys is O(n+m) while time complexity of repeated use of findWithDefault is O(min(n,W)m). In Hopcroft's algorithm, n is the number of all states and m is the number of states in each equivalence class. n is number of all states that is pretty big. m is the number of states in each equivalence class. That means, in the end of the algorithm, m might be 1 or small number if the input DFA was already almost optimized. Then the effectual time complexity would be O(n) for restrictKeys and O(W*m) for findWithDefault. So, restrictKeys might be slower then the repeated findWithDefault. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#176 (comment) https://github.com/simonmar/alex/pull/176#discussion_r567354641>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7N24MFHMU3MLZXU7JDS4S75LANCNFSM4V3N6BRA .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/pull/176#issuecomment-770317861, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7LYPVTYVS3QAW63VNDS4TC4FANCNFSM4V3N6BRA .

treeowl commented 3 years ago

I believe that documentation is wrong too. Great! The real bounds are unfortunately not very beautiful or simple, but in practice restrictKeys should be better.

On Sat, Jan 30, 2021, 10:40 PM John Ericson notifications@github.com wrote:

@Ericson2314 commented on this pull request.

In src/DFAMin.hs https://github.com/simonmar/alex/pull/176#discussion_r567359496:

  • xs = filter (not . IS.null)
  • . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a])
  • $ IM.elems bigmap

I don't think that's right? Haddock says restrictKeys is O(m*log(n/m + 1)), m <= n, and findWithDefault is O(log n). So I that means O(log n) vs O(n log n) when n >> m. Intuitively this makes sense. With the repeated lookups, we have to traverse through the same intermediate nodes again and again. But with restrictKeys, we do everything in one pass.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/pull/176#discussion_r567359496, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7PXUTYXH6L2CF3O6C3S4TGJLANCNFSM4V3N6BRA .

Ericson2314 commented 3 years ago

Oh I replied before seeing the out-of-thread conversation, but yes I agree with @treeowl, see https://github.com/simonmar/alex/pull/176#discussion_r567359496. I think It boils down to repeated lookups intermediate nodes too many times, but restrict keys just doing once (well yes, returns to information from them using stack, but that has great locality).

treeowl commented 3 years ago

Right, but the IntMap one is better than those bounds would suggest.

On Sat, Jan 30, 2021, 10:57 PM damhiya notifications@github.com wrote:

@damhiya commented on this pull request.

In src/DFAMin.hs https://github.com/simonmar/alex/pull/176#discussion_r567361072:

  • xs = filter (not . IS.null)
  • . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a])
  • $ IM.elems bigmap

@Ericson2314 https://github.com/Ericson2314 You have to see IntMap.restrictKeys, not Map.restrictKeys. They have different big-O

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/pull/176#discussion_r567361072, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7JD7WOYJP6SOENC2A3S4TILPANCNFSM4V3N6BRA .

treeowl commented 3 years ago

Yes, there's definitely a shortcut in the code when either map is a singleton, and in fact that applies recursively too.

On Sat, Jan 30, 2021, 10:10 PM damhiya notifications@github.com wrote:

IM.restrictKeys x (IS.singleton y) have to work in O(min(n,W)) to beats competing implementation in every possible inputs. I'm not sure that's the case.

The documented time complexity bound is almost certainly not tight. I would bet dollars to doughnuts that restrictKeys is not worse for big O. … <#m-2272180050457183973> On Sat, Jan 30, 2021, 9:45 PM damhiya @.*> wrote: @.* commented on this pull request. ------------------------------ In src/DFAMin.hs <#176 (comment) https://github.com/simonmar/alex/pull/176#discussion_r567354641>: > + xs = filter (not . IS.null) + . map (\m -> IS.unions [IM.findWithDefault IS.empty s m | s <- IS.toList a]) + $ IM.elems bigmap Let me describe why I discouraged to use restrictKeys Time complexity of restrictKeys is O(n+m) while time complexity of repeated use of findWithDefault is O(min(n,W)m). In Hopcroft's algorithm, n is the number of all states and m is the number of states in each equivalence class. n is number of all states that is pretty big. m is the number of states in each equivalence class. That means, in the end of the algorithm, m might be 1 or small number if the input DFA was already almost optimized. Then the effectual time complexity would be O(n) for restrictKeys and O(W*m) for findWithDefault. So, restrictKeys might be slower then the repeated findWithDefault. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#176 (comment) https://github.com/simonmar/alex/pull/176#discussion_r567354641>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7N24MFHMU3MLZXU7JDS4S75LANCNFSM4V3N6BRA .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/simonmar/alex/pull/176#issuecomment-770317861, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOF7LYPVTYVS3QAW63VNDS4TC4FANCNFSM4V3N6BRA .

Ericson2314 commented 3 years ago

Oh I see. I was looking at the regular map bounds, oops.

damhiya commented 3 years ago

Hmm, It seems travis ci does not work

Please be aware travis-ci.org will be shutting down in several weeks, with all accounts migrating to travis-ci.com. Please stay tuned here for more information.

Ericson2314 commented 3 years ago

Hmm, It seems travis ci does not work

For now it works, but is very slow. I would like to switch to something else but don't have the perms.

Ericson2314 commented 3 years ago

OK, looks like that's everything. Thanks again!