probablykasper / cpc

Text calculator with support for units and conversion
https://crates.io/crates/cpc
MIT License
121 stars · 14 forks

Rewrite the lexer to properly handle all multi-word units #16

Closed djmattyg007 closed 3 years ago

djmattyg007 commented 3 years ago

After my last message on #7 I somewhat rewrote the code from scratch and managed to solve my problem AND implement full support for all multi-word units. The only reason I've marked this PR as a draft is because I haven't had a chance to write a proper description for this PR. It's almost midnight here and I've been working on this all day, and I'm very tired :stuck_out_tongue: :sleeping:

That said, as far as I can tell this Just Works, and includes a bunch of tests too. I haven't added tests for every single unit, but there's a lot there and it's easy to add more. The tests should be pretty comprehensive too.

Oh, one more thing before I forget: before the rewrite, light seconds were being parsed as light years. I've fixed that up too.

djmattyg007 commented 3 years ago

Another note to self: I'm pretty sure is_alphabetic_extended is no longer used, and should be cleaned up.

djmattyg007 commented 3 years ago

Yet another note to self: there should be some tests for lexing unit conversion operators like in and to.

probablykasper commented 3 years ago

Nice!

I tried measuring performance difference with this:

cargo run --release -- '1 kilometer + 2 light year - 2 light year - 2 light year - 2 light year - 2 light year' -v

Lexing time went from 29-30µs to 36-38µs (+25%). I'm guessing this is because of the graphemes parsing?

djmattyg007 commented 3 years ago

Lexing time went from 29-30µs to 36-38µs (+25%). I'm guessing this is because of the graphemes parsing?

Yep, that's most likely. Fortunately the absolute difference is small enough that it hopefully won't matter much in practical terms.

If I were to pass $2000000000..., currently cpc would read $ and then return an error. With this, I think the lexer goes through the whole string first to parse the graphemes, and then starts reading the $ to return the error.

The code I wrote relies on having random access to the entire string. There are a few places where we need to conditionally advance the pointer based on what the next word actually is. Some examples include parsing of "watt hours", "pounds-force" and "newton-metres".

One of the challenges I faced originally was the inability to get the current value from the iterator. It was only possible to get the next value. By switching to a vector, I made the code a lot simpler to understand.

The handling of words like "watt" is a little bit of a weakness. In the specific example of "watt", if we determine that the next word isn't "hours" (or one of its variants), we essentially throw away the parsing effort for that next word. The next iteration of the main outside loop has to re-parse it. We could potentially work around these problems if we felt it was important enough, but my instinct says it would make the code more difficult to understand.
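To make the random-access idea concrete, here's a hypothetical sketch (illustrative names, not cpc's actual code) of lexing over a `Vec` of words: because the whole input is collected up front, the lexer can inspect `words[i + 1]` without consuming it, and only advance `i` when the lookahead matches a multi-word unit. Splitting on whitespace here is a simplification of cpc's grapheme-based splitting.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Watt,
    WattHour,
    Word(String), // catch-all for this sketch
}

fn lex(input: &str) -> Vec<Token> {
    // Collect everything up front so we get random access.
    let words: Vec<&str> = input.split_whitespace().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < words.len() {
        match words[i] {
            "watt" | "watts" => {
                // Look ahead without consuming; advance `i` only on a match.
                if matches!(
                    words.get(i + 1),
                    Some(&"hr") | Some(&"hrs") | Some(&"hour") | Some(&"hours")
                ) {
                    tokens.push(Token::WattHour);
                    i += 1; // consume the extra word
                } else {
                    tokens.push(Token::Watt); // lookahead word is NOT consumed
                }
            }
            other => tokens.push(Token::Word(other.to_string())),
        }
        i += 1;
    }
    tokens
}
```

Because the non-matching lookahead word is never consumed, it gets handled normally on the next loop iteration rather than being thrown away.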

djmattyg007 commented 3 years ago

I will write up a proper description for this PR later today, as I'm starting work now.

Another note to self: there should be some tests for expressions with parentheses.

probablykasper commented 3 years ago

Right, because watt is also valid by itself. What if it's parsed like this:

let tokens = [];

fn parse_token(word) {
  match word {
    "watt" => {
      match read_next_word() {
        "hour" => tokens.push(WattHour),
        other => {
          tokens.push(Watt);
          parse_token(other);
        },
      }
    },
  }
}

while word = read_next_word() {
  parse_token(word);
}

probablykasper commented 3 years ago

Having a crack at implementing what I mentioned, seems possible so far.

One of the challenges I faced originally was the inability to get the current value from the iterator

I'm getting around that by making an iterator peekable, then using .peek() to check the "current" value, then .next() to move along
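The peek-then-consume pattern might look roughly like this (assumed helper name `read_word`, not cpc's actual code): `.peek()` inspects the "current" character without advancing, and `.next()` is only called once we know the character belongs to the word.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Read one alphabetic word, leaving the iterator positioned at the
/// first character that is not part of the word.
fn read_word(chars: &mut Peekable<Chars<'_>>) -> String {
    // Skip leading whitespace without committing to anything else.
    while chars.peek().map_or(false, |c| c.is_whitespace()) {
        chars.next();
    }
    let mut word = String::new();
    // Peek first; only .next() once the char is known to belong to the word.
    while let Some(&c) = chars.peek() {
        if c.is_alphabetic() {
            word.push(c);
            chars.next();
        } else {
            break;
        }
    }
    word
}
```

The nice property is that a character which terminates a word (like `+`) is never consumed, so the caller can peek at it next.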

probablykasper commented 3 years ago

One minor issue I'm having is handling units like pound-force. I read the word pound, and then peek and see a -. To know if that should be minus or pound-force, I need to check if the next character is f, but to do that I first need to consume - (can't peek twice). I'm seeing 3 solutions:

  1. Clone the iterator (not sure about the implications)
  2. Add Minus token from deep within the word parser
  3. Later on turn tokens Unit(Pound) + Operator(Minus) + LexerKeyword(Force) into Unit(PoundForce). Would result in pound - force also being parsed as pound-force
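Option 3 could be sketched as a small post-processing pass over the token stream (illustrative token names here, not cpc's exact enums):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Pound,
    PoundForce,
    Minus,
    Force,
}

/// Rewrite the triple Pound, Minus, Force into PoundForce.
/// Side effect: a spaced-out "pound - force" also merges, since the
/// lexer has already discarded whitespace by this point.
fn merge_compound_units(tokens: &[Token]) -> Vec<Token> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] == Token::Pound
            && tokens.get(i + 1) == Some(&Token::Minus)
            && tokens.get(i + 2) == Some(&Token::Force)
        {
            out.push(Token::PoundForce);
            i += 3; // consume all three tokens
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}
```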

probablykasper commented 3 years ago

Committed my implementation so far here: https://github.com/probablykasper/cpc/tree/lexer-multiword (missing newton and pound lexing).

It does look nice and clean, but it doubled the lexing time, yikes

djmattyg007 commented 3 years ago

A little bit of code review:

let mut word = "".to_string();

Can this be replaced with the following?

let mut word = String::new();

The multi-word form of "light seconds" is still returning LightYear tokens:

"sec" | "secs" | "second" | "seconds" => Token::Unit(LightYear),

A lot of the "second word" match blocks have this:

string => return Err(format!("Invalid string: {}", string)),

Shouldn't it be matching _ rather than string?

djmattyg007 commented 3 years ago

It does look nice and clean, but it doubled the lexing time, yikes

Do you have any ideas for why this might be? At first glance it doesn't look like that should happen. Maybe it's to do with all the additional function calls? My implementation only uses a couple of closures - I'm guessing they're handled differently.

Do you have any ideas for how to handle newton and pound lexing with this model? They were the biggest reason for me abandoning the iterator approach.

djmattyg007 commented 3 years ago

I just took another look at the implementation of "watt hours":

    "watt" => {
      match read_word(chars).as_str() {
        "hr" | "hrs" | "hour" | "hours" => Token::Unit(WattHour),
        _ => Token::Unit(Watt)
      }
    }

How does this behave when the second word isn't "hour"? Doesn't it result in the second word being thrown away?

djmattyg007 commented 3 years ago

I just pulled down your branch to test, and got the following output:

❯ cargo build --release --locked
   Compiling serde v1.0.125
   Compiling bitflags v1.2.1
   Compiling libc v0.2.93
   Compiling cc v1.0.67
   Compiling ord_subset v3.1.1
   Compiling rustc-serialize v0.3.24
   Compiling unicode-segmentation v1.8.0
   Compiling decimal v2.1.0
   Compiling cpc v1.6.0 (/home/djmattyg007/repos/rust/cpc)
    Finished release [optimized] target(s) in 9.61s

❯ target/release/cpc '12 watt plus 23 watt'
1: 1
1: w
1:  
1: 2
1: w
Parsing error: Expected end of input, found Number(23) at 2

❯ target/release/cpc '12 watt + 23 watt'
1: 1
1: w
1: +
1:  
1: 2
1: w
35 Watt
probablykasper commented 3 years ago

let mut word = "".to_string();

Can this be replaced with the following?

let mut word = String::new();

Sure, changed those

The multi-word form of "light seconds" is still returning LightYear tokens:

"sec" | "secs" | "second" | "seconds" => Token::Unit(LightYear),

Fixed, thank you

A lot of the "second word" match blocks have this:

string => return Err(format!("Invalid string: {}", string)),

Shouldn't it be matching _ rather than string?

That should work fine, it's the same thing except it uses the string. If my finger slips and I type light sex instead of light sec, it'll say Invalid string: sex

probablykasper commented 3 years ago

It does look nice and clean, but it doubled the lexing time, yikes

Do you have any ideas for why this might be? At first glance it doesn't look like that should happen. Maybe it's to do with all the additional function calls? My implementation only uses a couple of closures - I'm guessing they're handled differently.

I'm not sure, I expected it to be about as fast as yours. I tried changing read_word to write to an argument word: &mut String instead of returning a String, but that didn't improve it.

Do you have any ideas for how to handle newton and pound lexing with this model? They were the biggest reason for me abandoning the iterator approach.

I think any of the 3 solutions I mentioned should work

djmattyg007 commented 3 years ago

How does this behave when the second word isn't "hour"? Doesn't it result in the second word being thrown away?

I discovered that my implementation had a similar problem. The latest commit I just pushed should resolve that issue.

djmattyg007 commented 3 years ago

I think any of the 3 solutions I mentioned should work

Ah, I misunderstood what you wrote originally. I think converting the series of tokens (option 3) is probably the best approach, even if it results in something like pounds - force being parsed as if it was pounds-force.

probablykasper commented 3 years ago

Great news! I fixed the issue where tokens after watt are ignored, and in the process I accidentally improved the performance (31µs now).

I opened a draft PR for it: https://github.com/probablykasper/cpc/pull/18

djmattyg007 commented 3 years ago

Have you compared the memory usage amongst the various implementations?

probablykasper commented 3 years ago

@djmattyg007 I haven't; it hasn't been a major concern for cpc. How do you go about checking?

djmattyg007 commented 3 years ago

You'll need to use a profiling tool. A cursory search revealed the following references that may or may not be useful (I haven't used any of them personally):

https://nnethercote.github.io/perf-book/profiling.html
https://gist.github.com/KodrAus/97c92c07a90b1fdd6853654357fd557a
https://github.com/tikv/pprof-rs
https://crates.io/crates/criterion

probablykasper commented 3 years ago

Had a hard time finding anything that works easily on macOS :/

probablykasper commented 3 years ago

Nevermind, there's a built-in Instruments app, which sounds like it would work. I'm assuming Allocations > All Heap & Anonymous VM > Total Bytes is what needs to be looked at.

Results of cpc '34 cubic feet + 23 cubic yards':

djmattyg007 commented 3 years ago

Nice work! Would you be able to share the command(s) you ran? It would be good to document them somewhere if it's non-trivial.

probablykasper commented 3 years ago

This command generates a .trace file:

instruments -t Allocations ./path-to-cpc "34 cubic feet + 23 cubic yards"

Then if you open the .trace file, you should be able to see this list: (screenshot: Screen Shot 2021-07-06 at 9 39 38 PM)

I also found out you can use this command:

/usr/bin/time -l ./path-to-cpc "34 cubic feet + 23 cubic yards"

which will tell you the maximum resident set size, but when I tried it now it gave me 1073152. If that's bytes, it would be 1048KiB, which doesn't match the results from Instruments

probablykasper commented 3 years ago

Oh, Instruments is part of Xcode, not built-in

djmattyg007 commented 3 years ago

Thank you :) Even though I'm not a mac user, that's still helpful to know!