djmattyg007 closed this 3 years ago
Another note to self: I'm pretty sure `is_alphabetic_extended` is no longer used, and should be cleaned up.
Yet another note to self: there should be some tests for lexing unit conversion operators like `in` and `to`.
Nice!
I tried measuring the performance difference with this:

```
cargo run --release -- '1 kilometer + 2 light year - 2 light year - 2 light year - 2 light year - 2 light year' -v
```
Lexing time went from 29-30µs to 36-38µs (+25%). I'm guessing this is because of the graphemes parsing?
> Lexing time went from 29-30µs to 36-38µs (+25%). I'm guessing this is because of the graphemes parsing?
Yep, that's most likely. Fortunately it's small enough in real time that it hopefully won't make much of a difference in practical terms.
If I were to pass `$2000000000...`, currently cpc would read `$` and then return an error. With this, I think the lexer goes through the whole string first to parse the graphemes, and then starts reading the `$` to return the error.
The code I wrote relies on having random access to the entire string. There are a few places where we need to conditionally advance the pointer based on what the next word actually is. Some examples include parsing of "watt hours", "pounds-force" and "newton-metres".
One of the challenges I faced originally was the inability to get the current value from the iterator. It was only possible to get the next value. By switching to a vector, I made the code a lot simpler to understand.
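The vector-based approach can be sketched like this (a simplified sketch: plain `char`s stand in here for the grapheme clusters the real lexer collects with the unicode-segmentation crate):

```rust
fn main() {
    let input = "watt hours";
    // Collect into a Vec to get random access. An Iterator alone only
    // yields the *next* value; indexing lets us inspect the current
    // position and look ahead without consuming anything.
    let chars: Vec<char> = input.chars().collect();

    // Example: read the first word by scanning to the next space.
    let mut pos = 0;
    while pos < chars.len() && chars[pos] != ' ' {
        pos += 1;
    }
    let first_word: String = chars[..pos].iter().collect();
    println!("{}", first_word); // watt
}
```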
The handling of words like "watt" is a little bit of a weakness. In the specific example of "watt", if we determine that the next word isn't "hours" (or one of its variants), we essentially throw away the parsing effort for that next word, and the next iteration of the main outer loop has to re-parse it. We could potentially work around this if we felt it was important enough, but my instinct says it would make the code more difficult to understand.
I will write up a proper description for this PR later today, as I'm starting work now.
Another note to self: there should be some tests for expressions with parentheses.
Right, because `watt` is also valid by itself. What if it's parsed like this:
```
let tokens = [];

fn parse_token(word) {
    match word {
        "watt" => {
            match read_next_word() {
                "hour" => tokens.push(WattHour),
                other => {
                    tokens.push(Watt);
                    parse_token(other);
                },
            }
        },
    }
}

while word = read_next_word() {
    parse_token(word);
}
```
Having a crack at implementing what I mentioned; it seems possible so far.
> One of the challenges I faced originally was the inability to get the current value from the iterator

I'm getting around that by making the iterator peekable, then using `.peek()` to check the "current" value, then `.next()` to move along.
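A minimal sketch of that pattern:

```rust
fn main() {
    let mut chars = "pound".chars().peekable();

    // .peek() inspects the "current" value without consuming it...
    assert_eq!(chars.peek(), Some(&'p'));
    assert_eq!(chars.peek(), Some(&'p')); // still not consumed

    // ...and .next() moves along.
    assert_eq!(chars.next(), Some('p'));
    assert_eq!(chars.peek(), Some(&'o'));
}
```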
One minor issue I'm having is handling units like `pound-force`. I read the word `pound`, and then peek and see a `-`. To know if that should be `Minus` or `pound-force`, I need to check if the next character is `f`, but to do that I first need to consume the `-` (can't peek twice). I'm seeing 3 solutions, including:

- Returning a `Minus` token from deep within the word parser
- Converting `Unit(Pound)` + `Operator(Minus)` + `LexerKeyword(Force)` into `Unit(PoundForce)`. Would result in `pound - force` being parsed as `pound-force`

Committed my implementation so far here: https://github.com/probablykasper/cpc/tree/lexer-multiword (missing `newton` and `pound` lexing).
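For the token-conversion solution, a post-lexing merge pass might look something like this (token and unit names mirror the discussion above, not cpc's actual definitions):

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Unit(Unit),
    Operator(Operator),
    LexerKeyword(Keyword),
    Number(f64),
}

#[derive(Debug, PartialEq)]
enum Unit { Pound, PoundForce }
#[derive(Debug, PartialEq)]
enum Operator { Minus }
#[derive(Debug, PartialEq)]
enum Keyword { Force }

// Merge the series Unit(Pound) + Operator(Minus) + LexerKeyword(Force)
// into a single Unit(PoundForce) token after lexing.
fn merge_compound_units(tokens: Vec<Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    for token in tokens {
        out.push(token);
        let n = out.len();
        if n >= 3
            && out[n - 3] == Token::Unit(Unit::Pound)
            && out[n - 2] == Token::Operator(Operator::Minus)
            && out[n - 1] == Token::LexerKeyword(Keyword::Force)
        {
            out.truncate(n - 3);
            out.push(Token::Unit(Unit::PoundForce));
        }
    }
    out
}

fn main() {
    // "2 pound - force" lexes to four tokens; the pass merges the last three.
    let tokens = vec![
        Token::Number(2.0),
        Token::Unit(Unit::Pound),
        Token::Operator(Operator::Minus),
        Token::LexerKeyword(Keyword::Force),
    ];
    println!("{:?}", merge_compound_units(tokens));
    // [Number(2.0), Unit(PoundForce)]
}
```

As noted later in the thread, this approach means a spaced-out `pound - force` gets merged the same way as `pound-force`.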
It does look nice and clean, but it doubled the lexing time, yikes
A little bit of code review:

```rust
let mut word = "".to_string();
```

Can this be replaced with the following?

```rust
let mut word = String::new();
```

The multi-word form of "light seconds" is still returning `LightYear` tokens:

```rust
"sec" | "secs" | "second" | "seconds" => Token::Unit(LightYear),
```

A lot of the "second word" match blocks have this:

```rust
string => return Err(format!("Invalid string: {}", string)),
```

Shouldn't it be matching `_` rather than `string`?
> It does look nice and clean, but it doubled the lexing time, yikes

Do you have any ideas for why this might be? At first glance it doesn't look like that should happen. Maybe it's to do with all the additional function calls? My implementation only uses a couple of closures - I'm guessing they're handled differently.
Do you have any ideas for how to handle `newton` and `pound` lexing with this model? They were the biggest reason for me abandoning the iterator approach.
I just took another look at the implementation of "watt hours":

```rust
"watt" => {
    match read_word(chars).as_str() {
        "hr" | "hrs" | "hour" | "hours" => Token::Unit(WattHour),
        _ => Token::Unit(Watt)
    }
}
```
How does this behave when the second word isn't "hour"? Doesn't it result in the second word being thrown away?
I just pulled down your branch to test, and got the following output:

```
❯ cargo build --release --locked
   Compiling serde v1.0.125
   Compiling bitflags v1.2.1
   Compiling libc v0.2.93
   Compiling cc v1.0.67
   Compiling ord_subset v3.1.1
   Compiling rustc-serialize v0.3.24
   Compiling unicode-segmentation v1.8.0
   Compiling decimal v2.1.0
   Compiling cpc v1.6.0 (/home/djmattyg007/repos/rust/cpc)
    Finished release [optimized] target(s) in 9.61s
❯ target/release/cpc '12 watt plus 23 watt'
1: 1
1: w
1:
1: 2
1: w
Parsing error: Expected end of input, found Number(23) at 2
❯ target/release/cpc '12 watt + 23 watt'
1: 1
1: w
1: +
1:
1: 2
1: w
35 Watt
```
> `let mut word = "".to_string();`
>
> Can this be replaced with the following?
>
> `let mut word = String::new();`

Sure, changed those
> The multi-word form of "light seconds" is still returning `LightYear` tokens:
>
> `"sec" | "secs" | "second" | "seconds" => Token::Unit(LightYear),`

Fixed, thank you
> A lot of the "second word" match blocks have this:
>
> `string => return Err(format!("Invalid string: {}", string)),`
>
> Shouldn't it be matching `_` rather than `string`?

That should work fine, it's the same thing except it uses the string. If my finger slips and I type `light sex` instead of `light sec`, it'll say `Invalid string: sex`
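A small sketch of the difference (`lex_light_unit` is a hypothetical helper, not cpc's actual code):

```rust
// Binding the value in the catch-all arm (instead of `_`) lets the error
// message include the offending string; `_` would discard it.
fn lex_light_unit(word: &str) -> Result<&'static str, String> {
    match word {
        "sec" | "secs" | "second" | "seconds" => Ok("LightSecond"),
        string => Err(format!("Invalid string: {}", string)),
    }
}

fn main() {
    assert_eq!(lex_light_unit("sec"), Ok("LightSecond"));
    assert_eq!(lex_light_unit("sex"), Err("Invalid string: sex".to_string()));
}
```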
> It does look nice and clean, but it doubled the lexing time, yikes
>
> Do you have any ideas for why this might be? At first glance it doesn't look like that should happen. Maybe it's to do with all the additional function calls? My implementation only uses a couple of closures - I'm guessing they're handled differently.

I'm not sure, I expected it to be about as fast as yours. I tried changing `read_word` to write to an argument `word: &mut String` instead of returning a `String`, but that didn't improve it.
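The two `read_word` shapes being compared might look like this (a simplified sketch over a `&str`; the real lexer walks grapheme clusters):

```rust
// Shape 1: allocate and return a fresh String on every call.
fn read_word(input: &str) -> String {
    input.split_whitespace().next().unwrap_or("").to_string()
}

// Shape 2: reuse a caller-owned buffer instead, avoiding a fresh
// allocation per word (the attempted optimization discussed above).
fn read_word_into(input: &str, word: &mut String) {
    word.clear();
    word.push_str(input.split_whitespace().next().unwrap_or(""));
}

fn main() {
    assert_eq!(read_word("watt hours"), "watt");

    let mut buf = String::new();
    read_word_into("watt hours", &mut buf);
    assert_eq!(buf, "watt");
}
```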
> Do you have any ideas for how to handle `newton` and `pound` lexing with this model? They were the biggest reason for me abandoning the iterator approach.

I think any of the 3 solutions I mentioned should work
> How does this behave when the second word isn't "hour"? Doesn't it result in the second word being thrown away?

I discovered that my implementation had a similar problem. The latest commit I just pushed should resolve that issue.
> I think any of the 3 solutions I mentioned should work

Ah, I misunderstood what you wrote originally. I think converting the series of tokens (option 3) is probably the best approach, even if it results in something like `pounds - force` being parsed as if it was `pounds-force`.
Great news! I fixed the issue where tokens after `watt` are ignored, and in the process I accidentally improved the performance (31µs now).

I opened a draft PR for it: https://github.com/probablykasper/cpc/pull/18
Have you compared the memory usage amongst the various implementations?
@djmattyg007 I haven't; it's not a concern I've had for cpc. How do you go about checking?
You'll need to use a profiling tool. A cursory search revealed the following references that may or may not be useful (I haven't used any of them personally):
https://nnethercote.github.io/perf-book/profiling.html
https://gist.github.com/KodrAus/97c92c07a90b1fdd6853654357fd557a
https://github.com/tikv/pprof-rs
https://crates.io/crates/criterion
Had a hard time finding anything that works easily on macOS :/
Nevermind, there's a built-in Instruments app, which sounds like it would work. I'm assuming Allocations > All Heap & Anonymous VM > Total Bytes is what needs to be looked at.
Results of `cpc '34 cubic feet + 23 cubic yards'`:
Nice work! Would you be able to share the command(s) you ran? It would be good to document them somewhere if it's non-trivial.
This command generates a `.trace` file:

```
instruments -t Allocations ./path-to-cpc "34 cubic feet + 23 cubic yards"
```

Then if you open the `.trace` file, you should be able to see this list:
I also found out you can use this command:

```
/usr/bin/time -l ./path-to-cpc "34 cubic feet + 23 cubic yards"
```

which will tell you the maximum resident set size, but when I tried it now it gave me `1073152`. If that's bytes, it would be 1048KiB, which doesn't match the results from Instruments
Oh, Instruments is part of Xcode, not built-in
Thank you :) Even though I'm not a mac user, that's still helpful to know!
After my last message on #7 I somewhat rewrote the code from scratch and managed to solve my problem AND implement full support for all multi-word units. The only reason I've marked this PR as a draft is because I haven't had a chance to write a proper description for this PR. It's almost midnight here and I've been working on this all day, and I'm very tired :stuck_out_tongue: :sleeping:
That said, as far as I can tell this Just Works, and includes a bunch of tests too. I haven't added tests for every single unit, but there's a lot there and it's easy to add more. The tests should be pretty comprehensive too.
Oh, one more thing before I forget: before the rewrite, light seconds were being parsed as light years. I've fixed that up too.