Closed SMartinScottLogic closed 1 year ago
Interesting question, I know that if it can be retained that behavior being optional seems necessary.
my nimbleparse_lsp project generally should assume files start in the Initial
state, and calls lexer
multiple times.
But for streaming lexers (which I don't believe exist), some retention between &str
's would seem necessary but that would likely not be the lexer()
function, but a function on lexer()
's return value, akin to the NewlineCache::feed()
function.
One place I could imagine using this is in my crimson project, each file currently starts with an optional header/schema where type declarations go, which specifies the type of the top-level value. Something to retain the stack would allow me to specify a header/schema at the top of the first file then disallow it for subsequent files. To ensure things aren't sending redundant data or changing the schema between files.
So I guess my personally I think that if it isn't troublesome to implement, it seems like it could be an interesting feature.
Question: If the input to fn lexer(...) leaves the lexer with a non-empty stack, should it be retained for (optional) resumption with a following call to the lexer?
The intention (though the API is definitely not complete for it!) would be that this is a "streaming lexer" (vs. the current NonStreamingLexer
s).
My understanding from the above, is that it would be worth extending the return value from lexer()
to include details of the final state of the start condition stack. I like it, it will help with tests as well!
Edit: I think I see now what I was missing in my comment below, and what you were getting at in your comment above. The tricky evaluation order at construction time has thrown my head for a loop.
So, I think through it maybe that lexer()
might not be the place where this is needed, as that takes the .l file input (this is where I went awry!), rather than
The input which is going to be performing start condition operations, perhaps I am misunderstanding something about your question though, but I don't see how the input to lexer()
.
Once I look at the LRNonStreamingLexer
though and it's combination of new()
and iter()
, I don't see a good way to expose it without, turning that into an something like Iterator<Item=Result<Lexeme ...>, StartState)>
, exposing the target state at every iteration.
Perhaps you have something in mind, but now that I think i'm looking in the right place I don't quite see any convenient way to just get just the final state, without doing something that I imagine might add overhead to each iteration. So, I'm no longer sure if this is something we can do right now, or just need to keep in mind when designing the "streaming lexer", like Laurie mentioned.
I'm willing to be proven wrong on this, but I designed the NonStreaming
interface to be as simple as possible, knowing that it would almost certainly be too simple for streaming. I don't know what the streaming interface should look like: it came under "think about it one day in the far future" :)
For example, if the current lexer can return "half way through", I think we would force the user to add an extra check "the lexer stack is empty" as a proxy for "was my file completely lexed"? That seems like something most people (I assume NonStreaming
is the more common case!) would not expect to have to deal with themselves.
I pushed a commit here, based on trying to get this working in comments example if it helps, but most of it is mentioned in my review. https://github.com/ratmice/grmtools/commit/67a57c6137a857ca13a3164590ad65d9c1c8df9d
Hello @SMartinScottLogic, I wondered how this PR is getting on?
Sorry, I've been on family vacation - I'll aim to get the next commits in in the next couple of days
From: Laurence Tratt @.> Sent: 04 September 2022 10:42 To: softdevteam/grmtools @.> Cc: Simon Martin @.>; Mention @.> Subject: Re: [softdevteam/grmtools] Add operations to start condition processing. (PR #341)
Hello @SMartinScottLogichttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSMartinScottLogic&data=05%7C01%7Csmartin%40scottlogic.com%7C61b39fd0b55d4d3688d808da8e59a75a%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C637978812968196825%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8Fg%2FwxV%2FAjD138PBjKQrDMg9J7wPJFZmN%2BYHzonSMxQ%3D&reserved=0, I wondered how this PR is getting on?
- Reply to this email directly, view it on GitHubhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsoftdevteam%2Fgrmtools%2Fpull%2F341%23issuecomment-1236299571&data=05%7C01%7Csmartin%40scottlogic.com%7C61b39fd0b55d4d3688d808da8e59a75a%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C637978812968196825%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=y4e66og%2B8U1ntNkzX%2FLBP8Rq7ihix3ap5rYxB7Rr3vw%3D&reserved=0, or unsubscribehttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAESH56ECMIVEPNHA5U4ZV33V4RVEZANCNFSM56QFSLNQ&data=05%7C01%7Csmartin%40scottlogic.com%7C61b39fd0b55d4d3688d808da8e59a75a%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C637978812968196825%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=unAfF4pHWz2D9GtU2OPvI0VQ11Gv9m1hatPMKpEegkU%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.**@.>>
I guess, I have one minor question that isn't quite about this patch the example, lrpar/examples/start_states/comment.y Wondering if it is intended, or if it would be better to include epsilon in the rules, so that comments alone/lines with no text parse successfully, something like:
%start Expr
%%
Expr: Expr Text | ;
Text: 'TEXT';
Hello @SMartinScottLogic, are you still chugging away on this one?
@ratmice has very patiently and kindly explained what's going in this PR to the point that I've realised my comment in https://github.com/softdevteam/grmtools/pull/341#issuecomment-1214969503 was entirely bone-headed! This PR has nothing to do with streaming / non-streaming and I'm an idiot!
One thing that Matt and I have realised -- although we haven't thrashed out all the details -- relates to the comment in https://github.com/softdevteam/grmtools/pull/341#issuecomment-1214716803. The return type needs to have something like an Option<LexingState>
attached to it. My working guess is that we might just be able to do something as simple as extending LexError
:
pub struct LexError {
span: Span,
lexing_state: Option<LexingState>
}
Basically I think I'd like to get the functionality of this PR into grmtools before we make the next release if possible. I think there are people who will find this useful, and if (as is possible) it requires an API tweak or two, we can bite that bullet alongside the other API changes we have on the master branch.
@SMartinScottLogic I'd love to see this PR get into grmtools before we do the next release (hopefully soon). Is that plausible, do you think?
Sorry for the delay on this one - I've had a number of distractions come up in my personal and professional life in the last few months. I'll hopefully be able to address the concerns on this PR in the next week or so.
From: Laurence Tratt @.> Sent: 02 November 2022 08:19 To: softdevteam/grmtools @.> Cc: Simon Martin @.>; Mention @.> Subject: Re: [softdevteam/grmtools] Add operations to start condition processing. (PR #341)
@SMartinScottLogichttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSMartinScottLogic&data=05%7C01%7Csmartin%40scottlogic.com%7C18941c053fbf44bb6bb308dabcaaf713%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C638029739757987760%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=z%2FEpxE0jG%2FmVU6DHcxr3RsaLoihl4pI3VIl%2FCfI51TY%3D&reserved=0 I'd love to see this PR get into grmtools before we do the next release (hopefully soon). Is that plausible, do you think?
- Reply to this email directly, view it on GitHubhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsoftdevteam%2Fgrmtools%2Fpull%2F341%23issuecomment-1299785602&data=05%7C01%7Csmartin%40scottlogic.com%7C18941c053fbf44bb6bb308dabcaaf713%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C638029739757987760%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=v5KgShO0%2FY14LUDuTpwd%2B83XN0oAzEV%2FgkpZZraKtT4%3D&reserved=0, or unsubscribehttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAESH56GGAE35YNGLBKHVJWLWGIPZDANCNFSM56QFSLNQ&data=05%7C01%7Csmartin%40scottlogic.com%7C18941c053fbf44bb6bb308dabcaaf713%7C8979621591b24b6db90be8fe588f6336%7C0%7C0%7C638029739757987760%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Q7zl2PyYjXAdVkMikAgYvIy1wGdAzvOSXVsff%2FIWNEQ%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.**@.>>
Excellent :)
I guess, I have one minor question that isn't quite about this patch the example, _lrpar/examples/startstates/comment.y Wondering if it is intended, or if it would be better to include epsilon in the rules, so that comments alone/lines with no text parse successfully, something like:
%start Expr %% Expr: Expr Text | ; Text: 'TEXT';
Changed under 145f8d68eaf5aa0de3338ea56e692211f889f8cc
Could you clarify exactly what you'd expect to be in LexingState
under https://github.com/softdevteam/grmtools/pull/341#issuecomment-1277830747 ?
I'm having some difficulty tracking which suggestions are current and which have been superseded.
@SMartinScottLogic I believe that adding the Option<LexingState>
, is for the following cases, where it can be none
// On receiving 'a' clear the implicit INITIAL state from the stack then push the B state, stack looks like [B].
a <B>'a'
// on receiving 'b' in the B state, Pop the B state, leaving an empty stack due to push/pop imbalance.
<B>b <-B>'b'
// on receiving 'a' when in the INITIAL state Pop the INITIAL state off stack leaving an empty stack.
<INITIAL>a <-INITIAL>'A'
Initially these caused a panic, I had suggested that these could fall back to the initial state via unwrap_or(initial_state)
,
But I believe the final suggestion was to make these return a LexError
, with a lexing_state: None
Edit: Apologies for the lack of clarity on this point.
I think @ratmice has summed up my thinking too. Apart from that I don't have any major comments.
I understand the failure cases in https://github.com/softdevteam/grmtools/pull/341#issuecomment-1304813657, what I'm unsure of is - in other cases where we produce a LexError
(i.e. where lexing_state
should be present) what type is within the Option
? LexingState
is not an existing type - shall I create a new type, and put current start state details in for the current PR?
Good question, and I don't immediately have a good suggestion. @ratmice might?
Aha, I was thinking just the id:usize
, with the contents being state_stack.last()
and the user would then have to call get_start_state_by_id
, but now I see that it isn't pub
(I'm running a bit behind on a holiday so I don't have a chance to see if there are any difficulties with making it pub
though).
There are also some questions if we are making the id
public if it should also have an idxnewtype
associated with it I suppose?
This at least seemed better to me than making the StartState
clone, and putting that directly in the LexError?
There are also some questions if we are making the id public if it should also have an idxnewtype associated with it I suppose?
What do we expect people to do with that state? Is a fully opaque state OK or do they need to peek at the contents to make use of it?
There are also some questions if we are making the id public if it should also have an idxnewtype associated with it I suppose?
What do we expect people to do with that state? Is a fully opaque state OK or do they need to peek at the contents to make use of it?
I think it is just needed to get a struct StartState
via get_start_state_by_id
, and so fully opaque ID would seem fine, once they get the struct StartState
, primarily the printing of the name
, and name_span
fields I believe just so you can discern which state the lexer was in when the error occurred.
The usize itself, being just an incrementing counter of states doesn't really provide much useful information itself really.
Ah, I see, so I think we're saying:
Option<StartState>
.clone
ing a StartState
which commits us to an ugly API.I think I would thus suggest:
StartStateId
opaque type, that's currently a wrapper around usize
.fn get_start_state_by_id(&self, id: usize) -> Option<&StartState>
to pub fn get_start_state_by_id(&self, id: StartStateId) -> Option<&StartState>
.name(&self) -> &str
(or similar) and name_span(&self) -> Span
methods to StartState
so users can work out what start state caused the problem.Does that sound sensible?
Does that sound sensible?
Yeah, sounds alright to me.
Hello Simon, I'm keen to get a new release of grmtools out of the door soon, as we've piled up lots of useful changes since the last release 8 months ago (time flies!). Do you think we can get this PR done for that release? It would be great if we can! -Laurie
Sorry, I'm not understanding the 'opaque type' bit above - I haven't encountered this in rust - so I'm not certain of the desired code. I understand and agree with the remainder, and I've largely completed (using usize for StartStateId at the moment) - c.f. 224744d96caae7d33322b25ddd73666d196110ea
Sorry! By "opaque type" we just mean struct S(SomeHiddenType)
where users can't access SomeHiddenType
directly but can still e.g. compare two S
s for equality, or hash S
s etc. In this case I think we'd just have struct StartStateId(usize)
or similar. The advantage of an opaque type is that we can change the inner type later without imposing an API change on end users.
@SMartinScottLogic Please squash (you'll need to rebase against master and resolve whatever the conflict(s) is/are).
Squashed, as requested
bors r+
Build failed:
I think you'll need to add a new commit for cargo fmt
.
bors r+
I thought I was assigned, but perhaps I'm misremembering. Anyway I don't think bors will listen to me because of that.
Edit: perhaps bors was just being slow!
Build succeeded:
Pending: tests in
lrlex/src/lib/lexer.rs
Question: If the input to
fn lexer(...)
leaves the lexer with a non-empty stack, should it be retained for (optional) resumption with a following call to the lexer?