mjy / nexus_parser

A parser/lexer for Nexus (phylogenetic) files.

RESPECTCASE in FORMAT block causes infinite loop in CHARACTERS block parsing #8

Closed kleintom closed 6 months ago

kleintom commented 6 months ago

FORMAT DATATYPE = STANDARD RESPECTCASE GAP = - MISSING = ? SYMBOLS = " 0 1 2 3 4 5 6 7";

At a quick glance it looks like the FORMAT parser is looking for 'value pairs': https://github.com/mjy/nexus_parser/blob/6a647b32be6ec19a4bf7c07fed13bb35ae9f5743/lib/nexus_parser/parser.rb#L152-L156

whereas RESPECTCASE is standalone. The parser consumes the first pair, DATATYPE = STANDARD, and then leaves the next token on the parser as @next_token=#<NexusParser::Tokens::Label:0x00007fd09e1157d0 @value="RESPECTCASE">, which nothing handles, so you get an infinite loop.
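To make the failure mode concrete, here is a minimal sketch of a pair-only loop getting stuck on a standalone keyword. It is a simplified model, not the nexus_parser code: `parse_format_pairs` and `max_steps` are hypothetical names, tokens are plain strings rather than token objects, and the step guard exists only so the sketch terminates instead of spinning forever.

```ruby
# Hypothetical, simplified model of the failure; not the actual nexus_parser code.
# Tokens are plain strings here; the real lexer produces token objects.
def parse_format_pairs(tokens, max_steps: 100)
  pairs = {}
  pos = 0
  steps = 0
  until tokens[pos] == ";"
    steps += 1
    # Guard so this sketch returns instead of looping forever.
    return [:stuck, pairs, tokens[pos]] if steps > max_steps

    # The loop only advances when it sees KEY = VALUE.
    # A bare keyword such as RESPECTCASE matches nothing, so pos never moves.
    if tokens[pos + 1] == "="
      pairs[tokens[pos]] = tokens[pos + 2]
      pos += 3
    end
  end
  [:ok, pairs, nil]
end

tokens = %w[DATATYPE = STANDARD RESPECTCASE GAP = - MISSING = ? ;]
status, pairs, stuck_on = parse_format_pairs(tokens)
# status   => :stuck
# pairs    => {"DATATYPE"=>"STANDARD"}
# stuck_on => "RESPECTCASE"
```

In this model, only the first pair is consumed before the loop stalls on the unhandled token, which matches the behaviour described above.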

I'm not sure what should happen here, maybe just return an 'unsupported' error?

mjy commented 6 months ago

> I'm not sure what should happen here, maybe just return an 'unsupported' error?

It's been a long time since I played with the lexer/parser. But yes, two solutions exist: 1) raise an error, or 2) add a Token/spec and see if it will "just work". Because the Token is a literal, I suspect it will be very straightforward to match it and, by default, do nothing but move on.
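The two options could be sketched roughly as follows. This is an illustration only, not the project's implementation: `parse_format`, `STANDALONE_FLAGS`, `UnsupportedFormatToken`, and the `strict:` switch are all hypothetical names, and tokens are modelled as plain strings. The point of the sketch is that every branch either advances the position or raises, so the loop cannot stall.

```ruby
# Hypothetical sketch of both options; none of these names come from nexus_parser.
STANDALONE_FLAGS = %w[RESPECTCASE].freeze # assumption: the keywords accepted without "= value"

class UnsupportedFormatToken < StandardError; end

def parse_format(tokens, strict: false)
  pairs = {}
  flags = []
  pos = 0
  while pos < tokens.length && tokens[pos] != ";"
    word = tokens[pos]
    if tokens[pos + 1] == "="
      # KEY = VALUE pair
      pairs[word.upcase] = tokens[pos + 2]
      pos += 3
    elsif !strict && STANDALONE_FLAGS.include?(word.upcase)
      # Option 2: recognise the literal, record it, and move on.
      flags << word.upcase
      pos += 1
    else
      # Option 1: fail loudly instead of looping.
      raise UnsupportedFormatToken, "unsupported FORMAT token: #{word}"
    end
  end
  { pairs: pairs, flags: flags }
end

tokens = %w[DATATYPE = STANDARD RESPECTCASE GAP = - MISSING = ? ;]
result = parse_format(tokens)
# result[:flags] => ["RESPECTCASE"]
# parse_format(tokens, strict: true) raises UnsupportedFormatToken
```

Recording the flag rather than discarding it keeps the information available to later stages, at the cost of a slightly larger result structure.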

Downstream, I think we likely need to respect case when we create the CharacterState, i.e. the processor or middle layer will need to normalize to the expected format before we stub CharacterStates. Mental note: check whether we respect case in CharacterState, and add it if we don't.
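A rough sketch of what such a normalization step might look like is below. `normalize_states` is a hypothetical helper, not part of nexus_parser, and the choice to upcase when case is not respected is an assumption made only for illustration.

```ruby
# Hypothetical helper; not part of nexus_parser.
# Assumption: when RESPECTCASE is absent, state symbols are folded to upper case.
def normalize_states(raw_states, respect_case:)
  respect_case ? raw_states.dup : raw_states.map(&:upcase)
end

folded = normalize_states(%w[a A b], respect_case: false).uniq
# folded => ["A", "B"]
kept = normalize_states(%w[a A b], respect_case: true).uniq
# kept => ["a", "A", "b"]
```

With case folded, "a" and "A" collapse into one state; with case respected, they remain distinct, which is the difference the downstream layer would need to preserve.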

mjy commented 6 months ago

@kleintom I handled it silently. If we find this produces lossy parses we will have to rethink that, i.e. this is quick and dirty.