lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.88k stars 414 forks source link

Trailing whitespace in Lark code ends up in standalone module #1193

Closed bcr closed 1 year ago

bcr commented 2 years ago

Describe the bug

Using the standalone generator tool generates a module with trailing line whitespace that comes from trailing line whitespace in the Lark source code.

One suggestion is to trim trailing whitespace when emitting the module (handles the case where it may happen again.) Another is to change it in the source (but it might happen again.)

I can appreciate that this is a bit pedantic, I am prepared to post-process the output to remove whitespace.

To Reproduce

If you make the simplest grammar possible (start:) and generate a standalone module from it, there are lines that have trailing whitespace. These are included from various files. I have removed the whitespace and the generated module no longer has it.

I can fork and open a PR if you want to make this change, but the diffs are straightforward and included below. I have tested this in my local environment and it worked.

diff --git a/lark/exceptions.py b/lark/exceptions.py
index da982e3..dafe0b3 100644
--- a/lark/exceptions.py
+++ b/lark/exceptions.py
@@ -72,7 +72,7 @@ class UnexpectedInput(LarkError):
             after = text[pos:end].split(b'\n', 1)[0]
             return (before + after + b'\n' + b' ' * len(before.expandtabs()) + b'^\n').decode("ascii", "backslashreplace")

-    def match_examples(self, parse_fn: 'Callable[[str], Tree]', 
+    def match_examples(self, parse_fn: 'Callable[[str], Tree]',
                              examples: Union[Mapping[T, Iterable[str]], Iterable[Tuple[T, Iterable[str]]]],
                              token_type_match_fallback: bool=False,
                              use_accepts: bool=True
@@ -229,7 +229,7 @@ class UnexpectedToken(ParseError, UnexpectedInput):

     def __init__(self, token, expected, considered_rules=None, state=None, interactive_parser=None, terminals_by_name=None, token_history=None):
         super(UnexpectedToken, self).__init__()
-        
+
         # TODO considered_rules and expected can be figured out using state
         self.line = getattr(token, 'line', '?')
         self.column = getattr(token, 'column', '?')
diff --git a/lark/lark.py b/lark/lark.py
index 8da9840..9f80ef0 100644
--- a/lark/lark.py
+++ b/lark/lark.py
@@ -310,7 +310,7 @@ class Lark(Serialize):
                 else:
                     if self.options.cache is not True:
                         raise ConfigurationError("cache argument must be bool or str")
-                        
+
                     try:
                         username = getpass.getuser()
                     except Exception:
@@ -339,7 +339,7 @@ class Lark(Serialize):
                     pass
                 except Exception: # We should probably narrow done which errors we catch here.
                     logger.exception("Failed to load Lark from cache: %r. We will try to carry on.", cache_fn)
-                    
+
                     # In theory, the Lark instance might have been messed up by the call to `_load`.
                     # In practice the only relevant thing that might have been overwritten should be `options`
                     self.options = old_options
@@ -609,7 +609,7 @@ class Lark(Serialize):
     def get_terminal(self, name: str) -> TerminalDef:
         """Get information about a terminal"""
         return self._terminals_dict[name]
-    
+
     def parse_interactive(self, text: Optional[str]=None, start: Optional[str]=None) -> 'InteractiveParser':
         """Start an interactive parsing session.

diff --git a/lark/parser_frontends.py b/lark/parser_frontends.py
index e162edf..4e28e36 100644
--- a/lark/parser_frontends.py
+++ b/lark/parser_frontends.py
@@ -74,7 +74,7 @@ class ParsingFrontend(Serialize):

         if lexer_conf.postlex:
             self.lexer = PostLexConnector(self.lexer, lexer_conf.postlex)
-    
+
     def _verify_start(self, start=None):
         if start is None:
             start_decls = self.parser_conf.start
@@ -94,7 +94,7 @@ class ParsingFrontend(Serialize):
         kw = {} if on_error is None else {'on_error': on_error}
         stream = self._make_lexer_thread(text)
         return self.parser.parse(stream, chosen_start, **kw)
-    
+
     def parse_interactive(self, text=None, start=None):
         chosen_start = self._verify_start(start)
         if self.parser_conf.parser_type != 'lalr':
erezsh commented 2 years ago

I agree, it's very pedantic, but that's okay! If you want to submit this as a PR, I'm fine with it.

bcr commented 2 years ago

OK, my plan is to fix both. I'll try out an rstrip also and see if you like that.

diff --git a/lark/tools/standalone.py b/lark/tools/standalone.py
index 87e9031..9684350 100644
--- a/lark/tools/standalone.py
+++ b/lark/tools/standalone.py
@@ -85,7 +85,7 @@ def extract_sections(lines):
             else:
                 raise ValueError(line)
         elif section:
-            text.append(line)
+            text.append(line.rstrip())

     return {name: ''.join(text) for name, text in sections.items()}
bcr commented 2 years ago

I'll make sure to at least try it first :)