pyparsing / pyparsing

Python library for creating PEG parsers
MIT License

locatedExpr vs Located: migrating code from pyparsing 2.4.7 to pyparsing 3.0.9 #478

Closed: anjalyge closed this issue 1 year ago

anjalyge commented 1 year ago

The contents of the file used for parsing:

this is a sample page to test parsing
line 0001 line 1
line 0002 line 2

The pattern used to create the parser:

p.Literal('line ') + p.Regex(r'(?P<abc>\d+)') + p.SkipTo(p.LineEnd().suppress())

When I use locatedExpr from pyparsing 2.4.7, I get the following output:

{'locn_start': 55, 'abc': '0002', 'value': ['line ', '0002', ' line 2'], 'locn_end': 71}

When I use the Located class from pyparsing 3.0.9 with the same pattern, I get the following output:

{'locn_start': 55, 'value': {'abc': '0002'}, 'locn_end': 71}

However, if I remove the named capturing group and change the pattern to the following:

p.Literal('line ') + p.Regex(r'\d+') + p.SkipTo(p.LineEnd().suppress())

I get the following output, which includes the full line:

{'locn_start': 55, 'value': ['line ', '0002', ' line 2'], 'locn_end': 71}

In all cases I am parsing with parse_with_tabs. What is the difference between the deprecated function and the new class that makes them yield different results for the same pattern?

ptmcg commented 1 year ago

I think you may be using as_dict() to view the contents of the parsed results. as_dict() does not display unnamed elements in the results. Please use the dump() method instead. Pyparsing's run_tests method uses dump() to display the parsed results:

import pyparsing as p

tests = """
    line 0002 line 2
"""

# Pattern with a named regex group ("abc")
parser = p.Literal('line ') + p.Regex(r'(?P<abc>\d+)') + p.SkipTo(p.LineEnd().suppress())
p.Located(parser).run_tests(tests)
p.locatedExpr(parser).run_tests(tests)

# Same pattern without the named group
parser = p.Literal('line ') + p.Regex(r'\d+') + p.SkipTo(p.LineEnd().suppress())
p.Located(parser).run_tests(tests)
p.locatedExpr(parser).run_tests(tests)

prints

line 0002 line 2
[0, ['line ', '0002', 'line 2'], 16]
- locn_end: 16
- locn_start: 0
- value: ['line ', '0002', 'line 2']
  - abc: '0002'
[0]:
  0
[1]:
  ['line ', '0002', 'line 2']
  - abc: '0002'
[2]:
  16

line 0002 line 2
[[0, 'line ', '0002', 'line 2', 16]]
[0]:
  [0, 'line ', '0002', 'line 2', 16]
  - abc: '0002'
  - locn_end: 16
  - locn_start: 0
  - value: ['line ', '0002', 'line 2']

line 0002 line 2
[0, ['line ', '0002', 'line 2'], 16]
- locn_end: 16
- locn_start: 0
- value: ['line ', '0002', 'line 2']
[0]:
  0
[1]:
  ['line ', '0002', 'line 2']
[2]:
  16

line 0002 line 2
[[0, 'line ', '0002', 'line 2', 16]]
[0]:
  [0, 'line ', '0002', 'line 2', 16]
  - locn_end: 16
  - locn_start: 0
  - value: ['line ', '0002', 'line 2']

The new Located class is more consistent in how it reports the parsed value: the matched tokens always appear under value, whether or not they contain any named items or regex groups.
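For migrating code that read the old as_dict() output, here is a minimal sketch of how the full token list and the named group can both be read from a Located result. It assumes pyparsing 3.x (for Located and parse_string) and reuses the pattern from this thread; the variable names are only illustrative.

```python
import pyparsing as p

# Same grammar as in this thread; the regex names its match "abc".
parser = p.Literal('line ') + p.Regex(r'(?P<abc>\d+)') + p.SkipTo(p.LineEnd().suppress())

result = p.Located(parser).parse_string('line 0002 line 2')

# Located returns [start, tokens, end]; the tokens are also stored as "value".
value = result['value']

# as_dict() reports named items only. Because "value" contains a named
# item ("abc"), its unnamed tokens do not appear in this view.
as_dict_view = result.as_dict()

# as_list() returns every matched token, named or not.
tokens = value.as_list()

# The named regex group is still reachable on the nested value.
abc = value['abc']

print(as_dict_view)
print(tokens)
print(result.dump())
```

So rather than relying on as_dict(), code that needs the whole matched line can read result['value'].as_list(), and code that needs the group can read result['value']['abc'].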

ptmcg commented 1 year ago

I'm closing this issue, as the change in behavior is as intended. But please feel free to reopen if you have more questions.