nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Disable rules/parenthesized header #38

Open mikecook69 opened 6 months ago

mikecook69 commented 6 months ago

I found the PARENTHESIZED_HDR regex is causing problems.

E.g., given the text

You can always check the car's manual if you're stuck.

(The manual should be located in the glove box.)

Otherwise please call for help.

With this input, the parenthesized line (the one about the manual) is marked as a header. I disabled the PARENTHESIZED_HDR regex because it didn't seem useful, but could there be a config file for disabling rules like this one?
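
To make the request concrete, here is a rough sketch of what a per-rule switch could look like. Everything in it is an assumption for illustration: the `HEADER_RULES` table, the `matching_header_rules` function, the `disabled` parameter, and both regex patterns are hypothetical stand-ins, not the actual nlm-ingestor code or its real PARENTHESIZED_HDR pattern.

```python
import re

# Hypothetical stand-in patterns, NOT the real nlm-ingestor regexes.
HEADER_RULES = {
    # Any line that is wrapped entirely in parentheses.
    "PARENTHESIZED_HDR": re.compile(r"^\s*\(.+\)\s*$"),
    # Lines such as "1.2 Introduction".
    "NUMBERED_HDR": re.compile(r"^\s*\d+(\.\d+)*\s+\S"),
}


def matching_header_rules(line, disabled=frozenset()):
    """Return the names of enabled rules whose pattern matches the line."""
    return [
        name
        for name, pattern in HEADER_RULES.items()
        if name not in disabled and pattern.search(line)
    ]


line = "(The manual should be located in the glove box.)"

# With every rule enabled, the parenthesized sentence is flagged.
print(matching_header_rules(line))

# With the rule listed as disabled (e.g. loaded from a config file),
# the same line is no longer treated as a header.
print(matching_header_rules(line, disabled={"PARENTHESIZED_HDR"}))
```

The set of disabled rule names could come from a config file or an API parameter, so users would not have to patch the regex out of the source.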