technosophos / dashing

A Dash Generator Script for Any HTML

Regex negative lookahead not supported #62

Open leothelocust opened 2 years ago

leothelocust commented 2 years ago

TL;DR

I need to refine the results of a specific CSS selector that also returns items I don't want, identified by a word in the captured text. In this case the word is "release". I don't want anything containing that word in the source documentation to be added to the docset.

A bit wordier

In the source documentation, my selector returns a nice list of "Sections", but the same CSS selector also returns about 200 "release notes" sections.

Essentially I have a bunch of these I want to get rid of:

2020 Release Notes
2019 Winter Release Notes
Upgrade release-notes for xyz

I don't want those to be included in the resulting docset, so I tried using the regex field to return everything that does not include the word "release".

I essentially need the opposite of this:

^.*release.*$

So, don't return anything that has the word "release" in it.

I tried a negative lookahead, (?!...), in the regex, but I get this error:

error parsing regexp: invalid or unsupported Perl syntax: `(?!`

Is there a field in the selector object for rejecting if the title contains a word? I didn't see anything for this purpose in the README:

"css selector": {
      "requiretext": "require that the text matches a regexp. If not, this node is not considered as selected",
      "type": "Dash data type",
      "attr": "Use the value of the specified attribute instead of html node text as the basis for transformation",
      "regexp": "PCRE regular expression (no need to enclose in //)",
      "replacement": "Replacement text for each match of 'regexp'",
      "matchpath": "Only files matching this regular expression will be parsed. Will match all files if not set."
}

stevenkaras commented 1 year ago

Dashing doesn't support PCRE. It uses Go's regexp package, which follows RE2 syntax, and RE2 does not support lookaround.
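
To illustrate the point, here is a minimal Go sketch of how the standard regexp package treats a negative lookahead, together with the usual way to get the same effect in Go code: match the unwanted word and negate the result. The helpers keepTitle and lookaheadCompiles are hypothetical names made up for this example. They are not part of dashing, and dashing's selector config may not expose an equivalent option.

```go
package main

import (
	"fmt"
	"regexp"
)

// releaseRe matches any text containing "release", case-insensitively.
var releaseRe = regexp.MustCompile(`(?i)release`)

// keepTitle is a hypothetical helper (not part of dashing). It reports
// whether a title should be kept, meaning it does NOT mention "release".
func keepTitle(title string) bool {
	return !releaseRe.MatchString(title)
}

// lookaheadCompiles reports whether Go's regexp package accepts a
// pattern that uses a negative lookahead.
func lookaheadCompiles() bool {
	_, err := regexp.Compile(`^(?!.*release).*$`)
	return err == nil
}

func main() {
	fmt.Println("negative lookahead compiles:", lookaheadCompiles())

	// Two titles from the issue plus one made-up title that should be kept.
	titles := []string{
		"2020 Release Notes",
		"Upgrade release-notes for xyz",
		"Getting Started",
	}
	for _, t := range titles {
		fmt.Printf("%q keep=%v\n", t, keepTitle(t))
	}
}
```

Negating a positive match like this only works where you control the calling code. Inside a config field that accepts a single pattern, the exclusion has to be written into the pattern itself.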

I've hit something similar and worked around it by adding a match like this:

{
    "type": "Guide",
    "matchpath": "foobar/([^r]|r[^e]).*\\."
}
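
As a rough check of what that pattern does, the sketch below runs it through Go's regexp package against a few paths. The file names are invented for illustration, "foobar" is the placeholder directory from the config above, and pathSelected is a hypothetical helper that only mimics a matchpath test. Note that this filters by file path, not by title text, and it excludes any file whose name starts with "re", not just release notes.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathRe is the matchpath pattern from the config above after JSON
// unescaping: "\\." in the JSON string is `\.` in the regexp.
var pathRe = regexp.MustCompile(`foobar/([^r]|r[^e]).*\.`)

// pathSelected is a hypothetical helper that mimics a matchpath check.
// It is an assumption for this example, not dashing's actual code.
func pathSelected(path string) bool {
	return pathRe.MatchString(path)
}

func main() {
	// Invented example paths under the placeholder directory "foobar".
	paths := []string{
		"foobar/guide.html",
		"foobar/roadmap.html",
		"foobar/release-notes.html",
	}
	for _, p := range paths {
		fmt.Printf("%-28s selected=%v\n", p, pathSelected(p))
	}
}
```

The alternation ([^r]|r[^e]) spells out "the name does not begin with re" one character at a time. Excluding a longer prefix needs one more branch per character, for example adding re[^l] to also require that the name does not begin with "rel".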