mrabarnett / mrab-regex

Alternative for variable length lookbehind. #491

Closed dsinghsf closed 1 year ago

dsinghsf commented 1 year ago

Hi Team

I have a list of files; each file has this format:

<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

What I am trying to achieve is to replace the text inside tag2 with the specific ip of its tag1. I have tried a lookbehind, but it is only able to cover the tag1 with a given ip:

(?<=(<tag1 ip1>)(.|\n)*)(<tag2 \*>)(.|\n)*.*</tag2>

However, it also matches the text that sits between tag1 and tag2, and during replacement that text is removed. Since the text inside tag1 is of variable length, I am not able to use the lookbehind, as it doesn't support variable-length lookbehinds.

Please suggest an alternative, or let me know if I am missing something. Thanks

makyen commented 1 year ago

This regular expression package definitely does support variable length lookbehinds. So, it's unclear to me why you're asking here saying that it doesn't or what you're really wanting as a response.

dsinghsf commented 1 year ago

@makyen Yeah, it doesn't support them, but could you help with a workaround that I can use for my use case?

makyen commented 1 year ago

@dsinghsf You appear to have read my comment as exactly the opposite of what I wrote. This regex package does support variable-length lookbehinds. We routinely use variable-length lookbehinds with this regular expression engine.
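The difference between the two engines can be checked directly. The following is a minimal sketch, not taken from the thread: the `sample` string and the variable names are illustrative assumptions, and the `regex` import is treated as optional so the snippet still runs where only the standard library is available.

```python
import re

# A variable-length lookbehind in the style of the patterns in this thread.
pattern = r"(?<=<tag1 ip1>(?:.|\n)*)<tag2 \*>"

# Illustrative input (an assumption, shaped like the files in the question).
sample = "<tag1 ip1>\n  some text\n  <tag2 *>\n  </tag2>\n</tag1>\n"

# The standard-library re module refuses to compile the pattern,
# because its lookbehinds must have a fixed width.
try:
    re.compile(pattern)
    re_accepts = True
except re.error as exc:
    re_accepts = False
    print("re rejected the pattern:", exc)

# The third-party regex module is optional in this sketch.
try:
    import regex
except ImportError:
    regex = None

if regex is not None:
    match = regex.search(pattern, sample)
    print("regex matched:", match.group(0) if match else None)
```

In other words, the fixed-width restriction belongs to `re`, not to this package.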

facelessuser commented 1 year ago

Unfortunately, I don't even know what this regex is trying to do.

(?<=()) # this matches nothing, why is this here?
((.|\n)) # Matches a single character or new line, why is it in double groups?
(<tag2 *>) # did you mean to match multiple spaces ` *` or did you mean to match a space and a literal star ` \*`?
(.|\n) # You then match a single char or newline 
.*  # and then as many characters as you can (assuming `DOTALL` is not enabled, it will stop at the first end of a line)

The question was posed as to why regex doesn't support variable lookbehinds, but from what I'm seeing, you aren't applying any useful lookbehinds and mistaking this to be a regex issue. It looks like a user logic issue.

mrabarnett commented 1 year ago

You haven't said precisely what you expect the output to be, so I'm guessing.

Search: (<tag1 (\w+)>[^<]*<tag2 \*>)[^<]*(</tag2>)
Replace: \1\2\3
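For reference, here is one way the search and replace above could be applied with the standard `re` module. This is only a sketch: the sample `text` is abbreviated from the question, and the variable names are illustrative.

```python
import re

# Abbreviated sample input, shaped like the files in the question.
text = """<tag1 ip1>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>
"""

# Group 1: everything from <tag1 ...> through <tag2 *>.
# Group 2: the ip found in the tag1 opening tag.
# Group 3: the closing </tag2>.
pattern = re.compile(r"(<tag1 (\w+)>[^<]*<tag2 \*>)[^<]*(</tag2>)")

# The old tag2 body is dropped and the captured ip is put in its place.
result = pattern.sub(r"\1\2\3", text)
print(result)
```

Because `[^<]*` cannot cross a `<`, the text between the two opening tags is kept inside group 1 rather than consumed and lost.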

Oh, and who is the "team" of which you speak? :-)

dsinghsf commented 1 year ago

@makyen Oh, my bad! Yes, I read your comment incorrectly. Also, I just realized that in our company environment it is not considered good practice to use third-party libraries, so the security team has blocked access to such websites. I will therefore not be able to use this library, but any other suggestion that works with Python's re module would be helpful. TIA

dsinghsf commented 1 year ago

@facelessuser It seems the quote feature removes the tags, which is why they were not showing. I have fixed it now, so you can go through the regex. As I mentioned above, I will not be able to use this module. If you can help with Python's built-in re module, that would be helpful.

dsinghsf commented 1 year ago

@mrabarnett I have fixed the regex; GitHub's quote feature was removing the tags. Please go through it. Also, I thought this repo was maintained by a group of people, hence the form of address :)

mrabarnett commented 1 year ago

My answer doesn't require any of the additional features of this module; it'll work with the re module.

facelessuser commented 1 year ago

EDIT: I missed where there was an official response that already suggested an approach. Oh, well.

Just to clarify, variable lookbehinds do work. I have a tool built upon this library. To visualize, here is a screenshot of the tool using the pattern:

(?<=<tag1 ip1>(?:.|\n)*)(<tag2 \*>(?:.|\n)*?</tag2>)

You can see it finds and targets the appropriate tags as expected. Just in case there is still doubt.

[Image: Screenshot 2023-02-28 at 10 17 18 AM]

dsinghsf commented 1 year ago

@facelessuser This looks great, but I want the tag2 match only for tag1 ip1. So the match should be only

  <tag2 *>
      sometext 
  </tag2>

out of below

<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

Please suggest if I am missing something to achieve that.

mrabarnett commented 1 year ago

I'm still unclear as to what you mean. It would help if you could write explicitly what the output text should be for the given input.

facelessuser commented 1 year ago

@dsinghsf Sorry, that is because my pattern was too relaxed, something like what @mrabarnett suggested (using [^<]* instead of (?:.|\n)*) is what you really need. I'm only showing the match part as I still don't understand your replace part:

(?<=<tag1 ip1>[^>]*)(<tag2 \*>(?:.|\n)[^>]*</tag2>)
[Image: Screenshot 2023-03-02 at 5 41 39 AM]
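If only the standard `re` module is available, the lookbehind can be swapped for an ordinary capture group that is put back during replacement. The sketch below is an assumption-laden illustration, not the pattern from the thread: `find_tag2` and `replace_tag2` are hypothetical helper names, and it follows the `[^<]*` form, so it assumes no `<` appears in the text between the tags.

```python
import re

# Abbreviated sample input, shaped like the files in the question.
text = """<tag1 ip1>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>
"""

def find_tag2(text, ip):
    """Return the tag2 element that belongs to the tag1 with the given ip."""
    # Group 1 plays the role of the lookbehind; group 2 is the tag2 element.
    pattern = re.compile(r"(<tag1 %s>[^<]*)(<tag2 \*>[^<]*</tag2>)" % re.escape(ip))
    match = pattern.search(text)
    return match.group(2) if match else None

def replace_tag2(text, ip, new_body):
    """Replace only the body of tag2 inside the tag1 with the given ip."""
    pattern = re.compile(r"(<tag1 %s>[^<]*<tag2 \*>)[^<]*(</tag2>)" % re.escape(ip))
    # A function replacement avoids backslash handling in new_body.
    return pattern.sub(lambda m: m.group(1) + new_body + m.group(2), text)

found = find_tag2(text, "ip1")
replaced = replace_tag2(text, "ip2", "NEW")
print(found)
print(replaced)
```

The prefix is matched and then written back unchanged, which gives the same effect as a lookbehind without needing variable-length support.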

As an aside, while you can use Regex to find things in HTML, HTML is not a regular language, and Regex is not always the best tool to manipulate HTML with. Personally, if I am doing advanced HTML editing, I would probably just use something like Beautiful Soup. Though, the HTML you are showing does look odd with attributes like *, unless that is meant only as an illustration in this example.

A library like Beautiful Soup is well suited for finding HTML nodes, removing them, adding new ones, getting at attributes, etc. That's not to say you can't use Regex, but the more complicated the task in HTML, the more complex and specific the required pattern will be in regex. For instance, the Regex pattern above assumes you have no HTML attributes with > in their value, but if they do, then it won't work as well unless they happen to be using HTML entities.

from bs4 import BeautifulSoup  # soup.select() relies on the soupsieve package internally

HTML = """
<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>
"""

soup = BeautifulSoup(HTML, 'html.parser')
print(soup.select('tag1[ip1] > tag2'))

Output

[<tag2 *="">
      sometext
  </tag2>]

dsinghsf commented 1 year ago

@facelessuser Thanks, the regex you suggested works! As for your suggestion to use a library to parse these XML tags (they are XML tags): the files are not proper XML and, as you mentioned, their attributes are not well formed (for example <tag1 *>). There are also many legacy files that need these changes, so I can't use a library for now. But your suggestion works. Thanks a lot!

@mrabarnett What @facelessuser suggested is what I was looking for. Thanks for all your help too. :)

dsinghsf commented 1 year ago

Hi @facelessuser, I need help again with a regex. I have tried many things but couldn't get it to work. My requirement is this: I have the same file

<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

What I want is a regex that, given an ip, adds a new child tag, tag3, just before the end of the parent tag. For example, given ip2, the regex should match and add tag3 to the above input file as:

<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
  <tag3>
     sometext that goes here
  </tag3>
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>

Thanks.

mrabarnett commented 1 year ago

Try this:

import regex

text = '''
<tag1 ip1>
  <tag2 *>
      sometext 
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip2>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
  <tag2 *>
      sometext
  </tag2>
</tag1>
'''

search_ip = 'ip2'
new_tag = '  <tag3>\n     sometext that goes here\n  </tag3>\n'

text = regex.sub(r'(?s)<tag1 %s>(?:(?!</tag1>).)*' % search_ip, r'\g<0>%s' % new_tag, text)

print(text)

dsinghsf commented 1 year ago

@mrabarnett Thanks for this. By the way, which Python version is this for?

mrabarnett commented 1 year ago

Any Python version. It'll also work using re.
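As a rough illustration of the same idea with `re`, the substitution can be wrapped in a small function. This is a sketch under stated assumptions: `add_child_tag` is a hypothetical helper name, `re.escape` is added so that an ip containing regex metacharacters (such as dots) is matched literally, and a function replacement is used so backslashes in the new tag are not interpreted.

```python
import re

def add_child_tag(text, ip, new_tag):
    """Insert new_tag just before the </tag1> that closes the tag1 with this ip."""
    # (?:(?!</tag1>).)* consumes characters one at a time, stopping right
    # before the first </tag1>; re.DOTALL lets "." cross line breaks.
    pattern = re.compile(
        r"<tag1 %s>(?:(?!</tag1>).)*" % re.escape(ip),
        re.DOTALL,
    )
    return pattern.sub(lambda m: m.group(0) + new_tag, text)

# Abbreviated sample input, shaped like the files in the question.
text = """<tag1 ip2>
  <tag2 *>
      sometext
  </tag2>
  sometext with multiple lines goes here
</tag1>

<tag1 ip3>
  sometext with multiple lines goes here
</tag1>
"""

new_tag = "  <tag3>\n     sometext that goes here\n  </tag3>\n"
result = add_child_tag(text, "ip2", new_tag)
print(result)
```

Only the block whose opening tag is exactly `<tag1 ip2>` is changed; the other blocks are left untouched.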

dsinghsf commented 1 year ago

This works, thanks for the help @mrabarnett . 🎉