Open samer1977 opened 6 months ago
@samer1977 Can you give an example?
OK, I was trying to solve the problem here: https://github.com/schibsted/jslt/issues/342 using a recursive function call, like this:
def capture-many(json, regex, key)
  let c = capture($json, $regex)
  let res = if ($c == {}) []
            else [$c] + capture-many(replace($json, get-key($c, $key), ""), $regex, $key)
  $res
capture-many(.body,"<img src=\"(?<url>[a-z])\">","url")
To do that I have to get each capture, store it in an array, then do the next capture recursively after purging the captured value from the string by replacing it with an empty string, and so on until no capture is left. I understand that there is a limitation: each capture has to be different, and you can only capture one key-value pair, which would have worked for this scenario.
The above function would have worked on a simpler scenario like this:
{ "body" : "<div class=\"intercom-container\"><img src=\"image1\">
}
However, once you introduce a more complex string such as a URL, it won't work, because replace treats the captured value as a regex.
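To make the failure concrete, here is a minimal Python sketch. Python's `re` module stands in for the regex engine behind JSLT's replace, and the sample URL is made up for illustration. The captured value is reused as a pattern, so characters such as `?` and `.` are read as regex operators and the text is never removed, which would keep the recursion from making progress.

```python
import re

# Hypothetical input: a tag whose URL contains regex metacharacters.
body = '<img src="https://example.com/a.png?expires=1">'
captured = "https://example.com/a.png?expires=1"

# Reusing the captured value as a regex: "g?" now means "an optional g",
# so the pattern no longer matches the literal "?" in the text.
after_regex = re.sub(captured, "", body)

# A plain literal replace removes the value as intended.
after_literal = body.replace(captured, "")

print(after_regex == body)  # True: nothing was removed
print(after_literal)        # <img src="">
```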
Sorry to be a nag, can you give us a challenging example? Are you talking about a URL that contains any one of these characters:
Yes. Can you make the recursive function above work on the original input without having to escape every regex char?
{ "body" : "<div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\">
How about this.
input
{ "body": "<div class='intercom-container'><img src=\"image1\"></img></div><div class=\"intercom-container\"><img src=\"image2\"></img></div><div class=\"intercom-container\"><img src=\"image3\" /><div class=\"intercom-container\"><img src=\"image4\"/><img src='^&*image5'/></div>"
}
You have different image tags:
<img src='image1'></img>
<img src="image2"></img>
<img src="image3" />
<img src="image4"/>
<img src='^&*image5'/>
Here's the transformation:
[ for (split (.body, "<img ")[1:])
capture (., "^src=\"(?<url>[^\"']+)\"")
]
No need to use recursion.
The trick is to split up the input on the "separator" "<img ".
Yes, there can be any funny characters in the src-attribute, even regexp "reserved" characters. We are capturing only what is in between src=" and the ending double quote.
The result is:
[ {
"url" : "image1"
}, {
"url" : "image2"
}, {
"url" : "image3"
}, {
"url" : "image4"
}, {
"url" : "^&*image5"
}
]
How did I come to this solution?
I first used only this:
split (.body, "<img ")
This gave me:
[
"<div class=\"intercom-container\">",
"src=\"image1\"></img></div><div class=\"intercom-container\">",
"src=\"image2\"></img></div><div class=\"intercom-container\">",
"src=\"image3\" /><div class=\"intercom-container\">",
"src=\"image4\"/>",
"src=\"^&*image5\"/></div>"
]
As you can see, the first element in the array does not have a "src=" at the beginning. So it has to be excluded, which changes the transformation to:
split (.body, "<img ")[1:]
Now that all elements start with "src=", the regexp becomes simple: basically anything between the quotes.
capture (., "^src=\"(?<url>[^\"']+)\"")
And now you wrap array processing around it resulting in "[ for ..... capture (...) ]".
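The split-then-capture idea can be sketched outside JSLT as well. The Python below is an illustration only, not JSLT's API: `capture_all` is a hypothetical helper, the sample string is shortened, and Python spells named groups `(?P<url>...)` where the regex above uses `(?<url>...)`. Note that the pattern above anchors on a double quote, so a single-quoted src attribute would not match it as written. This variant assumes either quote style should be accepted.

```python
import re

body = ('<div><img src="image1"></img></div>'
        '<div><img src="image3" /><img src=\'^&*image5\'/></div>')

# Accept either quote style around the attribute value (an assumption;
# the pattern in the thread only accepts double quotes).
pattern = re.compile(r"""^src=["'](?P<url>[^"']+)["']""")

def capture_all(text):
    results = []
    # Skip the first chunk: it precedes the first "<img " and has no src.
    for chunk in text.split("<img ")[1:]:
        m = pattern.match(chunk)
        # Mirror JSLT's capture, which yields {} when nothing matches.
        results.append(m.groupdict() if m else {})
    return results

urls = capture_all(body)
print(urls)
```

Because the value is matched with a negated character class and never reused as a pattern, characters like `^`, `&` and `*` inside the attribute need no escaping.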
First of all, your input is not properly formatted; you should get plenty of errors in the sandbox alone.
You are not properly escaping the double quotes in the body attribute.
Here's how it should be:
{ "body": "<div class=\"intercom-container\"><img src=\"image1\">. <div class=\"intercom-container\"><img src=\"image2\">"
}
Second, your regex is not right. All you need to express is that you want to capture all non-double-quote characters after <img src=" up to, but not including, the next double quote.
You want to exclude anything that is blue and capture only the orange string.
The part in purple, (?<url> at the start and ) after the '+', is only there to tell the regexp engine that you have a capturing group named url.
So, this below should work.
def capture-many (json, regex, key)
  let c = capture ($json, $regex)
  let res = if ($c == {}) []
            else [$c] + capture-many (replace($json, get-key($c,$key), ""),$regex,$key)
  $res
capture-many (.body, "<img src=\"(?<url>[^\"]+)\">", "url")
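For readers who want to trace the recursion step by step, here is a rough Python equivalent. It is a sketch under assumptions: `capture_many` is a hypothetical port, `re.search` plus `groupdict` stands in for JSLT's capture, and the named group is written `(?P<url>...)` as Python requires.

```python
import re

def capture_many(text, regex, key):
    m = re.search(regex, text)
    if m is None:  # corresponds to capture(...) == {} in JSLT
        return []
    c = m.groupdict()
    # Like JSLT's replace, re.sub treats c[key] as a regex, so this is
    # only safe while the captured value has no regex metacharacters.
    rest = re.sub(c[key], "", text)
    return [c] + capture_many(rest, regex, key)

body = '<div><img src="image1">. <div><img src="image2">'
result = capture_many(body, r'<img src="(?P<url>[^"]+)">', "url")
print(result)  # [{'url': 'image1'}, {'url': 'image2'}]
```

After "image1" is removed, the first tag becomes `<img src="">`, which `[^"]+` cannot match because it needs at least one character, so the next search moves on to "image2".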
A few words of advice.
Instead of [a-z], which only matches a single lower-case letter of the alphabet, you have to look at it differently.
What is in: anything orange above, that is, a sequence of 1 or more characters EXCEPT for a double quote. What is out: <img src=" at the beginning, and "...... at the end.
The "in" part is expressed as [^"]+, which is called a negated character class.
A good resource is https://www.regular-expressions.info/charclass.html
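A short Python comparison of the character classes discussed above, offered as an illustration with a made-up URL (Python writes the named group as `(?P<url>...)`):

```python
import re

tag = '<img src="https://example.com/IMG_2568.PNG?expires=1">'

# [a-z] without a quantifier matches exactly one lower-case letter.
one_letter = re.search(r'<img src="(?P<url>[a-z])">', tag)

# [a-z]+ still stops at the first character that is not a lower-case letter.
lowercase_run = re.search(r'<img src="(?P<url>[a-z]+)">', tag)

# [^"]+ matches one or more characters of any kind except a double quote.
negated = re.search(r'<img src="(?P<url>[^"]+)">', tag)

print(one_letter, lowercase_run)  # None None
print(negated.group("url"))       # https://example.com/IMG_2568.PNG?expires=1
```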
Good luck.
OK! Thanks for your detailed answer, I appreciate it. I will take everything you said into consideration and try to be more careful when posting data/code. I did not pay much attention to what I was pasting because I made it clear early on that this is all based on https://github.com/schibsted/jslt/issues/342, and that should have been your source. No excuse though; I will try to do better next time. I'm using all of the above and I understand regex very well, but sorry, I'm still human.
We are all here to learn from each other.
I created a PR, see #350.
No doubt having a regex replace is very powerful, but sometimes you want to do a simple literal string replace. Where I ran into a problem is when I wanted to replace a literal string that contains regex characters. I could not find an easy way to do that other than escaping all the regex reserved characters first, which can get cumbersome and inefficient.
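As an illustration of the idea rather than anything JSLT provides, many regex libraries can do the escaping for you: Python has `re.escape`, and on the JVM `java.util.regex.Pattern.quote` plays a similar role. The helper name `replace_literal` below is hypothetical.

```python
import re

def replace_literal(text, literal, replacement):
    # re.escape backslash-escapes every regex metacharacter in `literal`,
    # so a regex-based replace behaves like a plain string replace.
    # The lambda keeps backslashes in `replacement` from being interpreted.
    return re.sub(re.escape(literal), lambda _m: replacement, text)

sample = "a.b?c a.b?c axb"
cleaned = replace_literal(sample, "a.b?c", "")
print(repr(cleaned))  # '  axb'
```

Only the two literal occurrences of "a.b?c" are removed. Used as an unescaped pattern, the same string would match "axb" instead, because `.` matches any character and `b?` makes the "b" optional.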