scripting / Scripting-News

I'm starting to use GitHub for work on my blog. Why not? It's got good communication and collaboration tools. Why not hook it up to a blog?

I suck at regular expressions #215

Open scripting opened 3 years ago

scripting commented 3 years ago

I'm like an NBA player who for some reason can't shoot foul shots. Only with me it's regular expressions. I used to do them all the time. Now I'm no good at it for some reason. Please help. Thanks.

I want to find a url like this in a string:

https://t.co/B8vzgwK5iX

An example of its usage:

How to Wash Fruits and Vegetables. https://t.co/ROcD5Tsz4f #giftlink

In other words the substring will always begin with https://t.co/, will be followed by a string of uppercase and lowercase alpha and numeric characters, followed by the end of the string or a whitespace character.

I need to be able to get the actual url and be able to easily replace it with another string.

Thanks! :smile:

pdrlps commented 3 years ago

This may be too general, but /https:\/\/t.co\/\S+/gm could work?

Test here: https://regex101.com/r/3GlcXo/1

scotthansonde commented 3 years ago

You probably want to escape the dot between the t and co so you don't match https://taco/ 😃, i.e. /https:\/\/t\.co\/\S+/gm (Edited because I first said 'https://taco.com/', which is actually not matched 😛)
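A quick way to see the difference the escaped dot makes. This is only a sketch: the two patterns are the ones from the comments above (without the gm flags, which do not matter for a single test), and the test URLs are made up for illustration.

```javascript
// The two patterns from the thread; the sample URLs below are hypothetical.
const unescaped = /https:\/\/t.co\/\S+/;
const escaped = /https:\/\/t\.co\/\S+/;

const dotResults = {
    // An unescaped dot matches any character, including the "a" in "taco".
    unescapedMatchesTaco: unescaped.test ("https://taco/B8vzgwK5iX"),
    // With the dot escaped, only a literal "." is accepted.
    escapedMatchesTaco: escaped.test ("https://taco/B8vzgwK5iX"),
    escapedMatchesTco: escaped.test ("https://t.co/B8vzgwK5iX")
};
console.log (dotResults);
```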

scripting commented 3 years ago

I'm using @papascott's pattern, it's good -- but it needs to be a bit smarter.

Suppose I delete the space between the url and #giftlink

The pattern as is, will say that the url is https://t.co/ROcD5Tsz4f#giftlink

But # is not one of the possible characters in the url. It can contain alpha and numeric characters only.

gruber commented 3 years ago

/https:[/]{2}t[.]co[/][a-z0-9]+/igm should fix the issue with #'s.

Should match the literal text "https://t.co/" + one or more alphabetic/numerals and stop. The /i switch makes it case insensitive, so we don't have to specify both a-z and A-Z. I also put the characters that need escaping (/ and .) in character classes instead of backslashing them. I find backslashes hard to read.
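A minimal sketch of how this pattern could be used to pull out the link and swap in another string, which is what the original question asked for. The helper names and the "[link]" replacement text are assumptions for illustration, not code from the app.

```javascript
// The pattern from the comment above, minus the m flag, which has no effect
// here because the pattern contains no ^ or $ anchors.
const tcoPattern = /https:[/]{2}t[.]co[/][a-z0-9]+/ig;

// Hypothetical helper: returns every t.co link found in the text.
function findTcoLinks (text) {
    return text.match (tcoPattern) || [];
}

// Hypothetical helper: replaces every t.co link with the given string.
function replaceTcoLinks (text, replacement) {
    return text.replace (tcoPattern, replacement);
}

// The example from the question, with the space before #giftlink deleted.
const tweet = "How to Wash Fruits and Vegetables. https://t.co/ROcD5Tsz4f#giftlink";
console.log (findTcoLinks (tweet));
console.log (replaceTcoLinks (tweet, "[link]"));
```

Because the character class only allows letters and digits, the match stops before the # even when there is no whitespace.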

brianryer commented 3 years ago

t.co links are always 23 characters, so https?:\/\/(t\.co\/.{10}) should find any t.co link in the text at which you point this regex. And, actually, you probably always only want the secure protocol, so https:\/\/(t\.co\/.{10}) should do it, assuming, as we will, that no friction will derive from the particular regex flavor in your implementation.

https://regex101.com/r/6WNTvi/1 (the example here matches both http and https urls.)

scripting commented 3 years ago

Thank you all -- this one works.


Here's a snippet of code that illustrates how it works inside the app.

function replaceTcoLinkInText (callback) {
    var text = attsForNewNode.text;
    var pattern = /https:[/]{2}t[.]co[/][a-z0-9]+/igm;
    var result = pattern.exec (text);
    if (result == null) {
        callback ();
        }
    else {
        var url = result [0];
        text = utils.stringDelete (text, result.index + 1, url.length);
        derefUrl (url, function (err, urlDeref) {
            if (!err) {
                var linkForText = "<a href=\"" + urlDeref + "\">" + "[link]" + "</a>";
                text = utils.stringInsert (linkForText, text, result.index);
                attsForNewNode.text = text;
                }
            callback ();
            });
        }
    }

gruber commented 3 years ago

t.co links are always 23 characters, […]

I don't know if that's a safe assumption forever. Better not to hardcode a limit now that someday might grow to 24 characters.

stanwaring commented 3 years ago

Dave, regarding your June 30 and July 1st blog posts about t.co, I happened across this Twitter post. I don't know if it could be related to the solution you are looking for. https://twitter.com/workbenchdata/status/1093570886500679680?s=20 The Workbench table is here https://app.workbenchdata.com/workflows/8961/ Note the regex extractors in tab 1. Stan

scripting commented 3 years ago

The problem is solved. See note above.

scripting commented 3 years ago

I have a regular expression that matches a double-square bracket tag. For example:

Scripting News is the best [[blog]] in the world.

Here's the expression that I think does it.

/\[\[.*\]\]/

So here's a question for the regex experts...

Will this work?

brianryer commented 3 years ago

It will, but it will also grab occurrences of more than two square brackets. You can limit that with a quantifier in curly braces, e.g. {2}

jsit commented 3 years ago

Kind of -- it would also match all of [[blog]] in the [[world]]. You might need to do something like:

\[{2}[^\[^\]]*\]{2}
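A small sketch of the difference, using a made-up sentence with two tags. One assumption to flag: the negated class is written [^\[\]] here. Inside a character class only the first ^ negates, so the second ^ in the pattern above is a literal caret and the class ends up excluding ^ as well as the brackets.

```javascript
// Hypothetical sample text with two tags.
const sentence = "Scripting News is the best [[blog]] in the [[world]].";

// Greedy .* runs to the last ]] in the string, so both tags come back as one match.
const greedyMatches = sentence.match (/\[\[.*\]\]/g);

// A negated class cannot cross a bracket, so each tag is matched on its own.
const boundedMatches = sentence.match (/\[{2}[^\[\]]*\]{2}/g);

console.log (greedyMatches);
console.log (boundedMatches);
```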

scripting commented 3 years ago

@jsit -- thank you.

i'm working on the JS code to drive this, and it's working, but only returning the first match.

so i tried adding a gm at the end, but that didn't seem to make a difference.

/\[{2}[^\[^\]]*\]{2}/gm

Here's the actual code.

function processText (theText) {
   var pattern = /\[{2}[^\[^\]]*\]{2}/gm;
   var result = pattern.exec (theText);
   etc...
   }

And this is the text.

Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.

brianryer commented 3 years ago

Matches anything in double square brackets and returns the contents thereof.

https://regex101.com/r/u48ew4/1

AriT93 commented 3 years ago

I think you want to return result[0] instead of theText in your function. I think the regex is correct


function processText (theText) {
...    var pattern = /\[{2}[^\[^\]]*\]{2}/gm;
...    var result = pattern.exec (theText);
...    return result;
...    }
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I co")
[ '[[RSS]]',
  index: 15,
  input: "Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I co",
  groups: undefined ]

jsit commented 3 years ago

function processText(theText) {
  var pattern = /\[{2}.*?\]{2}/g;
  return theText.match(pattern);
}

console.log(processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs."));

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match

scripting commented 3 years ago

the problem was it was only finding the first match. i should not have left the return statement in there because it was confusing to some. you apparently thought that was the question i was asking.

to be clear -- i need all the matches, not just the first.

brianryer commented 3 years ago

This matches all pairs of double square brackets and returns groups with the contents therein (between the [['s and ]]'s):

 /\[{2}(.*?)\]{2}/gm

At least, it does with your sample text here: https://regex101.com/r/Gm4Lyy/1

But maybe the issue is with JavaScript...

jsit commented 3 years ago

@scripting My code should return an array of all the matches

AriT93 commented 3 years ago

ahh right. @jsit's sample does work. I also missed that your original regex is missing a capture group.

poc from @jsit

> function processText (theText) {
...     var pattern = /\[{2}(.*?)\]{2}/gm;
...     return theText.match(pattern);
...     }
undefined
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.");
[ '[[RSS]]', '[[FeedBurner]]' ]

ETA: your original regex would be \[{2}([^\[^\]]*)\]{2}


> function processText (theText) {
...     var pattern = /\[{2}([^\[^\]]*)\]{2}/gm; 
...     return theText.match(pattern);
...     }
undefined
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.");
[ '[[RSS]]', '[[FeedBurner]]' ]

brianryer commented 3 years ago

@AriT93 why the complicated negation string, ([^\[^\]]*)? I mean, what benefit is derived as opposed to the relatively easier to read (.*?)?

btw, not asking to be a smart-ass, I really like learning more elegant and powerful ways to cast my regex spells :-)

AriT93 commented 3 years ago

@brianryer only for completeness in the example to show the original regex still worked once the capture group was added.

also @scripting this might also be useful.

pattern.exec expects you to loop over the matches, while String.match gives you an array of all matches.

scripting commented 3 years ago

String.match doesn't tell you where the matches are, just what the matched strings are. At least that's what happens when I run it here. I'm doing this not just to find out what the patterns are (though i can see where that would be useful) -- I need to know where they are so I can do the substituting.
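For what it's worth, String.prototype.matchAll sits between the two: it returns every match, and each one carries the same index property that exec provides. A minimal sketch, assuming a runtime that supports matchAll (current browsers and Node 12 or later); the findTags name and the shape of the returned objects are made up for illustration.

```javascript
// Hypothetical helper: lists each double-bracket tag with its position in the text.
function findTags (theText) {
    // matchAll requires the g flag; the capture group holds the tag's contents.
    const pattern = /\[{2}([^\[\]]*)\]{2}/g;
    return Array.from (theText.matchAll (pattern), function (m) {
        return {
            tag: m [1],           // text between the brackets
            index: m.index,       // zero-based position of the opening [[
            length: m [0].length  // length of the whole match, brackets included
        };
    });
}

console.log (findTags ("Long ago, when [[RSS]] was starting to boom, there was [[FeedBurner]]."));
```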

scripting commented 3 years ago

OK thanks to all the help I think I have it working...

function processText (theText) {
    var pattern = /\[{2}[^\[^\]]*\]{2}/gm; //looks for double-square bracketed tags
    while (true) {
        var result = pattern.exec (theText);
        if (result == null) {
            break;
            }
        var matchstring = result [0];
        theText = stringDelete (theText, result.index + 1, matchstring.length);
        theText = stringInsert ("xxx", theText, result.index);
        }
    return (theText);
    }

Note that substituting with xxx is not what this is about, it's just where I'm stopping for the moment. :smile:
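One thing to watch in a loop like this: the text changes length on each pass, while the global regex keeps its lastIndex from the previous match, which refers to a position in the old string. An alternative sketch, offered as an assumption about what would fit here rather than as the code the app uses, lets String.replace do the bookkeeping by passing it a function. The makeReplacement parameter is hypothetical.

```javascript
// Hypothetical variant: the caller supplies a function that builds each replacement.
function processText (theText, makeReplacement) {
    const pattern = /\[{2}([^\[\]]*)\]{2}/g; // looks for double-square bracketed tags
    // replace calls the function once per match, passing the whole match,
    // the captured tag name, and the offset of the match in the original text.
    return theText.replace (pattern, function (matchstring, tagName, offset) {
        return makeReplacement (tagName, offset);
    });
}

const replaced = processText ("when [[RSS]] was new, and [[FeedBurner]] too", function (tagName) {
    return "xxx";
});
console.log (replaced);
```

Since replace works from the original string and assembles the result itself, replacements of a different length than the tag do not shift any positions.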

AriT93 commented 3 years ago

in that case it looks like you want to use /gmd to get the indices in the results, then loop over exec until it stops returning anything.

something like

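function processText (theText) {
     var pattern = /\[{2}([^\[^\]]*)\]{2}/gmd;
     var result;
     while ((result = pattern.exec (theText)) !== null) {
         // result.index is where the match starts; with the d flag,
         // result.indices [0] holds its [start, end] pair.
         // Use the index of the match here to substitute.
     }
}

brianryer commented 3 years ago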

Thanks @AriT93 and @scripting for insights on js and regex interplay.

and I noticed that both of the expressions below fail to disregard blank square-bracket pairs...

/\[{2}(.*?)\]{2}/gm
/\[{2}([^\[^\]]*)\]{2}/gmd

...but this one does ignore them, and also allows for the content to contain spaces.

/\[{2}([^\[^\]^\s].*?)\]{2}/gmd

brianryer commented 3 years ago

Because I was thinking about wiki style links I read a little bit about the syntax Wikipedia uses and learned that whitespace is trimmed from both ends, but allowed otherwise; I thought it prudent to do the same for cases where 'raw' wiki style markup text is being processed.

It still ignores "blank" links, i.e. square bracket pairs with only whitespace content.

https://regex101.com/r/DCalRn/1

/\[{2}\s?([^\s].*?)\s*\]{2}/gmd
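
A short sketch of what that pattern captures. The extractLinkTitles name and the sample strings are assumptions for illustration, and the d flag is left off because only the captured text is used here.

```javascript
// Pattern from the comment above, without the d flag (indices are not needed here).
const wikiLinkPattern = /\[{2}\s?([^\s].*?)\s*\]{2}/g;

// Hypothetical helper: returns the trimmed contents of each wiki style link.
function extractLinkTitles (text) {
    return Array.from (text.matchAll (wikiLinkPattern), function (m) {
        return m [1]; // the capture group excludes the surrounding whitespace
    });
}

console.log (extractLinkTitles ("See [[ Dave Winer ]] and [[RSS]]."));
console.log (extractLinkTitles ("Nothing to link in [[   ]]."));
```

Note that \s? accepts at most one whitespace character before the content, so a link with two or more leading spaces is skipped by this pattern rather than trimmed.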

scripting commented 3 years ago

@brianryer -- excellent. These are wiki style links. I will use what you've learned. Thanks. 😀

scripting commented 3 years ago

As promised here's the feature I was working on. Thanks for all the help! :smile:

http://scripting.com/2021/07/22/135636.html?title=taggingInScriptingNews

scripting commented 2 years ago

My next Regex query, for a feature I'm going to announce on Scripting News shortly, possibly later today.

  1. I'm doing this in JavaScript running in the browser.

  2. For an arbitrary paragraph on Scripting News.

  3. I want to highlight every occurrence of the word Michigan by enclosing it in a <span>, like this:

    <span class="spHighlightedText">Michigan</span>

  4. But not when it occurs within an HTML element, like this:

    <a href="http://michigan.com/index.html">Yo!</a>

  5. I have it working if I strip the markup from the text before processing but I want to see if I can do it without stripping the markup. I'm also prepared to write custom JS code without using Regex.

  6. Screen shot. This is what the result would look like except the links and styling would not be stripped.
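
One way the requirements above might be met without stripping the markup, sketched under some assumptions: it splits the paragraph into tags and text, and only touches the text pieces. The highlightWord name is made up, and the sketch assumes no > characters appear inside attribute values. It is not the code that ended up in Scripting News.

```javascript
// Hypothetical helper: wraps each occurrence of a word in a highlight span,
// leaving anything inside an HTML tag untouched.
function highlightWord (html, word) {
    // Escape regex metacharacters so the word is matched literally.
    const escapedWord = word.replace (/[.*+?^${}()|[\]\\]/g, "\\$&");
    const wordPattern = new RegExp ("\\b" + escapedWord + "\\b", "g");
    // Splitting on a capturing group keeps the tags in the result array,
    // at the odd indexes; the even indexes hold the text between them.
    const pieces = html.split (/(<[^>]+>)/);
    return pieces.map (function (piece, i) {
        if (i % 2 === 1) {
            return piece; // a tag, returned unchanged
        }
        return piece.replace (wordPattern, '<span class="spHighlightedText">$&</span>');
    }).join ("");
}

console.log (highlightWord ('Visit <a href="http://Michigan.com/index.html">Michigan</a> today.', "Michigan"));
```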

scripting commented 2 years ago

I wrote some JS code that does this without using regex. I'm going to try integrating it now.

scripting commented 2 years ago

Bing!

ttepasse commented 2 years ago

Small aside: HTML has had a dedicated element for marking up highlights for some years now, the mark element, which is more appropriate than a random span. Supported in all main browsers.

scripting commented 2 years ago

Another challenge for the regex crew! :-)

I have a JavaScript function called stripMarkup that removes all the HTML from a string.

function stripMarkup (s) {
    return (s.replace (/(<([^>]+)>)/ig, ""));
    }

I use it everywhere. But now I'd like something that does a bit less. It would strip all HTML but one element, say a <p> and of course </p>. What would that look like?

Then one more improvement, a function that takes a list of elements that it leaves in place, so I could leave in place all <a>'s and <p>'s.

I will of course share the result for all to use.

Thanks!

PS: This is for a Node app, server app.

brentashley commented 2 years ago

I think my approach would be to replace the items to preserve with tokens that won't be affected by the stripMarkup function, and then restore them afterwards. I parameterized the uniqueString so it can be specified to avoid collisions with original string.

This code is not tested



strippedString = stripMarkupExceptList( myHtmlString, ['<p>','</p>','<a>','</a>'], '~+~' );

function stripMarkupExceptList( targetString, itemList, uniquePrefix ){
  var workString = preserveTokens( targetString, itemList, uniquePrefix );
  workString = stripMarkup( workString );
  workString = restoreTokens( workString, itemList, uniquePrefix );
  return workString;
  }

// Note: items are matched as exact strings, so '<a>' does not cover <a href="...">.
function preserveTokens( targetString, itemList, uniquePrefix ){
  var workString = targetString;
  for( var index = 0; index < itemList.length; index++ ){
    // split/join replaces every occurrence, not just the first;
    // the trailing prefix keeps token 1 from being confused with token 10
    workString = workString.split( itemList[index] ).join( uniquePrefix + index + uniquePrefix );
    }
  return workString;
  }

function restoreTokens( targetString, itemList, uniquePrefix ){
  var workString = targetString;
  for( var index = 0; index < itemList.length; index++ ){
    workString = workString.split( uniquePrefix + index + uniquePrefix ).join( itemList[index] );
    }
  return workString;
  }

function stripMarkup( s ){
  return (s.replace (/(<([^>]+)>)/ig, ""));
  }

scripting commented 2 years ago

@brentashley -- I considered that and might still go that way if there's no regex approach.

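mcenirm commented 2 years ago

/(<(?!\/?(a|p)\b)[^>]*>)/ig

Rough explanation:

  • <(?!...) - match < unless followed by ... (aka "negative lookahead")
  • \/?(...)\b - ... a tag from the enclosed list
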
Regex in action: https://regex101.com/r/UULmUw/1

Edit: The \b means that a valid hyphenated tag (eg, <p-whatever ...>) would trick it.
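
To cover the second half of the question, a function that takes a list of elements to leave in place, the same negative lookahead can be built from an array. This is a sketch only: the stripMarkupKeeping name is made up, it assumes tag names consist of letters and digits, and it inherits the hyphenated-tag caveat above.

```javascript
// Hypothetical helper: strips all HTML tags except those named in keepList.
function stripMarkupKeeping (s, keepList) {
    if (keepList.length === 0) {
        // Nothing to keep, so fall back to stripping every tag.
        return s.replace (/<[^>]+>/g, "");
    }
    // Drop anything that is not a letter or digit so the names are safe in a regex.
    const names = keepList.map (function (name) {
        return name.replace (/[^a-z0-9]/ig, "");
    }).join ("|");
    // Same shape as the pattern above: match a tag unless it opens or closes a kept element.
    const pattern = new RegExp ("<(?!\\/?(?:" + names + ")\\b)[^>]*>", "ig");
    return s.replace (pattern, "");
}

console.log (stripMarkupKeeping ('<div><p>Hi <b>there</b> <a href="x">link</a></p></div>', ["a", "p"]));
```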

scotthansonde commented 2 years ago

Another approach would be to parse the HTML into a JavaScript object, then manipulate the object to keep the tags you want to keep. But what do I know? :-)

scripting commented 2 years ago

@scotthansonde -- this is happening on the server, so there is no HTML parser.

I really want to stick with regex, it's been used in this role for many years, really well burned in.

scripting commented 2 years ago

@mcenirm -- thanks! :-)

Trying to read the text without paragraph breaks was driving me crazy. ;-)

Now I have to find a good way to test this against real world data, the item-level descriptions in feeds.

brijwhiz commented 2 years ago

/(<(?!\/?(a|p)\b)[^>]*>)/ig

Rough explanation:

  • <(?!...) - match < unless followed by ... (aka "negative lookahead")
  • \/?(...)\b - ... a tag from the enclosed list

Regex in action: https://regex101.com/r/UULmUw/1

Edit: The \b means that a valid hyphenated tag (eg, <p-whatever ...>) would trick it.

Thank you, that's a new thing I learned today.

kwebble commented 2 years ago

It's not a regex, but the package sanitize-html can do this.

I have used it as one of the steps to clean up the description of RSS items to only keep a predefined set of HTML tags and attributes.

scripting commented 2 years ago

@kwebble -- thanks! that looks perfect and of course that's exactly the application I have in mind.

If you want to share any of the code you use to call it, including the predefined set of HTML tags you use, that would be welcome.

Update: I just skimmed the docs and it looks pretty simple and straightforward. ;-)

scripting commented 2 years ago

I'm using sanitize-html in my application. It's deployed and seems to be working well. ;-)

Here's a blog post explaining.

http://scripting.com/2022/09/11/145550.html?title=newStrategyForFeedText

scripting commented 2 years ago

Here's a link that opens in Drummer. It contains the package.json, a template file and the JS code that reads an RSS feed and produces an HTML rendering of the items in the feed after running the description text through sanitize-html with the p's removed. Once written it made it easy for me to try out a lot of different combinations for the options, but I settled on the simplest.

http://drummer.scripting.com/?url=http://scripting.com/publicfolder/misc/stripMarkupInFeedItems/source.opml

It uses the reallySimple package to read the feed, and is thus a good demonstration of how easy that is.

I usually would have done the work to put it into a GitHub repo, but this is a lot easier for me, because this is how I work on my code.

If people can read the code in the outliner (you can!), that will make it possible for me to share a lot more code. :-)