Open scripting opened 3 years ago
This may be too general, but /https:\/\/t.co\/\S+/gm could work?
Test here: https://regex101.com/r/3GlcXo/1
You probably want to escape the dot between the t and co so you don't match https://taco/, i.e. /https:\/\/t\.co\/\S+/gm
(Edited because I first said 'https://taco.com/', which is actually not matched.)
I'm using @papascott's pattern, it's good -- but it needs to be a bit smarter.
Suppose I delete the space between the url and #giftlink.
The pattern as is will say that the url is https://t.co/ROcD5Tsz4f#giftlink
But # is not one of the possible characters in the url. It can contain alpha and numeric characters only.
/https:[/]{2}t[.]co[/][a-z0-9]+/igm
should fix the issue with #'s.
Should match the literal text "https://t.co/" + one or more alphabetic/numerals and stop. The /i switch makes it case insensitive, so we don't have to specify both a-z and A-Z. I also put the characters that need escaping -- / and . -- in character classes instead of backslashing them. I find backslashes hard to read.
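A quick way to see the difference between the two patterns is to run both against the sample tweet with the space removed. This is only a sketch; the helper name findTcoLinks is made up for illustration.

```javascript
// Hypothetical helper: returns every substring of text that the pattern matches.
function findTcoLinks(text, pattern) {
    return text.match(pattern) || [];
}

const loosePattern = /https:\/\/t\.co\/\S+/gm;              // anything up to the next whitespace
const strictPattern = /https:[/]{2}t[.]co[/][a-z0-9]+/igm;  // letters and digits only

const sample = "How to Wash Fruits and Vegetables. https://t.co/ROcD5Tsz4f#giftlink";

// The loose pattern keeps going through "#giftlink"; the strict one stops at "#".
console.log(findTcoLinks(sample, loosePattern));
console.log(findTcoLinks(sample, strictPattern));
```

The first call reports the url with #giftlink still attached, the second reports just https://t.co/ROcD5Tsz4f.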
t.co links are always 23 characters, so https?:\/\/(t\.co\/.{10}) should find any t.co link in the text at which you point this regex. And, actually, you probably always only want the secure protocol, so https:\/\/(t\.co\/.{10}) should do it, assuming, as we will, that no friction will derive from the particular regex flavor in your implementation.
https://regex101.com/r/6WNTvi/1 (the example here matches both http and https urls.)
Thank you all -- this one works.
Here's a snippet of code that illustrates how it works inside the app.
function replaceTcoLinkInText (callback) {
    var text = attsForNewNode.text;
    var pattern = /https:[/]{2}t[.]co[/][a-z0-9]+/igm;
    var result = pattern.exec (text);
    if (result == null) {
        callback ();
    }
    else {
        var url = result [0];
        text = utils.stringDelete (text, result.index + 1, url.length);
        derefUrl (url, function (err, urlDeref) {
            if (!err) {
                var linkForText = "<a href=\"" + urlDeref + "\">" + "[link]" + "</a>";
                text = utils.stringInsert (linkForText, text, result.index);
                attsForNewNode.text = text;
            }
            callback ();
        });
    }
}
t.co links are always 23 characters, […]
I don't know if that's a safe assumption forever. Better not to hardcode a limit now that someday might grow to 24 characters.
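To illustrate the concern, here is a small sketch comparing the fixed-length pattern with the character-class one on a link whose path is 11 characters long. The 11-character url is invented purely for the comparison; nothing in this thread says t.co has ever issued one. firstMatch is a hypothetical helper.

```javascript
const fixedLength = /https:\/\/(t\.co\/.{10})/g;
const anyLength = /https:\/\/t\.co\/[a-z0-9]+/gi;

// Invented example: an 11-character path, one more than the fixed pattern expects.
const longerLink = "See https://t.co/ABCDEFGHIJK for details";

function firstMatch(text, pattern) {
    pattern.lastIndex = 0; // patterns with the g flag remember where they stopped
    const result = pattern.exec(text);
    return result === null ? null : result[0];
}

// The fixed-length pattern silently drops the last character of the path.
console.log(firstMatch(longerLink, fixedLength));
// The character-class pattern takes the whole path.
console.log(firstMatch(longerLink, anyLength));
```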
Dave, regarding your June 30 and July 1 blog posts about t.co: I happened across this Twitter post. I don't know if it could be related to the solution you are looking for. https://twitter.com/workbenchdata/status/1093570886500679680?s=20 The Workbench table is here: https://app.workbenchdata.com/workflows/8961/ Note the Regex extractors in tab 1. Stan
The problem is solved. See note above.
I have a regular expression that matches a double-square bracket tag. For example:
Scripting News is the best [[blog]] in the world.
Here's the expression that I think does it.
/\[\[.*\]\]/
So here's a question for the regex experts...
Will this work?
It will, but it will also grab occurrences of more than two square brackets. You can limit that with a quantifier in curly braces, e.g. {2}
Kind of -- it would also match all of [[blog]] in the [[world]]. You might need to do something like:
\[{2}[^\[^\]]*\]{2}
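A small sketch of the difference, using a sentence with two tags in it (the sentence is adapted from the example above; the variable names are just for illustration):

```javascript
const sentence = "Scripting News is the best [[blog]] in the [[world]].";

const greedy = /\[\[.*\]\]/g;            // .* runs on to the last ]] in the line
const negated = /\[{2}[^\[^\]]*\]{2}/g;  // the class cannot cross a bracket

// One match covering everything from the first [[ to the last ]].
console.log(sentence.match(greedy));
// Two separate matches, one per tag.
console.log(sentence.match(negated));
```

One detail worth knowing: inside a character class only the first ^ means "not", so the second ^ in [^\[^\]] is a literal caret and the class also refuses to match ^. Writing it as [^\[\]] would behave the same for text without carets.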
@jsit -- thank you.
i'm working on the JS code to drive this, and it's working, but only returning the first match.
so i tried adding a gm at the end, but that didn't seem to make a difference.
/\[{2}[^\[^\]]*\]{2}/gm
Here's the actual code.
function processText (theText) {
var pattern = /\[{2}[^\[^\]]*\]{2}/gm;
var result = pattern.exec (theText);
etc...
}
And this is the text.
Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.
Matches anything in double square brackets and returns the contents thereof.
I think you want to return result[0] instead of theText in your function. I think the regex is correct
function processText (theText) {
... var pattern = /\[{2}[^\[^\]]*\]{2}/gm;
... var result = pattern.exec (theText);
... return result;
... }
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I co")
[ '[[RSS]]',
index: 15,
input: "Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I co",
groups: undefined ]
> ```
function processText(theText) {
var pattern = /\[{2}.*?\]{2}/g;
return theText.match(pattern);
}
console.log(processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs."));
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match
the problem was it was only finding the first match. i should not have left the return statement in there because it was confusing to some. you apparently thought that was the question i was asking.
to be clear -- i need all the matches, not just the first.
This matches all pairs of double square brackets and returns groups with the contents therein (between the [['s and ]]'s):
/\[{2}(.*?)\]{2}/gm
At least, it does with your sample text here: https://regex101.com/r/Gm4Lyy/1
But maybe the issue is with JavaScript...
@scripting My code should return an array of all the matches
ahh right. @jsit sample does work. I also missed that your original regex is missing a capture group.
poc from @jsit
> function processText (theText) {
... var pattern = /\[{2}(.*?)\]{2}/gm;
... return theText.match(pattern);
... }
undefined
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.");
[ '[[RSS]]', '[[FeedBurner]]' ]
ETA: your original regex would be \[{2}([^\[^\]]*)\]{2}
> function processText (theText) {
... var pattern = /\[{2}([^\[^\]]*)\]{2}/gm;
... return theText.match(pattern);
... }
undefined
> processText("Long ago, when [[RSS]] was starting to boom, I was often surprised when a new product came out and I didn't hear about it until it was announced publicly. The one I remember best was [[FeedBurner]]. You'd think that they would want my endorsement, and to give them a chance to answer technical questions before the press started asking them. Maybe they had a developer story? Or maybe I could have helped with a design decision? At that time Scripting News was pretty well-read in the developer community. It always felt like they must have been hiding something, but if they were I never found it. Yes FeedBurner centralized a technology that was good because it was decentralized, but I don't think my saying that would have hurt them. What was even more strange is I had friends at the company, people from the blogging world and from other RSS devs.");
[ '[[RSS]]', '[[FeedBurner]]' ]```
@AriT93 why the complicated negation string ([^\[^\]]*)? I mean, what benefit is derived as opposed to the relatively easier to read (.*?)?
btw, not asking to be a smart-ass, I really like learning more elegant and powerful ways to cast my regex spells :-)
@brianryer only for completeness in the example to show the original regex still worked once the capture group was added.
also @scripting this might also be useful.
pattern.exec expects you to loop over the matches, while String.match gives you an array of all matches.
String.match doesn't tell you where the matches are, just what the matched strings are. At least that's what happens when I run it here. I'm doing this not just to find out what the patterns are (though i can see where that would be useful) -- I need to know where they are so I can do the substituting.
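One way to get both the strings and their positions is String.prototype.matchAll, which is available in current Node and browser versions. It returns an iterator of the same kind of result objects exec produces, so each one carries the matched text and its index. A minimal sketch; findTags is a hypothetical name, not something from the code in this thread.

```javascript
function findTags(theText) {
    const pattern = /\[{2}[^\[^\]]*\]{2}/g; // matchAll requires the g flag
    const found = [];
    for (const result of theText.matchAll(pattern)) {
        // result[0] is the matched text, result.index is where it starts
        found.push({ text: result[0], index: result.index });
    }
    return found;
}

const tagSample = "Long ago, when [[RSS]] was starting to boom. The one I remember best was [[FeedBurner]].";
console.log(findTags(tagSample));
```

For the sample string, the first entry is [[RSS]] at index 15, the same index the exec output earlier in the thread reports.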
OK thanks to all the help I think I have it working...
function processText (theText) {
    var pattern = /\[{2}[^\[^\]]*\]{2}/gm; //looks for double-square bracketed tags
    while (true) {
        var result = pattern.exec (theText);
        if (result == null) {
            break;
        }
        var matchstring = result [0];
        theText = stringDelete (theText, result.index + 1, matchstring.length);
        theText = stringInsert ("xxx", theText, result.index);
    }
    return (theText);
}
Note that substituting with xxx is not what this is about, it's just where I'm stopping for the moment. :smile:
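One thing to watch in a loop like this: with the g flag, exec resumes from pattern.lastIndex, which was recorded against the string as it was before the edit. When the replacement is a different length than the match, the next search can start in the wrong place and skip a tag that follows closely. An alternative sketch is to pass a function to String.prototype.replace, which calls it once per match with the matched text and its offset and does the stitching itself. The names replaceTags and makeReplacement are hypothetical.

```javascript
function replaceTags(theText, makeReplacement) {
    const pattern = /\[{2}[^\[^\]]*\]{2}/g;
    // replace calls the function once per match, passing the match and its
    // offset in the original string, and assembles the result on its own.
    return theText.replace(pattern, function (match, offset) {
        return makeReplacement(match, offset);
    });
}

// Two tags back to back, both replaced.
const replaced = replaceTags("[[RSS]][[FeedBurner]] are tags", function () {
    return "xxx";
});
console.log(replaced);
```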
In that case it looks like you want /gmd to get the indices in the results, then loop over exec until it stops returning anything.
Something like:
function processText (theText) {
    var pattern = /\[{2}([^\[^\]]*)\]{2}/gmd;
    var result;
    while ((result = pattern.exec (theText)) !== null) {
        // get the index of the match here to substitute:
        // result.index is where the match starts, result.indices [0] is its [start, end] pair
    }
}
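For reference, a sketch of what the d flag adds. exec already reports where a match starts in result.index; with d (supported in recent Node and browser versions), result.indices also holds [start, end] pairs for the whole match and for each capture group, with end being the position just past the last character. listTagPositions is a hypothetical name.

```javascript
function listTagPositions(theText) {
    const pattern = /\[{2}([^\[^\]]*)\]{2}/gd;
    const positions = [];
    let result;
    while ((result = pattern.exec(theText)) !== null) {
        positions.push({
            tag: result[1],            // text inside the brackets
            whole: result.indices[0],  // [start, end] of the full [[...]] match
            inner: result.indices[1]   // [start, end] of the captured text
        });
    }
    return positions;
}

console.log(listTagPositions("when [[RSS]] was new"));
```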
Thanks @AriT93 and @scripting for insights on js and regex interplay.
and I noticed that both of the expressions below fail to disregard blank square-bracket pairs...
/\[{2}(.*?)\]{2}/gm
/\[{2}([^\[^\]]*)\]{2}/gmd
...but this one does ignore them, and also allows for the content to contain spaces.
/\[{2}([^\[^\]^\s].*?)\]{2}/gmd
Because I was thinking about wiki style links I read a little bit about the syntax Wikipedia uses and learned that whitespace is trimmed from both ends, but allowed otherwise; I thought it prudent to do the same for cases where 'raw' wiki style markup text is being processed.
It still ignores "blank" links, i.e. square bracket pairs with only whitespace content.
https://regex101.com/r/DCalRn/1
/\[{2}\s?([^\s].*?)\s*\]{2}/gmd
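A different way to get the same effect, offered as a sketch and not as a replacement for the pattern above: keep the regex simple and do the trimming and the blank check in code. This swaps in the plain negated-class pattern and String.prototype.trim; findWikiLinks is a hypothetical name.

```javascript
function findWikiLinks(theText) {
    const pattern = /\[\[([^\[\]]*)\]\]/g;
    const links = [];
    for (const result of theText.matchAll(pattern)) {
        const name = result[1].trim();  // drop whitespace at both ends
        if (name.length > 0) {          // skip [[]] and whitespace-only tags
            links.push({ name: name, index: result.index });
        }
    }
    return links;
}

console.log(findWikiLinks("[[ Scripting News ]] and [[]] and [[  ]] and [[RSS]]"));
```

With the sample string this keeps "Scripting News" and "RSS" and drops the two blank tags. The trade-off is that the blank-link rule lives in the JavaScript, not in the expression.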
@brianryer -- excellent. These are wiki style links. I will use what you've learned. Thanks.
As promised here's the feature I was working on. Thanks for all the help! :smile:
http://scripting.com/2021/07/22/135636.html?title=taggingInScriptingNews
My next Regex query, for a feature I'm going to announce on Scripting News shortly, possibly later today.
I'm doing this in JavaScript running in the browser.
For an arbitrary paragraph on Scripting News.
I want to highlight every occurrence of the word Michigan by enclosing it in a <span>, like this:
<span class="spHighlightedText">Michigan</span>
But not when it occurs within an HTML element, like this:
<a href="http://michigan.com/index.html">Yo!</a>
I have it working if I strip the markup from the text before processing but I want to see if I can do it without stripping the markup. I'm also prepared to write custom JS code without using Regex.
Screen shot. This is what the result would look like except the links and styling would not be stripped.
I wrote some JS code that does this without using regex. I'm going to try integrating it now.
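For anyone following along, here is one possible shape for this, not the code actually used on Scripting News. It splits the paragraph on tags and only touches the pieces between them. It assumes attribute values never contain a > character and that the word is plain text; highlightWord is a hypothetical name.

```javascript
function highlightWord(html, word) {
    // split with a capturing group keeps the tags in the result:
    // even positions hold text, odd positions hold the tags themselves.
    const pieces = html.split(/(<[^>]+>)/);
    // Escape regex metacharacters in the word before building a pattern from it.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const wordPattern = new RegExp("\\b" + escaped + "\\b", "g");
    for (let i = 0; i < pieces.length; i += 2) {
        pieces[i] = pieces[i].replace(wordPattern, '<span class="spHighlightedText">$&</span>');
    }
    return pieces.join("");
}

console.log(highlightWord('Go <a href="http://Michigan.com/">Michigan</a>!', "Michigan"));
```

In the sample call the word inside the href is left alone and the link text is wrapped in the span.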
Bing!
Small aside: HTML has had a dedicated element for marking up highlights for some years now, the mark element, which is more appropriate than a random span. Supported in all main browsers.
Another challenge for the regex crew! :-)
I have a JavaScript function called stripMarkup that removes all the HTML from a string.
function stripMarkup (s) {
return (s.replace (/(<([^>]+)>)/ig, ""));
}
I use it everywhere. But now I'd like something that does a bit less. It would strip all HTML but one element, say a <p> and of course </p>. What would that look like?
Then one more improvement, a function that takes a list of elements that it leaves in place, so I could leave in place all <a>'s and <p>'s.
I will of course share the result for all to use.
Thanks!
PS: This is for a Node app, server app.
I think my approach would be to replace the items to preserve with tokens that won't be affected by the stripMarkup function, and then restore them afterwards. I parameterized the uniqueString so it can be specified to avoid collisions with original string.
This code is not tested
// note: '<a>' only matches a bare <a>, not one with attributes
strippedString = stripMarkupExceptList( myHtmlString, ['<p>','</p>','<a>','</a>'], '~+~' );
function stripMarkupExceptList( targetString, itemList, uniquePrefix ){
    var workString = preserveTokens( targetString, itemList, uniquePrefix );
    workString = stripMarkup( workString );
    workString = restoreTokens( workString, itemList, uniquePrefix );
    return workString;
}
// Swap each item to preserve for a placeholder that stripMarkup will not touch.
function preserveTokens( targetString, itemList, uniquePrefix ){
    var workString = targetString;
    for( var index=0; index < itemList.length; index++ ){
        // split/join replaces every occurrence; the trailing prefix keeps token 1 distinct from token 10
        workString = workString.split(itemList[index]).join(uniquePrefix + index + uniquePrefix);
    }
    return workString;
}
// Put the preserved items back in place of their placeholders.
function restoreTokens( targetString, itemList, uniquePrefix ){
    var workString = targetString;
    for( var index=0; index < itemList.length; index++ ){
        workString = workString.split(uniquePrefix + index + uniquePrefix).join(itemList[index]);
    }
    return workString;
}
function stripMarkup( s ){
    return (s.replace (/(<([^>]+)>)/ig, ""));
}
@brentashley -- I considered that and might still go that way if there's no regex approach.
/(<(?!\/?(a|p)\b)[^>]*>)/ig
Rough explanation:
- <(?!...) - match < unless followed by ... (aka "negative lookahead")
- \/?(...)\b - ... a tag from the enclosed list
Regex in action: https://regex101.com/r/UULmUw/1
Edit: The \b means that a valid hyphenated tag (e.g., <p-whatever ...>) would trick it.
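To cover the "list of elements" part of the question, the same idea can be wrapped in a function that builds the expression from an array of tag names. This is a sketch under a few assumptions: the tag names are plain letters and digits (no escaping is done), and the \b is swapped for a lookahead that requires whitespace, / or > after the name, which is my own tweak to avoid the hyphenated-tag case. stripMarkupExcept is a hypothetical name.

```javascript
function stripMarkupExcept(s, allowedTags) {
    if (allowedTags.length === 0) {
        // Nothing to keep: same behavior as the original stripMarkup.
        return s.replace(/<[^>]+>/g, "");
    }
    // Assumes tag names are plain letters and digits, so no regex escaping is done.
    const names = allowedTags.join("|");
    // Match any tag unless it opens or closes one of the allowed names.
    const pattern = new RegExp("<(?!\\/?(?:" + names + ")(?=[\\s/>]))[^>]*>", "gi");
    return s.replace(pattern, "");
}

console.log(stripMarkupExcept('<p>Hi <b>there</b> <a href="x">link</a></p>', ["a", "p"]));
```

In the sample call the b tags are removed while the p and a tags, including the href attribute, stay in place.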
Another approach would be to parse the HTML into a JavaScript object, then manipulate the object to keep the tags you want to keep. But what do I know? :-)
@scotthansonde -- this is happening on the server, so there is no HTML parser.
I really want to stick with regex, it's been used in this role for many years, really well burned in.
@mcenirm -- thanks! :-)
Trying to read the text without paragraph breaks was driving me crazy. ;-)
Now I have to find a good way to test this against real world data, the item-level descriptions in feeds.
/(<(?!\/?(a|p)\b)[^>]*>)/ig
Rough explanation:
- <(?!...) - match < unless followed by ... (aka "negative lookahead")
- \/?(...)\b - ... a tag from the enclosed list
Regex in action: https://regex101.com/r/UULmUw/1
Edit: The \b means that a valid hyphenated tag (e.g., <p-whatever ...>) would trick it.
Thank you, that's a new thing I learned today.
It's not a regex, but the package sanitize-html can do this.
I have used it as one of the steps to clean up the description of RSS items to only keep a predefined set of HTML tags and attributes.
@kwebble -- thanks! that looks perfect and of course that's exactly the application I have in mind.
If you want to share any of the code you use to call it, including the predefined set of HTML tags you use, that would be welcome.
Update: I just skimmed the docs and it looks pretty simple and straightforward. ;-)
I'm using sanitize-html in my application. It's deployed and seems to be working well. ;-)
Here's a blog post explaining.
http://scripting.com/2022/09/11/145550.html?title=newStrategyForFeedText
Here's a link that opens in Drummer. It contains the package.json, a template file and the JS code that reads an RSS feed and produces an HTML rendering of the items in the feed after running the description text through sanitize-html with the p's removed. Once written it made it easy for me to try out a lot of different combinations for the options, but I settled on the simplest.
It uses the reallySimple package to read the feed, and is thus a good demonstration of how easy that is.
I usually would have done the work to put it into a GitHub repo, but this is a lot easier for me, because this is how I work on my code.
If people can read the code in the outliner (you can!), that will make it possible for me to share a lot more code. :-)
I'm like an NBA player who for some reason can't shoot foul shots. Only with me it's regular expressions. I used to do them all the time. Now I'm no good at it for some reason. Please help. Thanks.
I want to find a url like this in a string:
https://t.co/B8vzgwK5iX
An example of its usage:
How to Wash Fruits and Vegetables. https://t.co/ROcD5Tsz4f #giftlink
In other words the substring will always begin with https://t.co/, will be followed by a string of uppercase and lowercase alpha and numeric characters, followed by the end of the string or a whitespace character.
I need to be able to get the actual url and be able to easily replace it with another string.
Thanks! :smile: