Open thadguidry opened 3 months ago
We might see if we can improve the parsing and link creation by using a library first instead of directly trying to find everything in the body of the HTML through regex. Something like using https://github.com/chishui/JSSoup or bluntly like https://github.com/posthtml/posthtml-parser might help.
Hey @thadguidry, I'm not sure I understand the full scope of the question, but would the following regex work for your goal?
Q\d+
This does not account for the double quotes. To capture leading and/or trailing double quotes as well, you would need to wrap each end of this regex with \"*. To handle the captured text with a double quote and embed it into the href for the a tag, you will need to strip out the double quotes with something like .replace(/"/g, '') when building that string.
Hope this helps!
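For illustration, the suggestion above could be sketched roughly as follows. The names `ENTITY_BASE` and `linkQuotedIds` are hypothetical and not taken from linker.js; the base URL is the one shown in the issue description.

```javascript
// Sketch only: ENTITY_BASE and linkQuotedIds are hypothetical names.
const ENTITY_BASE = "https://wikidata.org/entity/";

// Q\d+ with optional double quotes on either side, as suggested above.
const quotedId = /"*Q\d+"*/g;

function linkQuotedIds(text) {
  return text.replace(quotedId, (match) => {
    // Strip the captured quotes before embedding the ID in the href.
    const id = match.replace(/"/g, "");
    return `<a href="${ENTITY_BASE}${id}" target="_blank">${match}</a>`;
  });
}

console.log(linkQuotedIds('That QID "Q12345" is not in our database'));
```

In this sketch the quotes stay inside the link text and only the href is cleaned up. Whether the quotes should sit inside or outside the link is a design choice the thread does not settle.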
From my analysis, I think the regex just needs to be simplified and use a positive lookbehind. Given the following scenarios likely encountered in the HTML of Wikidata pages:
<p>That QID "Q12345" is not in our database</p>
<span>Q12345</span>
<span> Q12345 </span>
https://wikidata.org/entity/Q12345
Q123 or 19Q1233 $Q1234 'Q123' `Q1234` <code>Q123</code>+Q123
So this regex seems to work better?
/(?<=[ "'])Q\d+/g
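A quick way to see what this pattern does with the scenarios listed above is to run it over each one. The helper name `findWithLookbehind` is made up for this sketch.

```javascript
// The lookbehind pattern from the comment above.
const lookbehindId = /(?<=[ "'])Q\d+/g;

// Hypothetical helper: returns every ID the pattern finds in one string.
function findWithLookbehind(text) {
  return text.match(lookbehindId) || [];
}

const scenarios = [
  '<p>That QID "Q12345" is not in our database</p>',
  "<span>Q12345</span>",
  "<span> Q12345 </span>",
  "https://wikidata.org/entity/Q12345",
  "Q123 or 19Q1233 $Q1234 'Q123' `Q1234` <code>Q123</code>+Q123",
];

for (const s of scenarios) {
  console.log(JSON.stringify(findWithLookbehind(s)), "<-", s);
}
```

With this pattern, the quoted and space-padded cases each yield Q12345, the last line yields only the single-quoted Q123, and both `<span>Q12345</span>` and the entity URL yield nothing, because `>` and `/` are not in the lookbehind set.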
Maybe so. Based on the examples provided, I imagine you would also want the following to match, but it doesn't look like it passes the condition.
<span>Q12345</span>
Could you provide more context of when exactly you are expecting a match and not expecting a match?
Edit: I believe that your solution will work, but may require more elements in the character set based on your use case. Overall, looks good.
For performance concerns on large pages, you may also want to start by grabbing all instances of a simpler regex and filtering it down.
I don't see performance issues with large pages. Apparently, JavaScript engines in browsers are now blazing fast with Regex especially if it is pre-compiled like I'm doing in https://github.com/thadguidry/wikidata-entity-linker/blob/main/linker.js
I do need help with the <span>Q1234</span> scenario. I tried some variations of the regex, but cannot seem to extract it when including >< in the character set. Thoughts?
Ah, I think I got it...
const regex = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;
That'll work but you can probably just make it single sided like the following.
(?<=[ "'>])Q\d+
I cannot, because then it won't pick up IDs wrapped in parentheses, which is another needed format I spotted...
This one (Q1234) should be merged with (Q5678), I think?
Thanks for the guidance @ryanrackemann very helpful !
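The difference between the two patterns on the parenthesized sentence can be checked with a small sketch. The helper name `idsIn` is hypothetical.

```javascript
// Two-sided pattern (lookbehind and lookahead) from the thread.
const twoSided = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;
// Single-sided alternative, lookbehind only, exactly as proposed.
const oneSided = /(?<=[ "'>])Q\d+/g;

// Hypothetical helper: all matches of a pattern in a string.
function idsIn(text, pattern) {
  return text.match(pattern) || [];
}

const sentence = "This one (Q1234) should be merged with (Q5678), I think?";

console.log(idsIn(sentence, twoSided)); // -> ["Q1234", "Q5678"]
console.log(idsIn(sentence, oneSided)); // -> [] because "(" is not in its lookbehind set
```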
Happy to help! Note that your regex is perfect assuming the text must be wrapped on both sides by a combination of those characters. The following seems to pass as well, which I wanted to point out in case it's a concern.
(Q12345<
Also note that cases where it is not wrapped but is still preceded or followed by those characters will not be caught like:
Q12345<
(Q12345
Also, could you assign us to this issue? Thanks again and let me know if I can help in any other way.
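The boundary cases mentioned above can be reproduced with the two-sided pattern. The helper name `matchWrapped` is made up for this sketch.

```javascript
// Two-sided pattern from earlier in the thread.
const wrappedId = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;

// Hypothetical helper: all matches in a string, or an empty array.
function matchWrapped(text) {
  return text.match(wrappedId) || [];
}

console.log(matchWrapped("(Q12345<")); // -> ["Q12345"], mismatched delimiters still pass
console.log(matchWrapped("Q12345<"));  // -> [], nothing precedes the Q
console.log(matchWrapped("(Q12345"));  // -> [], nothing follows the digits
```

The first case passes because the lookbehind and lookahead are checked independently: any allowed character before and any allowed character after is enough, whether or not they form a pair.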
Yeah, it seems I'm actually still being too greedy.
<span class="mw-headline" id="Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)_and_video_game_reboot_(Q111223304)"><span data-mw-comment-start="" id="h-Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)-20240117144100"></span>Deprecation of P31=<a href="/wiki/Q65963104" title="Q65963104">video game remaster <small>(Q65963104)</small></a>, <a href="/wiki/Q4393107" title="Q4393107">video game remake <small>(Q4393107)</small></a> and <a href="/wiki/Q111223304" title="Q111223304">video game reboot <small>(Q111223304)</small></a><span data-mw-comment-end="h-Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)-20240117144100"></span></span>
In the above span, I shouldn't be matching any of those QIDs wrapped in parentheses, because the parentheses do not have a space on either side, as they would in HTML text content inside spans that contain sentences.
my (Q1234) is nice
title="Q111223304"
P31=video_game_remaster_(Q65963104),
As an example, only the first one, Q1234, should match.
Here's one of my testing pages against the extension: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Video_games
@ryanrackemann If you could help with making those final few adjustments I would be extremely in your debt!
I'll take a look and see what I can do. Thanks!
How does this look?
(?<=>|^|[\^\s("]+)Q\d+(?=[$\s\w)"]*<)
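To get a feel for this pattern, it can be run against a few fragments modeled on the examples earlier in the thread. The fragments are shortened stand-ins, not the full page markup, and `matchProposed` is a hypothetical helper. Note that the lookahead requires a `<` later in the string, so the inputs need to be HTML rather than bare text.

```javascript
// The pattern proposed above, unchanged.
const proposed = /(?<=>|^|[\^\s("]+)Q\d+(?=[$\s\w)"]*<)/g;

// Hypothetical helper: all matches in an HTML fragment.
function matchProposed(html) {
  return html.match(proposed) || [];
}

const fragments = [
  "<p>my (Q1234) is nice</p>",                                           // sentence text
  '<a href="/wiki/Q111223304" title="Q111223304">video game reboot</a>', // attributes only
  '<span id="P31=video_game_remaster_(Q65963104),_x"></span>',           // shortened id attribute
  "<small>(Q65963104)</small>",                                          // text that may sit inside a link
];

for (const f of fragments) {
  console.log(JSON.stringify(matchProposed(f)), "<-", f);
}
```

The first fragment yields Q1234 and the two attribute-only fragments yield nothing. The last fragment yields Q65963104 because the pattern looks only at neighboring characters and cannot tell whether the text is nested inside an `<a>` element.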
This works! Except for the following additional missing constraint:
<a href="/wiki/Q65963104" title="Q65963104">video game remaster <small>(Q65963104)</small></a>
If a QID such as (Q65963104) or Q65963104 is inside of an <a></a> link, then it should not match.
Is there any way to also exclude text content wrapped inside of <a> links so that <a> links are left unchanged?
I believe this may be a bit too complicated to handle reasonably with a single regex. It might be better to piece out the logic and handle these situations separately. I recommend starting by gathering the page content, removing all <a> tags, then processing the text within the nodes to find the instances of the pattern. This may be a bit complicated, but I can't seem to find an effective solution with the regex alone.
This is just one of many ways to handle this, but it's just the first thing that came to mind. I'll let you know if anything changes or if I find success in the pattern. Best of luck!
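A minimal string-based sketch of the steps described above might look like this. It assumes simple, well-formed markup. `findIdsOutsideLinks` is a hypothetical name, and the final `\bQ\d+\b` pattern is a simplification rather than the pattern from linker.js.

```javascript
// Hypothetical helper, not part of linker.js.
function findIdsOutsideLinks(html) {
  // 1. Drop every <a>...</a> element together with its content.
  const withoutLinks = html.replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, " ");
  // 2. Drop the remaining tags (and their attributes), keeping only text.
  const textOnly = withoutLinks.replace(/<[^>]*>/g, " ");
  // 3. Look for IDs in what is left.
  return textOnly.match(/\bQ\d+\b/g) || [];
}

const sample =
  '<p>my (Q1234) is nice, see <a href="/wiki/Q65963104" title="Q65963104">' +
  "video game remaster <small>(Q65963104)</small></a></p>";

console.log(findIdsOutsideLinks(sample)); // -> ["Q1234"]
```

Inside a browser extension, walking text nodes with `document.createTreeWalker` and skipping any node whose parent matches `closest("a")` would likely be more robust than stripping tags with regex, since it avoids re-parsing HTML by hand.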
Problem
Links are currently breaking with an extra whitespace just before the Entity ID such as:
https://wikidata.org/entity/<a href="https://wikidata.org/entity/ Q12345" target="_blank"> Q12345</a>
example test page I hacked:
Likely Solution
The regex or code should be improved so that the match replacement variable itself does not include the leading whitespace as part of the regular expression.
Good to Have
Also match an ID wrapped in double quotes, such as "Q12345", and turn the ID into a link as well. For example, in some HTML text like <p>That QID "Q12345" is not in our database</p> it would be nice to see the Q12345 as a hyperlink. Perhaps this could be done directly in the regex with a pipe | OR condition and capture group(s)?
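One possible shape for this, sketched under the assumption that the input is plain text rather than markup that already contains links: use an alternation for the leading delimiter and a separate capture group for the ID, so the leading whitespace or quote never ends up in the href. `ENTITY_URL` and `linkBareIds` are hypothetical names.

```javascript
// Sketch only: ENTITY_URL and linkBareIds are hypothetical names.
const ENTITY_URL = "https://wikidata.org/entity/";

function linkBareIds(text) {
  // Group 1: start of string OR one whitespace/double-quote character.
  // Group 2: the bare ID, which is the only part placed in the link.
  return text.replace(/(^|[\s"])(Q\d+)/g, (whole, lead, id) =>
    `${lead}<a href="${ENTITY_URL}${id}" target="_blank">${id}</a>`);
}

console.log(linkBareIds(" Q12345"));
console.log(linkBareIds('That QID "Q12345" is not in our database'));
```

Because the leading character is captured separately and written back outside the anchor, the generated href contains only the ID, and a quoted ID keeps its quotes around the link rather than inside it.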