thadguidry commented 3 months ago

Problem

Links are currently breaking because an extra whitespace character is inserted just before the entity ID, for example: <a href="https://wikidata.org/entity/ Q12345" target="_blank"> Q12345</a>

Example test page I hacked together:

<html>
<body>
"Q1234578" and Q12345678
<p>
some text paragraph with Q12345678 and "Q12345678" in a sentence.
</p>
</body>
</html>

Likely Solution

The regex or code should be improved so that:

the match used as the replacement variable does not include the leading whitespace matched by the regular expression.
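
A minimal sketch of that idea, assuming the replacement is done with String.prototype.replace: the leading whitespace gets its own capture group and is emitted outside the link. The names linkQids and ENTITY_BASE are hypothetical and not taken from the extension's code.

```javascript
// Hypothetical helper, not the extension's actual code.
const ENTITY_BASE = "https://wikidata.org/entity/";

function linkQids(text) {
  // Group 1: start of string or one whitespace character (kept outside the link).
  // Group 2: the entity ID only, so no whitespace leaks into the href.
  return text.replace(/(^|\s)(Q\d+)\b/g, (match, lead, qid) =>
    `${lead}<a href="${ENTITY_BASE}${qid}" target="_blank">${qid}</a>`
  );
}

console.log(linkQids("see Q12345 here"));
// see <a href="https://wikidata.org/entity/Q12345" target="_blank">Q12345</a> here
```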

Good to Have

It would also be nice to detect double-quoted strings such as "Q12345" and turn the ID into a link as well. For example, given HTML text such as <p>That QID "Q12345" is not in our database</p>, it would be nice to see Q12345 as a hyperlink. Perhaps this could be done directly in the regex with a pipe (|) OR condition and capture group(s)?

thadguidry commented 3 months ago

We might see if we can improve the parsing and link creation by using a library first, instead of trying to find everything in the body of the HTML directly through regex. Something like https://github.com/chishui/JSSoup or, more bluntly, https://github.com/posthtml/posthtml-parser might help.

ryanrackemann commented 3 months ago

Hey @thadguidry, I'm not sure I understand the full scope of the question, but would the following regex work for your goal?

Q\d+

This does not account for the double quotes. To capture leading and/or trailing double quotes as well, you would need to wrap each end of this regex with \"*. To handle the captured text with a double quote and embed it into the href for the a tag, you will need to strip out the double quotes with something like .replace(/"/g, '') when building that string.
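
Roughly what that could look like, as a sketch only. The function name linkWithOptionalQuotes is made up for illustration, and the link markup is assumed from the example in the issue description.

```javascript
// Sketch of the suggestion above; linkWithOptionalQuotes is a hypothetical name.
const QUOTED_QID = /"*Q\d+"*/g;

function linkWithOptionalQuotes(text) {
  return text.replace(QUOTED_QID, (match) => {
    // Strip the captured double quotes before building the URL.
    const qid = match.replace(/"/g, "");
    return `<a href="https://wikidata.org/entity/${qid}" target="_blank">${match}</a>`;
  });
}

console.log(linkWithOptionalQuotes('That QID "Q12345" is not in our database'));
// That QID <a href="https://wikidata.org/entity/Q12345" target="_blank">"Q12345"</a> is not in our database
```

Note that with this approach the quotes end up inside the link text, which may or may not be what you want.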

Hope this helps!

thadguidry commented 3 months ago

From my analysis, I think the regex just needs to be simplified and use a positive lookbehind. Given the following scenarios likely to be encountered in the Wikidata HTML pages:

<p>That QID "Q12345" is not in our database</p>
<span>Q12345</span>
<span> Q12345 </span>
https://wikidata.org/entity/Q12345
 Q123 or 19Q1233 $Q1234 'Q123' `Q1234` <code>Q123</code>+Q123

So this regex seems to work better?

/(?<=[ "'])Q\d+/g
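
A quick way to check that pattern against the scenarios listed above (a throwaway sketch; the variable names are just for illustration):

```javascript
const lookbehindOnly = /(?<=[ "'])Q\d+/g;

const scenarios = [
  '<p>That QID "Q12345" is not in our database</p>',
  "<span>Q12345</span>",
  "<span> Q12345 </span>",
  "https://wikidata.org/entity/Q12345",
  " Q123 or 19Q1233 $Q1234 'Q123' `Q1234` <code>Q123</code>+Q123",
];

// For each scenario, list the IDs the pattern picks up (empty array if none).
const results = scenarios.map((s) => s.match(lookbehindOnly) ?? []);
console.log(results);
// [ ['Q12345'], [], ['Q12345'], [], ['Q123', 'Q123'] ]
```

The ID directly after a closing angle bracket, as in <span>Q12345</span>, is not picked up, because > is not in the lookbehind set.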

ryanrackemann commented 3 months ago

Maybe so. Based on the example provided, I would imagine you would want the following to match as well, but it doesn't look like it passes the condition.

Q12345

Could you provide more context of when exactly you are expecting a match and not expecting a match?

Edit: I believe that your solution will work, but may require more elements in the character set based on your use case. Overall, looks good.

ryanrackemann commented 3 months ago

If performance on large pages is a concern, you may also want to start by grabbing all matches of a simpler regex and then filtering them down.
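
Something along these lines, as a rough sketch. The helper name findQids and the allowed-character set are assumptions chosen to mirror the lookbehind discussed above.

```javascript
// Hypothetical two-pass helper: cheap scan first, context check second.
function findQids(text) {
  const allowedBefore = new Set([" ", '"', "'"]);
  return [...text.matchAll(/Q\d+/g)]
    // Keep only candidates whose preceding character is in the allowed set.
    .filter((m) => m.index > 0 && allowedBefore.has(text[m.index - 1]))
    .map((m) => m[0]);
}

console.log(findQids(" Q123 or 19Q1233 $Q1234 'Q123'"));
// [ 'Q123', 'Q123' ]
```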

thadguidry commented 3 months ago

I don't see performance issues with large pages. Apparently, JavaScript engines in browsers are now blazing fast with regex, especially if it is precompiled as I'm doing in https://github.com/thadguidry/wikidata-entity-linker/blob/main/linker.js

I do need help with the Q1234 scenario. I tried some variations of the regex, but I cannot seem to match it when I include > and < in the character set. Thoughts?

thadguidry commented 3 months ago

Ah, I think I got it...

const regex = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;

ryanrackemann commented 3 months ago

That'll work, but you can probably just make it single-sided, like the following.

(?<=[ "'>])Q\d+

thadguidry commented 3 months ago

I cannot, because then it won't pick up IDs wrapped in parentheses, another needed format I spotted...

This one (Q1234) should be merged with (Q5678), I think?
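
For comparison, here is a small check of both patterns from this thread against that sentence. It is only a sketch, and the variable names are illustrative.

```javascript
const singleSided = /(?<=[ "'>])Q\d+/g;
const twoSided = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;

const sentence = "This one (Q1234) should be merged with (Q5678), I think?";

console.log(sentence.match(singleSided)); // null, because "(" is not in the lookbehind set
console.log(sentence.match(twoSided));    // [ 'Q1234', 'Q5678' ]
console.log("<span>Q1234</span>".match(twoSided)); // [ 'Q1234' ]
```

As written, the single-sided pattern misses these because its character set has no opening parenthesis, not because it lacks a lookahead.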

thadguidry commented 3 months ago

Thanks for the guidance, @ryanrackemann, very helpful!

ryanrackemann commented 3 months ago

Happy to help! Note that your regex is perfect assuming the text must be wrapped on both sides by a combination of those characters. The following seems to pass as well, which I wanted to point out in case it's a concern.

(Q12345<

Also note that cases where the ID is not wrapped on both sides, but is still preceded or followed by one of those characters, will not be caught, for example: Q12345< or (Q12345
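
These edge cases can be confirmed with a short check against the two-sided pattern from the earlier comment (variable name is illustrative):

```javascript
const wrapped = /(?<=[ "'(>])Q\d+(?=[ "').<])/g;

console.log("(Q12345<".match(wrapped)); // [ 'Q12345' ], mismatched wrappers still pass
console.log("Q12345<".match(wrapped));  // null, nothing precedes the ID
console.log("(Q12345".match(wrapped));  // null, nothing follows the ID
```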

Also, could you assign us to this issue? Thanks again and let me know if I can help in any other way.

thadguidry commented 3 months ago

Yeah, it seems that I'm actually still being too greedy.

<span class="mw-headline" id="Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)_and_video_game_reboot_(Q111223304)"><span data-mw-comment-start="" id="h-Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)-20240117144100"></span>Deprecation of P31=<a href="/wiki/Q65963104" title="Q65963104">video game remaster <small>(Q65963104)</small></a>, <a href="/wiki/Q4393107" title="Q4393107">video game remake <small>(Q4393107)</small></a> and <a href="/wiki/Q111223304" title="Q111223304">video game reboot <small>(Q111223304)</small></a><span data-mw-comment-end="h-Deprecation_of_P31=video_game_remaster_(Q65963104),_video_game_remake_(Q4393107)-20240117144100"></span></span>

In the above span, I shouldn't be matching any of the QIDs wrapped in parentheses inside the id attributes, because those parentheses do not have a space on either side, as they would in HTML text content inside spans that contain sentences.

my (Q1234) is nice
title="Q111223304"
P31=video_game_remaster_(Q65963104),

As an example, only the first one, Q1234, should match.
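
One possible way to express that constraint, purely as a sketch: the pattern below is hypothetical and not from the extension. It assumes the ID, together with at most one wrapper character on each side, has to be delimited by whitespace or the string edges.

```javascript
// Hypothetical pattern, not from the thread: the ID, plus at most one wrapper
// character on each side, must be delimited by whitespace or the string edges.
const spaced = /(?<=(?:^|\s)[("']?)Q\d+(?=[)"'.,]?(?:\s|$))/g;

const cases = [
  "my (Q1234) is nice",
  'title="Q111223304"',
  "P31=video_game_remaster_(Q65963104),",
];

console.log(cases.map((s) => s.match(spaced) ?? []));
// [ ['Q1234'], [], [] ]
```

This only looks at the characters around the ID, so it would not pick up an ID sitting directly between tags such as <span>Q12345</span>.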

thadguidry commented 3 months ago

Here's one of my testing pages against the extension: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Video_games

thadguidry commented 3 months ago

@ryanrackemann If you could help with making those final few adjustments I would be extremely in your debt!

ryanrackemann commented 3 months ago

I'll take a look and see what I can do. Thanks!

ryanrackemann commented 3 months ago

How does this look?

(?<=>|^|[\^\s("]+)Q\d+(?=[$\s\w)"]*<)

thadguidry commented 3 months ago

This works! Except for the following additional missing constraint:

<a href="/wiki/Q65963104" title="Q65963104">video game remaster (Q65963104)</a>

If a QID such as (Q65963104) or Q65963104 is inside of an <a></a> link, then it should not match. Is there any way to also exclude text content wrapped inside of <a> links so that <a> links are left unchanged?
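
To make the gap concrete, here is a quick check of the pattern proposed above against that link (a sketch; variable names are illustrative):

```javascript
const proposed = /(?<=>|^|[\^\s("]+)Q\d+(?=[$\s\w)"]*<)/g;

const link =
  '<a href="/wiki/Q65963104" title="Q65963104">video game remaster (Q65963104)</a>';

// The href and title attributes are skipped, but the link text still matches once.
console.log(link.match(proposed)); // [ 'Q65963104' ]
console.log("<span>my (Q1234) is nice</span>".match(proposed)); // [ 'Q1234' ]
```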

ryanrackemann commented 3 months ago

I believe this may be a bit too complicated to reasonably handle with a single regex. It might be better to piece out the logic and handle these situations separately. I recommend starting by gathering the page content, removing all <a> tags, then processing the text within the nodes to find the instances of the pattern. This may be a bit complicated, but I can't seem to find an effective solution for the regex.
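
A rough sketch of that separation, assuming the input is available as an HTML string: one alternation consumes whole <a>...</a> elements and any other tag so they pass through untouched, and only the remaining text is linked. The helper name linkOutsideAnchors is hypothetical, the plain \bQ\d+\b pattern is a stand-in for whichever stricter pattern is settled on, and tags containing a literal > inside an attribute value would trip it up.

```javascript
// Hypothetical helper; works on an HTML string rather than the live DOM.
// Alternatives, in order: a whole <a>...</a> element, any other tag, a bare QID.
const SKIP_OR_QID = /<[aA]\b[^>]*>[\s\S]*?<\/[aA]>|<[^>]*>|\bQ\d+\b/g;

function linkOutsideAnchors(html) {
  return html.replace(SKIP_OR_QID, (match) => {
    // Anchors and tags (including their attributes) are returned unchanged.
    if (match.startsWith("<")) return match;
    return `<a href="https://wikidata.org/entity/${match}" target="_blank">${match}</a>`;
  });
}

const sample =
  '<p>my (Q1234) is nice, see <a href="/wiki/Q65963104" title="Q65963104">video game remaster (Q65963104)</a></p>';
console.log(linkOutsideAnchors(sample));
```

In the extension itself, walking the DOM text nodes and skipping anything inside an <a> element would probably be the sturdier equivalent of the same idea.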

This is just one of many ways to handle this, but it's just the first thing that came to mind. I'll let you know if anything changes or if I find success in the pattern. Best of luck!

thadguidry / wikidata-entity-linker

Regex issue where whitespace character is also added to URL link #1
