dale-wahl closed this issue 1 year ago
@dale-wahl I just pushed a commit that makes URL extraction from markdown text more robust. The function should now yield:
[
"https://youtu.be/rLZ2ZzoD-W0",
"https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU"
]
from your example. That said, this function was designed to handle mixed raw text, and sometimes markdown, for convenience. If you want to retrieve only the second URL, this function is ill-suited, and it may be time to use a dedicated markdown parser for that (or to add a `urls_from_markdown` function to `ural`).
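For illustration, here is a minimal sketch of what such a helper could look like. It assumes the input was an inline markdown link whose label is itself a URL, which is a guess based on the output above. `urls_from_markdown` is a hypothetical name here, not an existing `ural` function, and the regex only covers plain `[label](destination)` links (no titles, nested brackets, or reference-style links).

```python
import re

# Matches inline markdown links of the form [label](destination).
# Simplified on purpose: no nested brackets, titles, or <...> destinations.
MARKDOWN_LINK_RE = re.compile(r"\[([^\]]*)\]\(\s*([^)\s]+)\s*\)")


def urls_from_markdown(text):
    """Yield only the destination of each inline markdown link.

    Hypothetical helper, not part of ural's API.
    """
    for match in MARKDOWN_LINK_RE.finditer(text):
        yield match.group(2)


# Assumed input shape: the label and the destination are both URLs.
sample = (
    "Watch [https://youtu.be/rLZ2ZzoD-W0]"
    "(https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU)"
)

print(list(urls_from_markdown(sample)))
```

Unlike extraction from mixed raw text, this ignores the link label entirely, so only the destination URL comes back. A real markdown parser would be the safer choice for anything beyond simple links.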
Regarding your second example, I don't recognize this way of formatting URLs. It looks custom to me, so I think you do need to use or build a dedicated parser for this kind of thing.
Released as part of v0.40.1.
Thank you for your quick response! I think your solution makes the most sense, since they are obviously both URLs.
Thanks for this tool. We've been using it to extract URLs from various texts in our datasets and found that it failed here.
Results in:
I've also seen this fail on:
https://bit.ly/36ZDXpz:=:https://stirileprotv.ro/stiri/international/live-update-razboi-in-ucraina-macron-convoaca-un-nou-consiliu-de-aparare.html
but I think that's just a unique format, as opposed to something common like markdown.
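For a custom format like this one, a small dedicated splitter may be enough. The sketch below assumes `:=:` is a separator between a short URL and its expanded form, which is only a guess from this single example. `split_url_pair` is a hypothetical helper, not something `ural` provides.

```python
# Assumed separator, inferred from the single example above.
SEPARATOR = ":=:"


def split_url_pair(text, separator=SEPARATOR):
    """Split a 'short:=:long' string into its non-empty parts.

    Hypothetical helper; returns a one-item list if the separator is absent.
    """
    return [part.strip() for part in text.split(separator) if part.strip()]


raw = (
    "https://bit.ly/36ZDXpz:=:"
    "https://stirileprotv.ro/stiri/international/"
    "live-update-razboi-in-ucraina-macron-convoaca-un-nou-consiliu-de-aparare.html"
)

for url in split_url_pair(raw):
    print(url)
```

Each resulting part could then be passed through the regular URL extraction if further validation is needed.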