medialab / ural

A helper library full of URL-related heuristics.
GNU General Public License v3.0

urls_from_text does not work on markdown links #153

Closed: dale-wahl closed this issue 1 year ago

dale-wahl commented 1 year ago

Thanks for this tool. We've been using it to extract URLs from various text fields in our datasets and found that it fails on markdown links such as this one:

from ural import urls_from_text
test_text = "[https://youtu.be/rLZ2ZzoD-W0](https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU)"
list(urls_from_text(test_text))

Results in:

['https://youtu.be/rLZ2ZzoD-W0](https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU']

I've also seen it fail on this input: https://bit.ly/36ZDXpz:=:https://stirileprotv.ro/stiri/international/live-update-razboi-in-ucraina-macron-convoaca-un-nou-consiliu-de-aparare.html, but I think that's an unusual one-off format rather than something as common as markdown.

Yomguithereal commented 1 year ago

@dale-wahl I just pushed a commit to make URL extraction from markdown text more robust. The function should now yield:

[
  "https://youtu.be/rLZ2ZzoD-W0",
  "https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU"
]

from your example. That said, this function was designed to handle mixed raw text, which sometimes contains markdown, as a convenience. If you want to retrieve only the second URL (the link target), this function is ill-suited, and it may be time to use a dedicated markdown parser for that (or to add some urls_from_markdown function to ural).
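To illustrate the "link target only" case, here is a minimal sketch using only the standard library. The helper name link_targets_from_markdown and its regular expression are assumptions made for illustration; they are not part of ural, and the pattern only covers simple inline links of the form [text](target), not reference-style links or URLs containing parentheses. A real markdown parser would be more reliable.

```python
import re

# Hypothetical helper, NOT part of ural.
# Simplified pattern for inline links: [text](target "optional title").
# It does not handle reference-style links or parentheses inside the URL.
INLINE_LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def link_targets_from_markdown(text):
    """Yield only the target URL of each inline markdown link in `text`."""
    for match in INLINE_LINK_RE.finditer(text):
        yield match.group(1)


test_text = (
    "[https://youtu.be/rLZ2ZzoD-W0]"
    "(https://youtu.be/rLZ2ZzoD-W0?fbclid=IwAR3RdUNf4_yyYxIBbAspDj-86ckbpS9gjv3tn2rhYspmFJuSl_dlkD7AgyU)"
)

targets = list(link_targets_from_markdown(test_text))
```

With the example from this issue, targets should contain only the URL inside the parentheses, including its fbclid query string.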

Regarding your second example, I am not familiar with this way of formatting URLs and it looks custom to me, so I think you will indeed need to use or build a dedicated parser for this kind of input.
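As a rough sketch of what such a dedicated parser could look like, assuming the ":=:" token always separates two complete URLs (an assumption based only on the single example above), a plain split may be enough. The function name split_paired_urls is hypothetical and not part of ural.

```python
def split_paired_urls(text, separator=":=:"):
    """Split a 'url:=:url' string into its non-empty parts.

    Hypothetical helper, NOT part of ural. Assumes `separator` never
    appears inside either URL.
    """
    return [part.strip() for part in text.split(separator) if part.strip()]


paired = (
    "https://bit.ly/36ZDXpz:=:"
    "https://stirileprotv.ro/stiri/international/"
    "live-update-razboi-in-ucraina-macron-convoaca-un-nou-consiliu-de-aparare.html"
)

parts = split_paired_urls(paired)
```

Each resulting part could then be validated or passed through urls_from_text individually if the surrounding text is noisier than in this example.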

Yomguithereal commented 1 year ago

Released as part of v0.40.1.

dale-wahl commented 1 year ago

Thank you for your quick response! I think your solution makes the most sense, since both are obviously URLs.