URL Detection Problem - Githubissues

lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.

MIT License

244 stars, 61 forks

URL Detection Problem #82

Closed (ghost closed 1 year ago)

ghost commented 3 years ago

I tried to use the module to detect a link in the following string.

Link:https://www.google.com

but it failed to detect that there is a URL.

lipoja commented 3 years ago

Hi @Ricardolcm888, thanks for reporting it. However, this is not an easily fixed issue. The text is not typographically correct: there should be a space after the colon, as in "Link: https://example.com". And yes, I know, the internet is full of these mistakes and typos. Is it possible for you to pre-process the text somehow?

I will think about it; however, right now I do not see a general solution for this.
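The pre-processing suggested above could look roughly like the following. This is only a sketch, not part of URLExtract: the helper name `separate_glued_urls` and the set of characters treated as "glue" are assumptions made for illustration, and it only handles the `http://` and `https://` schemes.

```python
import re

# Zero-width match: a position right before "http://" or "https://" that is
# directly preceded by a word character or common punctuation (assumed set).
_GLUED_URL = re.compile(r"(?<=[\w:;,.!?])(?=https?://)", re.IGNORECASE)


def separate_glued_urls(text: str) -> str:
    """Insert a space before a scheme that is glued to the preceding token."""
    return _GLUED_URL.sub(" ", text)


print(separate_glued_urls("Link:https://www.google.com"))
# prints: Link: https://www.google.com
```

The cleaned text can then be passed to `find_urls` as usual. A scheme preceded by `/` or `=` is deliberately left alone, so URLs nested inside another URL's path or query string are not split apart.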

amoldavsky commented 2 years ago

Yes, this is in fact a problem:

from urlextract import URLExtract

extractor = URLExtract()
extractor.find_urls('earn $600 every week, work from home job:https://2.ua/YHfw38')

results in:

["job:https://2.ua/YHfw38"]

@lipoja RE: "The text is not typographically correct, there should be space after the first colon sign": well, what should be and what is sadly rarely coincide 🤣

I had to fix this for my ML pre-processing. It is a pretty straightforward fix; I will submit a PR shortly...
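Until a fix lands in the library, one possible caller-side workaround is to trim each extracted candidate down to its scheme. The sketch below is an assumption for illustration and is not the change proposed in the PR; `trim_to_scheme` is a hypothetical helper name.

```python
import re

_SCHEME = re.compile(r"https?://", re.IGNORECASE)


def trim_to_scheme(candidate: str) -> str:
    """Drop any prefix that precedes the first http(s) scheme in a candidate."""
    match = _SCHEME.search(candidate)
    if match is None or match.start() == 0:
        # No scheme at all, or the candidate already starts with one.
        return candidate
    return candidate[match.start():]


found = ["job:https://2.ua/YHfw38"]  # the result reported above
print([trim_to_scheme(url) for url in found])
# prints: ['https://2.ua/YHfw38']
```

Candidates without a scheme (for example a bare domain) pass through unchanged. One caveat: a scheme-less candidate that legitimately embeds a second URL in its path would be cut down to the embedded URL.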

amoldavsky commented 2 years ago

Here is the PR: https://github.com/lipoja/URLExtract/pull/120

@lipoja I would appreciate it if you could merge that in and cut a release, so I do not have to ship a production model that depends on my code change as a hack in a git branch :)

lipoja commented 2 years ago

@amoldavsky Thank you for contributing! Sure, I can merge it and release it. But before I do that, I would like to discuss a few of my ideas with you so that we do not break extraction or unintentionally filter out URLs that the current code extracts. Please have a look at your PR.

amoldavsky commented 2 years ago

Yup, I started a discussion in the PR

Stvad commented 1 year ago

Facing the same issue; I was curious what the state of the PR fixing this is! :)