lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

comma extracted at the end if url ends with comma #123

Closed amoldavsky closed 5 months ago

amoldavsky commented 2 years ago

This should not be the case:

>>> from urlextract import URLExtract
>>> extractor = URLExtract()
>>> extractor.find_urls("https://www.formpl.us/form/1653896001, work independently from home")

['https://www.formpl.us/form/1653896001,']

controldev commented 2 years ago

The same happens with dots (i.e. '.'), which is a relatively frequent case, for example when a sentence ends with a link.

lipoja commented 2 years ago

@amoldavsky @controldev Hello, and thank you for reporting this issue. I agree that this is not ideal. I would like to ask for your help in the form of a discussion, because I do not see an easy general solution to this problem. My suggestion would be post-processing.

The user (in this case, you) is the one using this tool and should know what kind of text is being processed. The user can therefore clean up the URLs by removing the extra comma where one is expected. This can be done with a simple .rstrip(',').
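
A minimal sketch of that post-processing step. The helper name `strip_trailing_punctuation` and its `chars` parameter are hypothetical and not part of URLExtract; the input list mirrors the output reported above and would normally come from `find_urls`.

```python
def strip_trailing_punctuation(urls, chars=","):
    """Return a new list with trailing characters from `chars` removed.

    Hypothetical helper for illustration; not part of the URLExtract API.
    """
    return [url.rstrip(chars) for url in urls]


# Same value as the output reported in this issue; in practice this list
# would be the result of URLExtract().find_urls(text).
found = ["https://www.formpl.us/form/1653896001,"]

cleaned = strip_trailing_punctuation(found)
print(cleaned)  # ['https://www.formpl.us/form/1653896001']
```

Passing `chars=",."` would also drop trailing dots, which is only safe when the input text is known not to contain URLs that legitimately end with one.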

Looking at this issue in general, I cannot simply remove every dot or comma at the end of a URL, because it might be part of the URL.
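
To illustrate the ambiguity, here is a small sketch. The helper `naive_strip` is hypothetical, and both URLs are only example inputs: one is followed by sentence punctuation, the other has a path that really ends with a dot.

```python
def naive_strip(url):
    """Blindly remove trailing dots and commas (hypothetical helper)."""
    return url.rstrip(".,")


# Here the dot is sentence punctuation, so stripping it is correct.
in_sentence = "https://example.com/page."

# Here the dot belongs to the path, so stripping it changes the URL.
legit = "https://en.wikipedia.org/wiki/Washington,_D.C."

print(naive_strip(in_sentence))  # https://example.com/page
print(naive_strip(legit))        # https://en.wikipedia.org/wiki/Washington,_D.C
```

Both inputs look the same to a generic rule, which is why the choice is left to the caller, who knows the kind of text being processed.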

However, I am open to discussion. Maybe you have a solution in mind that we can agree on and implement.

lipoja commented 5 months ago

Closing this issue since there has been no further discussion and a simple solution has been recommended to the user.