Double URL entries - Githubissues

mrichar1 / clipster

clipster - python clipboard manager

GNU Affero General Public License v3.0


Double URL entries #60

Open demure opened 7 years ago

demure commented 7 years ago

I'm not exactly sure how this happens, but I can reproduce it on my system with three different browsers, so I assume it is related to clipster. This has been happening for a while; I've just been too lazy to document it.

It looks like any time a top-level page's URL is copied, you get two entries in clipster. The second entry (the newer of the two) differs by having a trailing '/'.

Example:

https://www.google.com/
https://www.google.com
https://duckduckgo.com/
https://duckduckgo.com

This does not appear to happen for some URLs with content after the '/', depending on what I assume is a relationship to certain special characters like '.' and '-'.

Example of non double entry:

https://github.com/qutebrowser/qutebrowser/blob/master/INSTALL.asciidoc

Thanks for your time. -demure

mrichar1 commented 7 years ago

Hi - thanks for reporting this issue.

The problem comes from trying to come up with a simple but effective regular expression to 'find' urls within other strings.

At present the code uses: \b\S+://\S+\b - the problem here is that the trailing \b forces the match to end at a word boundary, and a final slash is not a word character, so the slash is dropped from the match.
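The behaviour can be reproduced with Python's `re` module alone. This is a minimal sketch; the variable names are illustrative and not taken from clipster's source:

```python
import re

# The pattern quoted in the comment above.
URL_PATTERN = re.compile(r"\b\S+://\S+\b")

copied = "https://www.google.com/"
extracted = URL_PATTERN.findall(copied)

# The final \b cannot match after "/", so the regex engine backtracks
# to the boundary between "m" and "/" and the slash falls outside the match.
print(extracted)
```

Because the extracted text differs from the copied text, both strings can end up as separate history entries.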

I've tried other alternatives, such as: \w+:\/\/\S+ - this solves the trailing slash problem; however, it instead treats all trailing punctuation as part of the url, so 'click the link (http://example.com) please.' results in 'http://example.com)'

The correct solution is probably to implement a match that follows the URI RFCs. However, this requires dropping a very large regex into the code, and there are many examples of these, each with varying levels of 'correctness'. See for example: https://mathiasbynens.be/demo/url-regex
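The trade-off can be checked the same way. A small sketch using only the `re` module, with illustrative variable names that are not from clipster:

```python
import re

# The alternative pattern quoted in the comment above.
ALT_PATTERN = re.compile(r"\w+:\/\/\S+")

sentence = "click the link (http://example.com) please."

# \S+ keeps consuming until whitespace, so the slash survives...
slash_matches = ALT_PATTERN.findall("https://www.google.com/")
# ...but so does the closing bracket after the URL.
alt_matches = ALT_PATTERN.findall(sentence)

print(slash_matches)
print(alt_matches)
```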

In this case, I'd suggest that anyone who really wants to perfectly match uris should drop one of these patterns into their patterns file and disable url matching in the config.

I'm of course very willing to accept a better regex that maintains as much simplicity as possible and removes this bug without introducing any other ones - suggestions welcome!
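One possible direction, offered only as a sketch: require the match to end on a word character or a slash, so that a trailing '/' survives while closing brackets and full stops do not. The pattern below is hypothetical and is not from clipster; it would still mishandle URLs that legitimately end in other punctuation.

```python
import re

# Hypothetical candidate: scheme, "://", then any non-space run that
# must end in a word character or "/".
CANDIDATE = re.compile(r"\w+://\S*[\w/]")

samples = [
    "https://www.google.com/",
    "click the link (http://example.com) please.",
    "https://github.com/qutebrowser/qutebrowser/blob/master/INSTALL.asciidoc",
]

# Collect the matches found in each sample string.
results = [CANDIDATE.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(sample, "->", found)
```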

demure commented 7 years ago

Would it be possible to use the deduplication feature to mitigate this?

When looking at the code, I see:

# Check for growing or shrinking, but ignore duplicates
if last_item and text != last_item and (text in last_item or last_item in text):

Since the issue seems to be that first https://www.google.com enters the clipboard and then https://www.google.com/ enters the clipboard after it, could we have a check for:

if last_item+'/' == text:

or some such?
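Spelled out as a standalone function, the proposed check might look like the sketch below. The helper name is hypothetical, and how it would hook into clipster's history handling is an assumption; it also covers the reverse order, in case the slash-terminated entry arrives first.

```python
def differs_only_by_trailing_slash(last_item, text):
    """Return True if the two strings are identical apart from one trailing '/'."""
    return text == last_item + "/" or last_item == text + "/"


# The pair from the report, in both orders, plus two non-matches.
print(differs_only_by_trailing_slash("https://www.google.com", "https://www.google.com/"))
print(differs_only_by_trailing_slash("https://www.google.com/", "https://www.google.com"))
print(differs_only_by_trailing_slash("https://www.google.com", "https://www.google.com"))
print(differs_only_by_trailing_slash("https://www.google.com", "https://duckduckgo.com/"))
```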

Sorry if I am misinterpreting anything here, and thanks for your time.

EDIT:

# Extracted patterns are added to the history before the selection, and the clipbaord buffer is left unchanged.

Makes it sound like the https://www.google.com is actually extracted... so I guess the above wouldn't be right then... though maybe the uri extraction could have a check for a one char change between original/extracted text?
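A rough sketch of that idea: drop an extracted pattern when it is just the original selection minus a trailing character. Both function names and the `max_dropped` parameter are hypothetical and not part of clipster.

```python
def is_trivial_extraction(original, extracted, max_dropped=1):
    """Return True when `extracted` is `original` minus at most `max_dropped` trailing characters."""
    dropped = len(original) - len(extracted)
    return original.startswith(extracted) and 0 <= dropped <= max_dropped


def filter_extracted(original, candidates):
    """Keep only the extracted patterns that differ meaningfully from the selection."""
    return [c for c in candidates if not is_trivial_extraction(original, c)]


# The extracted URL only lost its trailing slash, so it is filtered out.
print(filter_extracted("https://www.google.com/", ["https://www.google.com"]))

# A URL pulled out of a longer sentence is kept.
print(filter_extracted("click the link (http://example.com) please.", ["http://example.com"]))
```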

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mrichar1 commented 2 years ago

This may be fixed by #101 since the pattern matching regexes have been updated. Any feedback on whether this fixes things welcome!