pypidb issues

Continuing from #63, these are the known issues (the list will grow).

As a general rule, the higher-priority issues are those where urlextract doesn't extract valuable URLs, or extracts truncated URLs. Returning extra junk around URLs, or extra URLs, is problematic, but I can trim or remove junk. I can't fix data I don't have.

https://github.com/lipoja/URLExtract/issues/62 (high priority bug)
https://github.com/lipoja/URLExtract/issues/67 (medium priority enhancement)
https://github.com/lipoja/URLExtract/issues/36 (medium priority bug)
https://github.com/lipoja/URLExtract/issues/43 (annoying)
https://github.com/lipoja/URLExtract/issues/13 (bug, e.g. https://pypi.org/project/ebcdic/ )

Others I think are harder and may not be in urlextract scope:

Lots of annoying invalid .py domains have to be filtered out by DNS checking: filenames such as setup.py and manifest.py are assumed to be https://setup.py, https://manifest.py, etc. This is a significant performance problem for the first few requests, because these are negative DNS results that need to get cached, and they slow down urlextract as well. Other country-code TLDs occasionally coincide with file extensions too, giving https://manifest.in/ and http://readme.md/. This could be handled in dns_cache by seeding the DNS cache with known invalid entries. urlextract could help with domain name filtering.
Relative URLs (https://github.com/jayvdb/pypidb/issues/38): this would be a huge enhancement to URLExtract, but it requires adding a completely different extraction algorithm.
DoS/maximum results: https://github.com/lipoja/URLExtract/issues/69
e.target is really common, appearing in <script> blocks (e.g. http://docs.red-dove.com/cfg/python.html, reached via https://pypi.org/project/config), but I am not sure it would be useful to exclude URLs found in script tags.
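
The filename-like domains described above (setup.py, manifest.in, readme.md) could be screened out before any DNS lookup happens. The following is only a minimal sketch of that idea: the seed set and the function names are hypothetical, not part of urlextract or dns_cache, and a real list would need care because some of these names may be registered domains.

```python
from urllib.parse import urlsplit

# Hypothetical seed list: filenames that happen to parse as hostnames.
KNOWN_FILENAME_HOSTS = frozenset(
    {"setup.py", "manifest.py", "manifest.in", "readme.md"}
)


def is_filename_like(url):
    """Return True when the URL's host is a known filename, not a real site."""
    # urlsplit() only fills in the hostname when a scheme or "//" is present,
    # so bare candidates such as "setup.py" get a dummy scheme first.
    candidate = url if "://" in url else "http://" + url
    host = urlsplit(candidate).hostname or ""
    return host.lower() in KNOWN_FILENAME_HOSTS


def drop_filename_hosts(urls):
    """Keep only the URLs whose host is not in the seed list."""
    return [u for u in urls if not is_filename_like(u)]


candidates = [
    "setup.py",
    "https://manifest.in/",
    "http://readme.md/",
    "https://pypi.org/project/ebcdic/",
]
print(drop_filename_hosts(candidates))
```

Filtering before the DNS check avoids the slow negative lookups on the first requests; the same seed set could equally be loaded into the DNS cache as pre-resolved negatives.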

{{ in URL; pydevd-pycharm

DEBUG    pypidb._pypi:_pypi.py:313 processing Webpage: https://ci.appveyor.com/project/fabioz/pydev-debugger
DEBUG    pypidb._pypi:_pypi.py:379 @@ ran <function _url_extractor_wrapper at 0x7f03e2f1b5e0> on text size 7901 for 8 urls !!
DEBUG    pypidb._pypi:_pypi.py:384 extracted ['account.name', 'https://help.appveyor.com/', 'https://js.stripe.com/v2/', 'https://status.appveyor.com/', 'https://www.appveyor.com/docs/', 'https://www.appveyor.com/docs/server/', 'https://www.appveyor.com/updates/', 'https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40']
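
One possible post-processing step for results like the one logged above is to discard any extracted URL that still contains template placeholders. This is a sketch under the assumption that a URL containing {{ or }} is never usable; the helper names are hypothetical and are not pypidb or urlextract functions.

```python
def has_template_placeholder(url):
    """Return True when the URL contains unrendered {{ ... }} template markup."""
    return "{{" in url or "}}" in url


def drop_templated_urls(urls):
    """Keep only URLs without template placeholders."""
    return [u for u in urls if not has_template_placeholder(u)]


# A subset of the URLs from the log output above.
extracted = [
    "https://help.appveyor.com/",
    "https://www.appveyor.com/docs/",
    "https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}"
    "?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40",
]
print(drop_templated_urls(extracted))
```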

Backticks are not trimmed; this is related to https://github.com/lipoja/URLExtract/issues/13

'git://github.com/ingydotnet/package-py.git``'

so I use

_scm_url_cleaner.py:                repo = repo.strip("`")
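
As a self-contained illustration of that workaround applied to the example above (the function name here is hypothetical; only the strip call comes from _scm_url_cleaner.py):

```python
def clean_scm_url(repo):
    """Strip leading and trailing backticks left over from surrounding markup."""
    return repo.strip("`")


print(clean_scm_url("git://github.com/ingydotnet/package-py.git``"))
```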

lipoja / URLExtract

pypidb issues #68