Open sivanahamer opened 2 years ago
I do have a fix for the issue, though I don't have permissions to add the corresponding commit with the pull request in this repository.
You can post it here. I'll issue an update this weekend.
Thank you
In oscar.pyx
removed the URLs with the same prefix:
URL_PREFIXES = {
b'drupal.com': b'git.drupal.org',
b'kde.org': b'anongit.kde.org',
b'sourceforge.net': b'git.code.sf.net/p'
}
And then, added the case in the URL property within the Project
:
class Project(_Base):
...
@cached_property
def url(self):
""" Get the URL for a given project URI
>>> Project('CS340-19_lectures').url
'http://github.com/CS340-19/lectures'
"""
prefix, body = self.uri.split(b'_', 1)
if prefix == b'sourceforge.net':
platform = URL_PREFIXES[prefix]
elif prefix in URL_PREFIXES and b'_' in body:
platform = URL_PREFIXES[prefix]
body = body.replace(b'_', b'/', 1)
elif b'_' in body:
platform = prefix
body = body.replace(b'_', b'/', 1)
else:
platform = b'github.com'
body = self.uri.replace(b'_', b'/', 1)
return b'/'.join((b'https:/', platform, body))
P.D: I'm not sure if all URLs will have only one _
to replace.
This will brake github urls, as they always include an underscore separating the username from the project name. I'll think of some other solution, though.
Maybe a solution could be to check if the prefix has “.com”, “.net”, or other common internet domains? Like a url checker.
This is exactly what I am doing now, but there are a few border cases to deal with
Here is how git url is transformed into project id (bb, gl, and dr are just hostnames for ssh as indicated):
s|^bb:([^/]+)/|bitbucket.org_$1_|;
s|^gl:([^/]+)/|gitlab.com_$1_|;
s|^dr:([^/]+)/|drupal.com_$1_|;
s|^https://([^/]*)/([^/]*)/|$1_$2_|;
s|^https://([^/]*)/|$1_|;
In words: two first / are replaced by , if there is only one, it is also replaced by Simple rules to determine non-github origin:
some examples:
https://salsa.debian.org/science-team/plplot
https://gitlab.inria.fr/allgo/notebooks/speads
https://gitlab.freedesktop.org/aphogat/raytracing
https://pagure.io/alien
https://gitlab.inria.fr/guix-hpc/website
https://gitlab.inria.fr/grid5000/kwollect-dashboard
https://git.eclipse.org/r/henshin/org.eclipse.emft.henshin.git/
https://android.googlesource.com/platform/packages/apps/Test/connectivity
A list of prefixes update in Version T, U, S, Q:
android.googlesource.com
anongit.kde.org
blitiri.com.ar
code.ill.fr
code.qt.io
fedorapeople.org
forgemia.inra.fr
framagit.org
gcc.gnu.org
git.alpinelinux.org
git.bioconductor.org
git.code.sf.net
git.eclipse.org
git.kernel.org
git.openembedded.org
git.pleroma.social
git.postgresql.org
git.savannah.gnu.org
git.savannah.nongnu.org
git.torproject.org
git.unicaen.fr
git.unistra.fr
git.xfce.org
git.yoctoproject.org
git.zx2c4.com
gitbox.apache.org
gite.lirmm.fr
gitlab.adullact.net
gitlab.cerema.fr
gitlab.common-lisp.net
gitlab.fing.edu.uy
gitlab.freedesktop.org
gitlab.gnome.org
gitlab.huma-num.fr
gitlab.inria.fr
gitlab.irstea.fr
gitlab.ow2.org
invent.kde.org
notabug.org
pagure.io
repo.or.cz
salsa.debian.org
I was trying to create a url checker. A possibility could be to use a regex such as \.[A-Za-z]+$
. Also, there is a library that parses public domains, if needed be.
In code, with just the regex, it would be...
class Project(_Base):
...
@cached_property
def url(self):
""" Get the URL for a given project URI
>>> Project('CS340-19_lectures').url
'http://github.com/CS340-19/lectures'
"""
prefix, body = self.uri.split(b'_', 1)
if prefix == b'sourceforge.net':
platform = URL_PREFIXES[prefix]
elif prefix in URL_PREFIXES and b'_' in body:
platform = URL_PREFIXES[prefix]
body = body.replace(b'_', b'/', 1)
elif re.search("\.[A-Za-z]+$", prefix.decode()):
platform = prefix
body = body.replace(b'_', b'/', 1)
else:
platform = b'github.com'
body = self.uri.replace(b'_', b'/', 1)
return b'/'.join((b'https:/', platform, body))
I'm not sure if the other cases for sourceforge and the repositories in URL_PREFIXES
are needed. If so, URL_PREFIXES
could be deleted and the rest of the code could be left as...
class Project(_Base):
...
@cached_property
def url(self):
""" Get the URL for a given project URI
>>> Project('CS340-19_lectures').url
'http://github.com/CS340-19/lectures'
"""
prefix, body = self.uri.split(b'_', 1)
if re.search("\.[A-Za-z]+$", prefix.decode()):
platform = prefix
body = body.replace(b'_', b'/', 1)
else:
platform = b'github.com'
body = self.uri.replace(b'_', b'/', 1)
return b'/'.join((b'https:/', platform, body))
Yes, prefix array is superfluous and, furthermore, needs to be changed with each new forge added. May be worth checking what characters are allowed in a top-level domain as the proposed code assumes no numeric characters.
Is there any way I can further help with this issue?
Many thanks: would you mind submitting a fix via PR to get the right author for the fix? (for now, lets not worry about unicode in top-level domain names). I can apply your patch as well if you prefer. The commit comment should start with FIX issue #48 - url's for non-github repositoried
I believe I have no rights to commit in the repository.
@sivanahamer This issue is not that hard to fix, but for the time it is blocked by another change. Apparently there was a fairly recent change in relations, they started to use deduplicated project URL (capital P vs lowercase p). I need to update several more bits of code to make it work, and was holding only because of that.
Unfortunately, I only have a few hours per week to this project - but I'll try to finish it this weekend.
@user2589 There is also a and A for authors/aliased authors. Just to be clear, the lowercase (not aliased) versions are not going away, however.
I am not sure how you plan to handle it, but it may be simplest to have a separate class for project and author for aliased versions (as a completely different object) as combining both in a single class may add complications.
I encountered a case where the repository's owner name contains an underscore. For example: https://gitlab.com/rki_bioinformatics/DeePaC is represented as gitlab.com_rki_bioinformatics_DeePaC. Besides, GitHub used to allow username contain an underscore I think we may consider a more suitable separator between owner name and repository name to allow invertible transformation between repository url and woc repository representation. Or we may maintain another database (or add an extra url field to exist P.metadata collection) to map repository url to woc's repository representation.
Great point.
a) Indeed, the following instances of user/org names on github contain underscore, these need to be added as exceptions to the url translator.
pj_nitin
up_the_irons
imarsh_emusand
alwell-kevin_emusand
nat_contoso
bryanaknight_emusand
alwell-kevin_demo
cjs_emusand
camilogarcialarotta_emusand
sbryant_emusand
jamiecool_emusand
evil-camilo_emusand
rohitnb_fabrikam
RaviGadhia_fabrikam
nielspineda_fabrikam
son7211_fabrikam
b) Renaming non-gh repos to an actual url may be a reasonable approach to deal with url translation for the remaining forges. This can be a task for an upcoming version V.
Hello!
There is an issue with generating the URL for some non-GitHub repositories.
For example, the project with URI
gitlab.freedesktop.org_libinput_libinput
. The following code...Produces the following output...
However, the project URL is
https://gitlab.freedesktop.org/libinput/libinput
.This error can happen to any project that is not hosted in the list of projects in
URL_PREFIXES
. Updating the list of projects with those within the list seems unmanageable, as the are many non-GitHub repositories in WoC.