ssc-oscar / oscar.py

Python interface for OSCAR data
GNU General Public License v3.0
4 stars 10 forks source link

Incorrect URLs for non-GitHub repositories #48

Open sivanahamer opened 2 years ago

sivanahamer commented 2 years ago

Hello!

There is an issue with generating the URL for some non-GitHub repositories.

For example, the project with URI gitlab.freedesktop.org_libinput_libinput. The following code...

print(oscar.Project("gitlab.freedesktop.org_libinput_libinput").url.decode())

Produces the following output...

https://github.com/gitlab.freedesktop.org/libinput_libinput

However, the project URL is https://gitlab.freedesktop.org/libinput/libinput.

This error can happen to any project that is not hosted in the list of projects in URL_PREFIXES. Updating the list of projects with those within the list seems unmanageable, as the are many non-GitHub repositories in WoC.

sivanahamer commented 2 years ago

I do have a fix for the issue, though I don't have permissions to add the corresponding commit with the pull request in this repository.

user2589 commented 2 years ago

You can post it here. I'll issue an update this weekend.

Thank you

sivanahamer commented 2 years ago

In oscar.pyx removed the URLs with the same prefix:

URL_PREFIXES = {
    b'drupal.com': b'git.drupal.org',
    b'kde.org': b'anongit.kde.org',
    b'sourceforge.net': b'git.code.sf.net/p'
}

And then, added the case in the URL property within the Project:

class Project(_Base):
    ...
    @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if prefix == b'sourceforge.net':
            platform = URL_PREFIXES[prefix]
        elif prefix in URL_PREFIXES and b'_' in body:
            platform = URL_PREFIXES[prefix]
            body = body.replace(b'_', b'/', 1)
        elif b'_' in body:
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))

P.D: I'm not sure if all URLs will have only one _ to replace.

user2589 commented 2 years ago

This will brake github urls, as they always include an underscore separating the username from the project name. I'll think of some other solution, though.

sivanahamer commented 2 years ago

Maybe a solution could be to check if the prefix has “.com”, “.net”, or other common internet domains? Like a url checker.

user2589 commented 2 years ago

This is exactly what I am doing now, but there are a few border cases to deal with

audrism commented 2 years ago

Here is how git url is transformed into project id (bb, gl, and dr are just hostnames for ssh as indicated):

s|^bb:([^/]+)/|bitbucket.org_$1_|;
s|^gl:([^/]+)/|gitlab.com_$1_|;
s|^dr:([^/]+)/|drupal.com_$1_|;
s|^https://([^/]*)/([^/]*)/|$1_$2_|;
s|^https://([^/]*)/|$1_|;

In words: two first / are replaced by , if there is only one, it is also replaced by Simple rules to determine non-github origin:

some examples:

https://salsa.debian.org/science-team/plplot
https://gitlab.inria.fr/allgo/notebooks/speads
https://gitlab.freedesktop.org/aphogat/raytracing
https://pagure.io/alien
https://gitlab.inria.fr/guix-hpc/website
https://gitlab.inria.fr/grid5000/kwollect-dashboard
https://git.eclipse.org/r/henshin/org.eclipse.emft.henshin.git/
https://android.googlesource.com/platform/packages/apps/Test/connectivity

A list of prefixes update in Version T, U, S, Q:

android.googlesource.com
anongit.kde.org
blitiri.com.ar
code.ill.fr
code.qt.io
fedorapeople.org
forgemia.inra.fr
framagit.org
gcc.gnu.org
git.alpinelinux.org
git.bioconductor.org
git.code.sf.net
git.eclipse.org
git.kernel.org
git.openembedded.org
git.pleroma.social
git.postgresql.org
git.savannah.gnu.org
git.savannah.nongnu.org
git.torproject.org
git.unicaen.fr
git.unistra.fr
git.xfce.org
git.yoctoproject.org
git.zx2c4.com
gitbox.apache.org
gite.lirmm.fr
gitlab.adullact.net
gitlab.cerema.fr
gitlab.common-lisp.net
gitlab.fing.edu.uy
gitlab.freedesktop.org
gitlab.gnome.org
gitlab.huma-num.fr
gitlab.inria.fr
gitlab.irstea.fr
gitlab.ow2.org
invent.kde.org
notabug.org
pagure.io
repo.or.cz
salsa.debian.org
sivanahamer commented 2 years ago

I was trying to create a url checker. A possibility could be to use a regex such as \.[A-Za-z]+$. Also, there is a library that parses public domains, if needed be.

In code, with just the regex, it would be...

class Project(_Base):
    ...
   @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if prefix == b'sourceforge.net':
            platform = URL_PREFIXES[prefix]
        elif prefix in URL_PREFIXES and b'_' in body:
            platform = URL_PREFIXES[prefix]
            body = body.replace(b'_', b'/', 1)
        elif re.search("\.[A-Za-z]+$", prefix.decode()):
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))

I'm not sure if the other cases for sourceforge and the repositories in URL_PREFIXES are needed. If so, URL_PREFIXES could be deleted and the rest of the code could be left as...

class Project(_Base):
    ...
   @cached_property
    def url(self):
        """ Get the URL for a given project URI
        >>> Project('CS340-19_lectures').url
        'http://github.com/CS340-19/lectures'
        """
        prefix, body = self.uri.split(b'_', 1)
        if re.search("\.[A-Za-z]+$", prefix.decode()):
            platform = prefix
            body = body.replace(b'_', b'/', 1)
        else:
            platform = b'github.com'
            body = self.uri.replace(b'_', b'/', 1)
        return b'/'.join((b'https:/', platform, body))
audrism commented 2 years ago

Yes, prefix array is superfluous and, furthermore, needs to be changed with each new forge added. May be worth checking what characters are allowed in a top-level domain as the proposed code assumes no numeric characters.

sivanahamer commented 2 years ago

I guess from the list of TLD from here and here it seems that the TLD have no numbers, however there are some that have other Unicode characters.

sivanahamer commented 2 years ago

Is there any way I can further help with this issue?

audrism commented 2 years ago

Many thanks: would you mind submitting a fix via PR to get the right author for the fix? (for now, lets not worry about unicode in top-level domain names). I can apply your patch as well if you prefer. The commit comment should start with FIX issue #48 - url's for non-github repositoried

sivanahamer commented 2 years ago

I believe I have no rights to commit in the repository.

user2589 commented 2 years ago

@sivanahamer This issue is not that hard to fix, but for the time it is blocked by another change. Apparently there was a fairly recent change in relations, they started to use deduplicated project URL (capital P vs lowercase p). I need to update several more bits of code to make it work, and was holding only because of that.

Unfortunately, I only have a few hours per week to this project - but I'll try to finish it this weekend.

audrism commented 2 years ago

@user2589 There is also a and A for authors/aliased authors. Just to be clear, the lowercase (not aliased) versions are not going away, however.

I am not sure how you plan to handle it, but it may be simplest to have a separate class for project and author for aliased versions (as a completely different object) as combining both in a single class may add complications.

gaokai320 commented 2 years ago

I encountered a case where the repository's owner name contains an underscore. For example: https://gitlab.com/rki_bioinformatics/DeePaC is represented as gitlab.com_rki_bioinformatics_DeePaC. Besides, GitHub used to allow username contain an underscore I think we may consider a more suitable separator between owner name and repository name to allow invertible transformation between repository url and woc repository representation. Or we may maintain another database (or add an extra url field to exist P.metadata collection) to map repository url to woc's repository representation.

audrism commented 2 years ago

Great point.

a) Indeed, the following instances of user/org names on github contain underscore, these need to be added as exceptions to the url translator.

pj_nitin
up_the_irons
imarsh_emusand
alwell-kevin_emusand
nat_contoso
bryanaknight_emusand
alwell-kevin_demo
cjs_emusand
camilogarcialarotta_emusand
sbryant_emusand
jamiecool_emusand
evil-camilo_emusand
rohitnb_fabrikam
RaviGadhia_fabrikam
nielspineda_fabrikam
son7211_fabrikam

b) Renaming non-gh repos to an actual url may be a reasonable approach to deal with url translation for the remaining forges. This can be a task for an upcoming version V.