package-url / packageurl-python

Python implementation of the package url spec. This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ , the Google Summer of Code, nexB and other generous sponsors.

Simplify route declaration in url2purl #53

Open tdruez opened 3 years ago

tdruez commented 3 years ago

Following up on this review comment: https://github.com/package-url/packageurl-python/pull/51/files/c1d41a8930b0b89dfc3774b4e18d89de5089e593..7877bb50102482468bdb9b32476d5a6151dc368e#r508692262

Working with regex syntax is error-prone, and it should not be necessary for most simple routes. For example, the common path-segment pattern '[^/]+' could be abstracted away, which would make existing routes easier to read and new routes easier to add.
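As a rough illustration of that abstraction, the sketch below hides '[^/]+' behind a small helper so a route reads as a list of named segments. The names SEGMENT and segment() and the GitHub route itself are hypothetical, made up for this example; they are not part of url2purl.

```python
import re

# Hypothetical constant: one path segment, i.e. anything up to the next "/".
SEGMENT = r"[^/]+"


def segment(name):
    """Return a named capture group matching a single path segment."""
    return f"(?P<{name}>{SEGMENT})"


# Illustrative route only; the real url2purl routes may differ.
pattern = re.compile(
    r"^https?://github\.com/" + segment("namespace") + "/" + segment("name") + "/?$"
)

match = pattern.match("https://github.com/package-url/packageurl-python")
fields = match.groupdict()
```

The helper keeps the regex detail in one place, so changing what counts as a segment would not require touching every route.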

We could reuse some ideas from Django's newer URL routing system, which replaces the old regex-based one: https://docs.djangoproject.com/en/3.1/topics/http/urls/#url-dispatcher

This system abstracts the regex complexity into "converters". For example, r'^articles/(?P<year>[0-9]{4})/$' becomes articles/<yyyy:year>/, where yyyy is a custom converter registered for four-digit years.
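To make the converter idea concrete without depending on Django, here is a minimal standard-library sketch of the same translation. The CONVERTERS table and route_to_regex() are assumptions written for this example; they only imitate the '<converter:name>' syntax and are not Django's implementation.

```python
import re

# Hypothetical converter table: converter name -> regex fragment.
CONVERTERS = {
    "str": r"[^/]+",      # a single path segment
    "path": r".+",        # the rest of the path, slashes included
    "yyyy": r"[0-9]{4}",  # a four-digit year
}

# Matches placeholders of the form <converter:name>.
_PLACEHOLDER = re.compile(r"<(?P<converter>[a-z]+):(?P<name>[A-Za-z_][A-Za-z0-9_]*)>")


def route_to_regex(route):
    """Translate a '<converter:name>' route into an anchored regex string."""
    parts = []
    pos = 0
    for match in _PLACEHOLDER.finditer(route):
        # Literal text between placeholders is escaped as-is.
        parts.append(re.escape(route[pos:match.start()]))
        fragment = CONVERTERS[match["converter"]]
        parts.append(f"(?P<{match['name']}>{fragment})")
        pos = match.end()
    parts.append(re.escape(route[pos:]))
    return "^" + "".join(parts) + "$"


pattern = route_to_regex("articles/<yyyy:year>/")
result = re.match(pattern, "articles/2020/").groupdict()
```

An unknown converter name raises KeyError here, which is probably the right failure mode for a route table defined in code.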

Using a current url2purl example:

Could become:

Much easier to write and to read.


Playing around with Django's _route_to_regex (a private helper, so it may change between releases):

import re

from django.urls.resolvers import _route_to_regex

route = "https://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
pattern = _route_to_regex(route, is_endpoint=True)[0]
# -> "^https\\:\\/\\/raw\\.githubusercontent\\.com\\/(?P<namespace>[^/]+)\\/(?P<name>[^/]+)\\/(?P<version>[^/]+)\\/(?P<subpath>.+)$"

url = "https://raw.githubusercontent.com/LeZuse/flex-sdk/master/frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as"
re.compile(pattern, re.VERBOSE).match(url).groupdict()
# -> {'namespace': 'LeZuse', 'name': 'flex-sdk', 'version': 'master', 'subpath': 'frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as'}

We could add custom converters for the specific needs of purl: https://docs.djangoproject.com/en/3.1/topics/http/urls/#registering-custom-path-converters

Some parts, like the (http|https) scheme and the domain section, will also need support, since they are not part of the Django system:

from django.urls.resolvers import _route_to_regex
from django.urls.converters import register_converter
from django.urls.converters import StringConverter

class ProtocolConverter(StringConverter):
    regex = '(http|https|ftp)'

register_converter(ProtocolConverter, 'protocol')

route = "<protocol:protocol>://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
_route_to_regex(route, is_endpoint=True)[0]
# -> '^(?P<protocol>(http|https|ftp))\\:\\/\\/raw\\.githubusercontent\\.com\\/(?P<namespace>[^/]+)\\/(?P<name>[^/]+)\\/(?P<version>[^/]+)\\/(?P<subpath>.+)$'

TG1999 commented 3 years ago

Cool stuff, will surely try this out :D

pombredanne commented 2 years ago

FWIW, URI Templates (RFC 6570) are also worth considering: https://datatracker.ietf.org/doc/html/rfc6570
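For comparison, a route written in URI Template style might look like the sketch below. RFC 6570 specifies template expansion, not matching, so turning a template into a regex is an assumption of this example. It handles only simple {var} and reserved {+var} expressions, and template_to_regex() is a hypothetical helper.

```python
import re

# Handles only "{name}" and "{+name}" expressions, a small subset of RFC 6570.
_EXPRESSION = re.compile(r"\{(?P<operator>\+?)(?P<name>[A-Za-z0-9_]+)\}")


def template_to_regex(template):
    """Turn a small subset of URI Template syntax into an anchored regex string."""
    parts = []
    pos = 0
    for m in _EXPRESSION.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        # "{+name}" may contain slashes; plain "{name}" is limited to one segment here.
        fragment = ".+" if m["operator"] == "+" else "[^/]+"
        parts.append(f"(?P<{m['name']}>{fragment})")
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return "^" + "".join(parts) + "$"


template = "https://raw.githubusercontent.com/{namespace}/{name}/{version}/{+subpath}"
url = "https://raw.githubusercontent.com/LeZuse/flex-sdk/master/frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as"
fields = re.match(template_to_regex(template), url).groupdict()
```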