nephila / giturlparse

Parse & rewrite git urls (supports GitHub, Bitbucket, Assembla ...)
https://pypi.python.org/pypi/giturlparse
Apache License 2.0

URL and repo regex should be non-greedy (maybe) #18

Closed GreatBahram closed 3 years ago

GreatBahram commented 5 years ago

Hello, I think there is a problem in this regex, or maybe in all of them:

import re

pattern = r'https://(?P<domain>.+)/(?P<owner>.+)/(?P<repo>.+?)(?:\.git)?$'
url = 'https://github.com/nephila/giturlparse/blob/master/giturlparse/platforms/github.py'

match = re.match(pattern, url).groupdict()

I suppose this is an invalid URL, for the simple reason that git clone would not accept it. However, when we check the match object, the result is a bit strange:

{'domain': 'github.com/nephila/giturlparse/blob/master/giturlparse',
 'owner': 'platforms',
 'repo': 'github.py'}
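To illustrate why the groups come out this way, here is a small sketch (my own illustration, not code from the library): because each greedy `.+` consumes as much as it can, only the last two slashes end up acting as separators, so the match is effectively a split from the right. The variable names `greedy` and `from_right` are made up for this example.

```python
import re

greedy = r'https://(?P<domain>.+)/(?P<owner>.+)/(?P<repo>.+?)(?:\.git)?$'
url = 'https://github.com/nephila/giturlparse/blob/master/giturlparse/platforms/github.py'

groups = re.match(greedy, url).groupdict()

# Greedy ".+" backtracks only as far as needed, so the two literal "/" in the
# pattern land on the LAST two slashes of the URL. Splitting the part after
# the scheme from the right, at most twice, gives the same three pieces.
from_right = url[len('https://'):].rsplit('/', 2)

matched = [groups['domain'], groups['owner'], groups['repo']]
print(matched == from_right)
```

For this URL the comparison holds, which is why `domain` swallows most of the path.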

If we make them non-greedy, the result is a little better:

pattern = r'https://(?P<domain>.+?)/(?P<owner>.+?)/(?P<repo>.+?)(?:\.git)?$'
match = re.match(pattern, url).groupdict()
{'domain': 'github.com',
 'owner': 'nephila',
 'repo': 'giturlparse/blob/master/giturlparse/platforms/github.py'}
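Even the non-greedy version still accepts the URL and puts the rest of the path into `repo`. One possible way to reject such URLs outright, sketched here as an assumption and not as what the library does, is to forbid `/` inside each segment with a negated character class. The names `STRICT_HTTPS` and `parse_https` are hypothetical.

```python
import re

# Hypothetical stricter pattern (not the library's): no segment may contain "/".
STRICT_HTTPS = re.compile(
    r'https://(?P<domain>[^/]+)/(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?$'
)


def parse_https(url):
    """Return the named groups, or None if the URL is not a plain repo URL."""
    match = STRICT_HTTPS.match(url)
    return match.groupdict() if match else None


clone_url = 'https://github.com/nephila/giturlparse.git'
blob_url = 'https://github.com/nephila/giturlparse/blob/master/giturlparse/platforms/github.py'

print(parse_https(clone_url))  # domain, owner and repo, with ".git" stripped
print(parse_https(blob_url))   # None: extra path segments cannot match
```

Returning `None` instead of calling `.groupdict()` directly on the result also avoids an `AttributeError` when nothing matches. This sketch deliberately ignores trailing slashes, ports, and other URL forms.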

@yakky

GreatBahram commented 5 years ago

I think issue #10 is somehow related to this problem.

yakky commented 5 years ago

@GreatBahram thanks for reporting. In PR #19 I implemented a slightly different logic, which should be more consistent across URL schemes.