staskobzar / url_parser_re2c

Parsing URL (rfc3986) with re2c
MIT License

License and schemaless URL #1

Open yohgaki opened 8 years ago

yohgaki commented 8 years ago

I'm looking for a URL/URI parser implemented with re2c. Nice work! What is the license of this code? MIT/BSD is preferred :)

It seems that schemaless URLs, e.g. '//example.com/path/to/file', are not supported yet. They are often used on sites that mix HTTP and HTTPS. Will this be supported? http://greenbytes.de/tech/webdav/rfc3986.html#reference-resolution
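To illustrate the case, here is a minimal sketch of how a schemaless reference inherits the scheme of the page it appears on. It uses `urljoin` from Python's standard library, not this project's re2c parser, and the base URLs are made-up examples.

```python
from urllib.parse import urljoin

# A reference with an authority but no scheme.
ref = "//example.com/path/to/file"

# Hypothetical pages that contain the reference, one per scheme.
http_page = "http://site.test/index.html"
https_page = "https://site.test/index.html"

# Resolution keeps the authority and path of the reference
# and takes the scheme from the base URL.
resolved_http = urljoin(http_page, ref)
resolved_https = urljoin(https_page, ref)

print(resolved_http)
print(resolved_https)
```

The same reference therefore works unchanged on both the HTTP and the HTTPS version of a site.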

It also seems that path-only URLs are not supported. These are allowed in the HTTP Location header. http://greenbytes.de/tech/webdav/rfc7231.html#header.location
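As a sketch of what a client does with such a header value: the path-only reference is resolved against the URL that was requested. The helper name `resolve_location` and the URLs are assumptions for illustration, and the resolution itself is delegated to Python's standard `urljoin`.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the requested URL.

    Hypothetical helper: it only wraps urllib.parse.urljoin.
    """
    return urljoin(request_url, location)


# Hypothetical request URL for the examples below.
request_url = "http://example.com/old/page?x=1"

# Absolute path: replaces the whole path, keeps scheme and authority.
absolute = resolve_location(request_url, "/path/to/file")

# Relative path: replaces only the last segment of the base path.
relative = resolve_location(request_url, "other")

print(absolute)
print(relative)
```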

http://greenbytes.de/tech/webdav/rfc3986.html#components

The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty

The scheme and path components are required, though the path may be empty (no characters). When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//"). These restrictions result in five different ABNF rules for a path (Section 3.3), only one of which will match any given URI reference.
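For reference, the five components can be separated with the regular expression given in RFC 3986, Appendix B. The sketch below wraps it in a hypothetical helper, `split_reference`; it only splits a reference and does not validate the characters inside each component. A component that is absent is reported as `None`.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Split a URI reference into its five generic components."""
    m = URI_REFERENCE.match(ref)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_reference("http://example.com/path/to/file?a=1#top"))
print(split_reference("//example.com/path/to/file"))
print(split_reference("/path/to/file"))
```

The second and third calls are the schemaless and path-only cases discussed above: both have no scheme, and the last one has no authority either.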

So Section 3.3 defines five different ABNF rules for a path: http://greenbytes.de/tech/webdav/rfc3986.html#path

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
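The path rules above translate almost directly into regular expressions. The following is a rough sketch in Python, not the re2c grammar of this project; the helper name `matching_path_rules` is made up for illustration. Note that a path string taken in isolation can satisfy more than one rule. Which one applies depends on whether a scheme or an authority precedes the path, so the helper returns every rule that matches.

```python
import re

# Character classes from RFC 3986: unreserved, sub-delims, pct-encoded.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"
PCHAR_NC = rf"(?:[{UNRESERVED}{SUB_DELIMS}@]|{PCT_ENCODED})"  # pchar without ":"

SEGMENT = rf"{PCHAR}*"
SEGMENT_NZ = rf"{PCHAR}+"
SEGMENT_NZ_NC = rf"{PCHAR_NC}+"
TAIL = rf"(?:/{SEGMENT})*"  # *( "/" segment )

PATH_RULES = [
    ("path-empty", r""),
    ("path-abempty", TAIL),
    ("path-absolute", rf"/(?:{SEGMENT_NZ}{TAIL})?"),
    ("path-noscheme", rf"{SEGMENT_NZ_NC}{TAIL}"),
    ("path-rootless", rf"{SEGMENT_NZ}{TAIL}"),
]


def matching_path_rules(path):
    """Return the names of all path rules that match the whole string."""
    return [name for name, pattern in PATH_RULES if re.fullmatch(pattern, path)]


for sample in ["", "/path/to/file", "//example.com/x", "a:b/c", "a/b", "a b"]:
    print(repr(sample), matching_path_rules(sample))
```

For example, "a:b/c" matches only path-rootless, because the colon in the first segment rules out path-noscheme. That is the restriction that keeps a relative path from being mistaken for a scheme.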

If all of these are supported, it would be great!!

staskobzar commented 8 years ago

Hello Yasuo,

Thank you for your comment.

You are right, there is no support for relative (including schemaless) URLs. It is a simple project that I wrote to learn re2c and to use it in Ruby C extensions. I am not sure if I will be able to add this support soon, but I am thinking about it.

I have another, similar project that uses Ragel for parsing URLs, and it should have better support for RFC 3986:

https://github.com/staskobzar/uri_scanner

It produces Ruby code but can be easily changed to produce C code.

As I said, both projects were created for learning purposes and I am not using them in any production projects, so I cannot guarantee that they work 100%.

I have added the MIT license, so feel free to take the code and use it.

Have a good day!

yohgaki commented 8 years ago

Thank you for the reply. MIT license is great!! I'll look into the URI scanner also!