Open dmitrizagidulin opened 7 years ago
https://github.com/medialize/URI.js is the most comprehensive library for URI parsing and manipulation I've found. Documentation at http://medialize.github.io/URI.js/docs.html
https://github.com/garycourt/uri-js has somewhat less functionality but supports conversion to and from IRIs, and still has normalization and parsing, which is basically all we need.
For IRI to/from URI manipulations, https://github.com/awwright/node-iri is probably the most complete, but it's node-only (it shouldn't be too hard to port, though).
URI validation is still somewhat arbitrary, because a lot of URIs are "valid" as far as the RFC is concerned but will probably not be interpreted as expected when the specification is applied loosely.
In the case of the aforementioned file:/// protocol if you try to process "file://C:/rdfKB/test.xml"
(which is wrong for a couple of reasons, mainly because the protocol requires the triple slash if there's no host), URI.js returns:
{ protocol: 'file',
username: null,
password: null,
hostname: 'C',
urn: null,
port: null,
path: '/rdfKB/test.xml',
query: null,
fragment: null,
duplicateQueryParameters: false,
escapeQuerySpace: true },
_deferred_build: false }
normalized as: file://c/rdfKB/test.xml
Since the syntax says that the first thing after the 2 slashes is the host, that's how C is treated.
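To make the host behaviour concrete, here is a minimal sketch that splits a URI with the generic regular expression from RFC 3986, Appendix B. It is not how URI.js is implemented; `splitUri` and `RFC3986_SPLIT` are hypothetical names used only for illustration. It shows that with two slashes the drive letter lands in the authority, while with three slashes the authority is empty and the drive letter stays in the path.

```javascript
// Generic URI splitter based on the regular expression in RFC 3986, Appendix B.
// splitUri is a hypothetical helper, not part of URI.js or rdflib.js.
const RFC3986_SPLIT = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function splitUri(uri) {
  // Every group is optional, so the expression matches any input string.
  const m = RFC3986_SPLIT.exec(uri);
  return {
    scheme: m[2],     // text before the first ':'
    authority: m[4],  // text between '//' and the next '/', '?' or '#'
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

// Two slashes: "C:" is read as the authority (host "C" plus an empty port).
const twoSlashes = splitUri('file://C:/rdfKB/test.xml');
console.log(twoSlashes.authority, twoSlashes.path);

// Three slashes: the authority is empty and the drive letter is part of the path.
const threeSlashes = splitUri('file:///C:/rdfKB/test.xml');
console.log(JSON.stringify(threeSlashes.authority), threeSlashes.path);
```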
Not allowing more than one colon is not a solution, because urn:isbn:0201896834 is a valid URI; in general, URNs use colons as separators. Also, "file:///C:/rdfKB/test.xml"
is probably the preferred way to express a local file on Windows, as it is parsed correctly by browsers and node.fs. "file:///C/rdfKB/test.xml" will generally not work, while C|/ is sometimes allowed but discouraged (Chrome is fine with it, but I haven't tested the various libraries). The exception is the xmlhttprequest node library under Windows, which wants 2 slashes; that is frankly just an oversight on their part (at the moment, using 3 slashes under Windows will fail because of this). It would probably be best to test in the fetcher whether the URI uses the file:/// protocol and we are running under node, in which case you could use node.fs, which is more predictable than the stubbed xhr.
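A rough sketch of that fetcher check, assuming we are running under Node. `isFileUri` and `readFileUri` are hypothetical helpers, not existing rdflib.js functions; `fileURLToPath` and `pathToFileURL` are Node's built-in converters from the `url` module. How the fetcher would detect that it is running under node is left out here.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { fileURLToPath, pathToFileURL } = require('url');

// Hypothetical helper: true when the URI uses the file: scheme.
function isFileUri(uri) {
  return /^file:/i.test(uri);
}

// Hypothetical helper: read a file: URI from disk with fs instead of XHR.
function readFileUri(uri) {
  if (!isFileUri(uri)) {
    throw new Error('Not a file: URI: ' + uri);
  }
  // fileURLToPath turns the URI into a path in the current platform's format.
  return fs.readFileSync(fileURLToPath(uri), 'utf8');
}

// Usage: write a temporary file, then read it back through its file: URI.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'rdf-'));
const target = path.join(dir, 'test.ttl');
fs.writeFileSync(target, '<#a> <#b> <#c> .\n');
const uri = pathToFileURL(target).href;
const body = readFileUri(uri);
console.log(uri, body.length);
```

The round trip goes through `pathToFileURL` so the same snippet runs on Windows and elsewhere; the URI it builds uses the three-slash form.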
Since file can basically only be used for local files, it would probably be best to enforce the 3 slashes internally for more consistency across KBs generated on different machines, but to do that we need to manage the Windows xhr problem somehow.
Regarding the whitespaces, those are in theory also allowed in the RFC, and if you use URI.js or uri-js you can just call .normalize() to get the version with the %20 replacements.
I am familiar with / have used URI.js before, so that one gets a :+1: from me.
Continuing the moved discussion from https://github.com/linkeddata/rdflib.js/pull/170/files
I think we should limit the existence of any bad URIs like file://C:\\destination\\folder/file.ttl
to as close to the offending libraries as possible; that is, as I understand it, node's process.cwd() on Windows (until it is fixed) and XHR on Windows. I am surprised there is not a lot of pressure to fix them. We should not be passing around or storing serialized RDF that includes them.
(By the way, a useful library function to add to rdflib.js is to return a default base address, either a file:///
Agreed, colons do occur elsewhere in URIs, such as before the port number in http://localhost:8000/, so the C in file://C:/ gets parsed as a host.
The (rarely used) idea behind a hostname in the file:// was just a note of which computer the URI was minted on, which could for example help the user figure out what to do with it. Not to cause a remote access.
I am loth to change URI library unless we really have to. How long does the validation take? Will it just add more time to all our processing? Often parsers will do it, and when they don't it's probably because they are aiming for speed. Happy to have an rdf.uri.validate() function for app developers who want to use it, but not necessarily to call it at new NamedNode().
I don't think we necessarily need a URI library, but we do need to validate URIs to prevent a whole set of issues, including serializing bad URIs to graphs and breaking the semantic web.
If a URI validation tool serves our use cases and doesn't add too much size to the bundle, then we should go with that. If we can't find one, then we'll have to write our own, but I doubt we'll have trouble finding a suitable off-the-shelf tool.
I think process.cwd() is behaving as intended, as it is more of a system tool for local manipulations, and on Windows that format (backslashes and no leading slash) is what you need to work with node.fs and the file system in general. In my mocha tests I'm just doing a dumb replacement of all backslashes with forward slashes after process.cwd(); at least it doesn't look that bad, but I'm not sure whether a backslash could pop up elsewhere and break that (you can't use those in file or folder names afaik).
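The replacement described above could look roughly like this. `windowsPathToFileUri` is a hypothetical helper written for illustration, not the code from the mocha tests; it only handles backslashes, the missing leading slash, and spaces, and leaves every other character alone.

```javascript
// Hypothetical helper: turn a Windows-style path, such as the one returned by
// process.cwd() on Windows, into a three-slash file: URI. It is pure string
// handling, so it behaves the same on every platform.
function windowsPathToFileUri(winPath) {
  let p = winPath.replace(/\\/g, '/'); // backslashes -> forward slashes
  if (!p.startsWith('/')) {
    p = '/' + p; // "C:/x" -> "/C:/x"
  }
  // Percent-encode spaces only; other reserved characters are not handled here.
  return 'file://' + p.replace(/ /g, '%20');
}

console.log(windowsPathToFileUri('C:\\rdfKB\\test.xml'));
console.log(windowsPathToFileUri('C:\\my data\\test.xml'));
```

Node also ships `url.pathToFileURL()`, which does a fuller job but interprets the path according to the platform it runs on, so it may be the better choice outside of tests.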
The xmlhttprequest stub library we are using had its last significant update more than 2 years ago, so if we need that we might as well fork it and fix it ourselves.
URI.js is modular; just the basic lib plus IPv6 support is 30kb uncompressed. uri-js is just 6.7kb uncompressed, plus 7.6kb for the URI/IRI support if we need it. I don't know about performance, though.
I'd avoid placing URI validation on top of every node creation as that would probably kill the processing time for big files. There should maybe be an option to be set on the fetcher, and an option when creating nodes for that. Also an option for normalization would be useful, replacing capitals and escaping special characters. Both those libs can easily do that.
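As a sketch of what such opt-in options might look like: all names here (`makeNamedNode`, `validateIri`, `normalizeIri`, and the `validate` / `normalize` flags) are hypothetical and not the rdflib.js API. The checks are deliberately minimal (whitespace and absolute vs relative, lowercasing the scheme, escaping spaces); a real implementation would presumably delegate to URI.js or uri-js.

```javascript
// Hypothetical sketch: validation and normalization are opt-in, so bulk
// parsing pays nothing unless the caller asks for the checks.
const SCHEME = /^[A-Za-z][A-Za-z0-9+.-]*:/;

function normalizeIri(value) {
  // Lowercase the scheme and percent-encode spaces; nothing else is touched.
  return value
    .replace(SCHEME, (scheme) => scheme.toLowerCase())
    .replace(/ /g, '%20');
}

function validateIri(value) {
  if (/\s/.test(value)) {
    throw new Error('IRI contains whitespace: ' + value);
  }
  if (!SCHEME.test(value)) {
    throw new Error('IRI is not absolute: ' + value);
  }
}

function makeNamedNode(value, options = {}) {
  let iri = value;
  if (options.normalize) iri = normalizeIri(iri);
  if (options.validate) validateIri(iri);
  return { termType: 'NamedNode', value: iri };
}

// Default: no checks, so the value passes through untouched.
console.log(makeNamedNode('HTTP://example.org/a b').value);
// Opt in: normalize first, then validate the normalized value.
console.log(makeNamedNode('HTTP://example.org/a b', { normalize: true, validate: true }).value);
```

Normalizing before validating means a URI whose only problem is an unescaped space is repaired rather than rejected; swapping the order would make validation stricter.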
Just a quick speed test:
This is a really rough test, run through node on local, with just a single url with some escapes and a simple query in it.
Currently (as of PR #167), we're only checking for spaces in named nodes' IRI values, and checking for absolute vs relative. As @dan-f mentioned, we should use a dedicated IRI validation library & do this properly.