tj / node-querystring

querystring parser for node and the browser - supporting nesting (used by Express, Connect, etc)
MIT License

Definitive guide to parsing / stringifying querystrings #78

Open buschtoens opened 10 years ago

buschtoens commented 10 years ago

There are a bazillion ways a querystring could and should look. Currently our implementation has some oddities, hence all the issues. I'll write up a definitive guide to how we will handle parsing and stringifying in the future. It will be added to the Readme, so that qs's behaviour is predictable and as close as possible to current browser form serialization.

I'll post it here first and everyone is invited to discuss it. I want it to cover all possible edge cases.

But for now, I need to catch some sleep. Haha. :wink:

tj commented 10 years ago

haha yeah, some proper docs in the readme as far as what to expect from each method would be great

buschtoens commented 10 years ago

Some RFCs

Specs on query strings are rare. I'll start listing the relevant parts here.

RFC 3986#3.4: Query

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

query = *( pchar / "/" / "?" )

The characters slash ("/") and question mark ("?") may represent data within the query component. Beware that some older, erroneous implementations may not handle such data correctly when it is used as the base URI for relative references (Section 5.1), apparently because they fail to distinguish query data from path data when looking for hierarchical separators. However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
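To make the delimiting rule concrete, here is a minimal sketch of pulling the query component out of a URI string. `extractQuery` is a hypothetical helper written for this thread, not something qs exposes, and it assumes the input is already a syntactically valid URI.

```javascript
// Hypothetical helper (not part of qs): return the raw query component of a URI,
// i.e. everything after the first "?" up to the first "#" or the end of the string.
function extractQuery(uri) {
  // A "#" starts the fragment, so cut it off first; a "?" inside the
  // fragment is not a query delimiter.
  const hash = uri.indexOf("#");
  const withoutFragment = hash === -1 ? uri : uri.slice(0, hash);

  const mark = withoutFragment.indexOf("?");
  // No "?" means the URI has no query component at all.
  if (mark === -1) return null;

  // Any later "?" or "/" is plain data and stays in the result.
  return withoutFragment.slice(mark + 1);
}
```

For `http://example.com/a?next=/b?c=d#top` this returns `next=/b?c=d`: only the first `?` acts as a delimiter, and the unescaped `/` and second `?` are kept as data, as the RFC allows.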

RFC 2396#2: URI Characters and Escape Sequences

URI consist of a restricted set of characters, primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as delimiters around URI were excluded. The restricted set of characters consists of digits, letters, and a few graphic symbols were chosen from those common to most of the character encodings and input facilities available to Internet users.

uric = reserved | unreserved | escaped

Within a URI, characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding. This representation is elaborated below.
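A quick sketch of the two representations, using the built-in `encodeURIComponent` and `decodeURIComponent`. Note the assumption: these built-ins treat escaped octets as UTF-8, which RFC 2396 itself does not mandate, so this only illustrates the direct vs. escaped distinction for ASCII.

```javascript
// "a" is an unreserved character, so its octet (0x61) may appear directly.
const direct = encodeURIComponent("a");

// A space is not allowed in a URI, so its octet (0x20) is escaped as "%20".
const escaped = encodeURIComponent(" ");

// Decoding accepts both forms: "%61" is just the escaped spelling of "a".
const decoded = decodeURIComponent("%61%20b");

console.log(direct, escaped, decoded);
```

Both spellings decode to the same octet, which is why a parser has to decode keys and values before comparing them.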

Further reading:

RFC 2396#3.4: Query Component

The query component is a string of information to be interpreted by the resource.

query = *uric

Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.
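As a rough sketch of what this means for stringifying: any of these reserved characters appearing inside a key or value has to be escaped, otherwise it would be read as a delimiter. `QUERY_RESERVED`, `encodePair` and `hasReserved` are hypothetical names for illustration, not qs API; the escaping itself relies on the built-in `encodeURIComponent`, which escapes all ten characters in the list.

```javascript
// The reserved set quoted above from RFC 2396, section 3.4.
const QUERY_RESERVED = [";", "/", "?", ":", "@", "&", "=", "+", ",", "$"];

// Hypothetical helper: build one "key=value" pair with reserved
// characters escaped in both the key and the value.
function encodePair(key, value) {
  return encodeURIComponent(key) + "=" + encodeURIComponent(value);
}

// Hypothetical helper: true if the string still contains a reserved
// character in its literal (unescaped) form.
function hasReserved(str) {
  return QUERY_RESERVED.some((ch) => str.includes(ch));
}

console.log(encodePair("a&b", "c=d"));
```

With this, the only literal `=` left in the output of `encodePair` is the one separating key from value, so a parser can split on it safely.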