simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Tilde encoding: use ~ instead of - for dash-encoding #1657

Closed simonw closed 2 years ago

simonw commented 2 years ago

Refs #1439

simonw commented 2 years ago

The problem with the dash encoding mechanism is that it turns out dashes are used in a LOT of existing Datasette instances - much of https://fivethirtyeight.datasettes.com/fivethirtyeight for example, and even https://datasette.io/ itself: https://datasette.io/dogsheep-index

It's pretty ugly to force all of those to change to their dash-encoded equivalent - and in fact it broke https://datasette.io/ in a subtle way:

I'm going to try using ~ instead and see if that works as well and causes less breakage to existing sites.

simonw commented 2 years ago

Asked about this on Twitter:

Anyone ever seen a proxy or other URL handling system do anything surprising with the tilde "~" character?

I'm considering it as an escaping character, in place of "-" as described in

Replies so far seem like it should be OK - Apache has supported this for home directories for a couple of decades now without any problems.

simonw commented 2 years ago

Relevant: https://datatracker.ietf.org/doc/html/rfc3986#section-2.2


      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Notably ~ is not in either of those lists.
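
A quick way to sanity-check that observation from Python. This is only a sketch: the constant names `GEN_DELIMS`, `SUB_DELIMS` and `RESERVED` are made up for this example and simply copy the two RFC 3986 lists quoted above. It also looks at what `urllib.parse.quote` does with `~`, assuming Python 3.7 or later, where `quote` leaves `~` unescaped.

```python
from urllib.parse import quote

# The two reserved sets from RFC 3986, copied from the grammar quoted above.
# These constant names are illustrative only.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS

# "~" does not appear in either list, while "/" does.
tilde_is_reserved = "~" in RESERVED
slash_is_reserved = "/" in RESERVED

# With safe="" every reserved character gets percent-encoded.
# Assumes Python 3.7+, where quote() leaves "~" alone.
quoted_tilde = quote("~", safe="")
quoted_slash = quote("/", safe="")

print(tilde_is_reserved, slash_is_reserved, quoted_tilde, quoted_slash)
```

So a standards-following URL library has no reason to rewrite a literal `~`. That says nothing about how every proxy in the wild behaves, which is what the Twitter question was trying to find out.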

simonw commented 2 years ago

And in https://datatracker.ietf.org/doc/html/rfc3986#section-2.3 "Unreserved Characters":

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

simonw commented 2 years ago

Updated test:

import pytest
from datasette import utils


@pytest.mark.parametrize(
    "original,expected",
    (
        ("abc", "abc"),
        ("/foo/bar", "~2Ffoo~2Fbar"),
        ("/-/bar", "~2F-~2Fbar"),
        ("-/db-/table.csv", "-~2Fdb-~2Ftable~2Ecsv"),
        (r"%~-/", "~25~7E-~2F"),
        ("~25~7E~2D~2F", "~7E25~7E7E~7E2D~7E2F"),
    ),
)
def test_tilde_encoding(original, expected):
    actual = utils.tilde_encode(original)
    assert actual == expected
    # And test round-trip
    assert original == utils.tilde_decode(actual)
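
For readers who want to see what those expectations imply, here is a minimal sketch of an encoder/decoder pair that would satisfy the cases in the test above. It is not necessarily how `utils.tilde_encode` and `utils.tilde_decode` are implemented in Datasette. The names `tilde_encode_sketch`, `tilde_decode_sketch` and `_SAFE` are hypothetical, and the choice of safe characters (ASCII letters, digits, `_` and `-`, with `.` and `~` excluded) is an assumption inferred from the expected values.

```python
import urllib.parse

# Bytes passed through unchanged. Inferred from the test cases:
# "-" survives, while "." becomes ~2E and "~" becomes ~7E.
_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_-"
)


def tilde_encode_sketch(s: str) -> str:
    """Encode every UTF-8 byte outside _SAFE as ~XX (uppercase hex)."""
    return "".join(
        chr(b) if b in _SAFE else "~{:02X}".format(b)
        for b in s.encode("utf-8")
    )


def tilde_decode_sketch(s: str) -> str:
    """Reverse tilde_encode_sketch by reusing percent-decoding."""
    # Protect any literal "%" first so it is not mistaken for an escape,
    # then treat "~" as if it were "%" and let unquote() do the work.
    protected = s.replace("%", "%25")
    return urllib.parse.unquote(protected.replace("~", "%"))


if __name__ == "__main__":
    for original in ("/foo/bar", "-/db-/table.csv", "%~-/"):
        encoded = tilde_encode_sketch(original)
        print(original, "->", encoded, "->", tilde_decode_sketch(encoded))
```

Because `~` is itself encoded as `~7E`, an already-encoded string such as `~25~7E~2D~2F` still round-trips, which is what the last test case checks.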

simonw commented 2 years ago

I've made a real mess of this. I'm going to revert Datasette main back to the last commit that passed the tests and try this again in a branch.

simonw commented 2 years ago

The state I had got to prior to that revert is in https://github.com/simonw/datasette/tree/issue-1657-wip

simonw commented 2 years ago

The thing that broke everything was this change:

[image: screenshot of the change, not preserved]

I'm going to bring back the horrible get_format() method for the moment, with its weird mutations of the args object, then try and get rid of it again later.

simonw commented 2 years ago

Moving this to a PR.

simonw commented 2 years ago

Documentation: https://docs.datasette.io/en/latest/internals.html#tilde-encoding

simonw commented 2 years ago

Now live here: https://fivethirtyeight.datasettes.com/fivethirtyeight/august-senate-polls~2Faugust_senate_polls

simonw commented 2 years ago

Demo: