simonw / datasette-graphql

Datasette plugin providing an automatic GraphQL API for your SQLite databases
https://datasette-graphql-demo.datasette.io/
Apache License 2.0
101 stars 6 forks source link

Handle tables or columns with invalid names according to GraphQL #10

Closed simonw closed 4 years ago

simonw commented 4 years ago

What makes for a valid GraphQL identifier?

simonw commented 4 years ago

I just hit this trying to run against https://fivethirtyeight.datasettes.com/fivethirtyeight.db:

  File "/Users/simon/.local/share/virtualenvs/datasette-graphql-n1OSJCS8/lib/python3.8/site-packages/graphql/type/definition.py", line 180, in __init__
    assert_valid_name(name)
  File "/Users/simon/.local/share/virtualenvs/datasette-graphql-n1OSJCS8/lib/python3.8/site-packages/graphql/utils/assert_valid_name.py", line 10, in assert_valid_name
    assert COMPILED_NAME_PATTERN.match(
AssertionError: Names must match /^[_a-zA-Z][_a-zA-Z0-9]*$/ but "./index" does not.

That's a complaint about https://fivethirtyeight.datasettes.com/fivethirtyeight/.%2Findex

simonw commented 4 years ago

So regex is ^[_a-zA-Z][_a-zA-Z0-9]*$

https://www.debuggex.com/r/DcZGto0Xw4YE7mRo

Debuggex__Online_visual_regex_tester__JavaScript__Python__and_PCRE_
simonw commented 4 years ago

This happens when you try to use GraphQL against the Datasette fixtures.db database:

Names must match /^[_a-zA-Z][_a-zA-Z0-9]*$/ but "123_starts_with_digitsCollection" does not.

simonw commented 4 years ago

So the rules are a little different from those handled by the to_css_class method in Datasette utils, since that doesn't have a rule about names starting with a non-digit: https://github.com/simonw/datasette/blob/7702ea602188899ee9b0446a874a6a9b546b564d/datasette/utils/__init__.py#L575-L599

css_class_re = re.compile(r"^[a-zA-Z]+[_a-zA-Z0-9-]*$")
css_invalid_chars_re = re.compile(r"[^a-zA-Z0-9_\-]")

def to_css_class(s):
    """
    Given a string (e.g. a table name) returns a valid unique CSS class.
    For simple cases, just returns the string again. If the string is not a
    valid CSS class (we disallow - and _ prefixes even though they are valid
    as they may be confused with browser prefixes) we strip invalid characters
    and add a 6 char md5 sum suffix, to make sure two tables with identical
    names after stripping characters don't end up with the same CSS class.
    """
    if css_class_re.match(s):
        return s
    md5_suffix = hashlib.md5(s.encode("utf8")).hexdigest()[:6]
    # Strip leading _, -
    s = s.lstrip("_").lstrip("-")
    # Replace any whitespace with hyphens
    s = "-".join(s.split())
    # Remove any remaining invalid characters
    s = css_invalid_chars_re.sub("", s)
    # Attach the md5 suffix
    bits = [b for b in (s, md5_suffix) if b]
    return "-".join(bits)

I need something a bit different here.

simonw commented 4 years ago

Easy option: strip out any characters not in [_a-zA-Z0-9] and, if it doesn't start with a letter or underscore, prefix it with an underscore.

Need to also check that it doesn't collide with the name of another table that has undergone the same transformation. For that, can add a suffix of _ or _2 or similar.

simonw commented 4 years ago

I need to apply the same rules to column names AND to those tablename_list nested fields too.

simonw commented 4 years ago

I think a Namer class which I instantiate and then feed raw values to get back valid distinct names that haven't been used in that context yet.

simonw commented 4 years ago
class Namer:
    def __init__(self):
        self.names= set()

    def name(self, value):
        value = strip_invalid_chars(value)
        if not value:
            value = "_"
        if value[0].isdigit():
            value = "_" + value
        suffix = 2
        orig = value
        while value in self.names:
            value = orig += "_" + str(suffix)
            suffix += 1
        self.names.add(value)
        return value
simonw commented 4 years ago

From a code point of view the column names will be the fiddliest bit, since they're used in a bunch of different ways. They only need to be unique with respect to their table though - no need to generate id and then id_2 etc for uniqueness across multiple different tables.