pydata / patsy

Describing statistical models in Python using symbolic formulas

Backticks `x` as an alias for Q('x') #100

Open DSLituiev opened 7 years ago

DSLituiev commented 7 years ago

This is a suggestion to implement backticks as an alias for quoting Q('...'). E.g.:

Q('x/y')  
  ==
`x/y` 

Rationale:

  1. Traditional: R syntax allows addressing fields as:
    data$`x/y`
  2. User convenience: fewer characters to type, and no need to juggle two different kinds of quotation marks.

njsmith commented 7 years ago

Interesting idea. Thinking about the tradeoffs, there are two downsides I can see:

(1) Backticks look very similar to single-quotes. In Python in general the BDFL has pronounced that backticks won't be assigned any meaning, because of this usability problem ("syntax shouldn't look like grit on Tim's monitor"). I guess this is also somewhat of an advantage for us b/c it means that they won't be assigned any other meaning.

(2) Patsy currently relies on Python's tokenizer. Because Python doesn't use backticks as a quoting marker, the Python tokenizer crashes if fed backticks:

In [6]: list(patsy.tokens.python_tokenize("foo + `baz`"))
---------------------------------------------------------------------------
PatsyError                                Traceback (most recent call last)
<ipython-input-6-024d1474cf98> in <module>()
----> 1 list(patsy.tokens.python_tokenize("foo + `baz`"))

/home/njs/.user-python3.5-64bit/lib/python3.5/site-packages/patsy/tokens.py in python_tokenize(code)
     37                 raise PatsyError("error tokenizing input "
     38                                  "(maybe an unclosed string?)",
---> 39                                  origin)
     40             if pytype == tokenize.COMMENT:
     41                 raise PatsyError("comments are not allowed", origin)

PatsyError: error tokenizing input (maybe an unclosed string?)
    foo + `baz`
         ^

So the only way to implement this would be to fork our own copy of the tokenizer, and then make sure to keep it up to date with each Python release. (Actually, we would need multiple forks - at least one for python 2 and one for python 3, maybe more.) Unfortunately I don't see any way to really make this viable :-(
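
For illustration, the underlying behaviour can be reproduced with the standard library alone. This is only a minimal sketch, not patsy's code: the helper name `tokenizes_cleanly` is made up here, and it assumes that, depending on the Python version, the tokenizer either yields an `ERRORTOKEN` for the backtick or raises an exception.

```python
import io
import tokenize


def tokenizes_cleanly(code):
    """Return True if Python's tokenizer accepts `code` without errors.

    Hypothetical helper for illustration; not part of patsy.
    """
    try:
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.ERRORTOKEN:
                # The tokenizer met a character it cannot classify.
                return False
    except (tokenize.TokenError, SyntaxError):
        # Some Python versions raise instead of yielding an ERRORTOKEN.
        return False
    return True


print(tokenizes_cleanly("foo + baz"))    # plain formula terms
print(tokenizes_cleanly("foo + `baz`"))  # backticks are not Python syntax
```

The first call should report success and the second failure, which is why any backtick support has to sit either in front of the tokenizer or inside a modified copy of it.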

DSLituiev commented 7 years ago

How about putting a thin layer on it:

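import re

def replace_backticks(x):
    # Nothing to rewrite if the formula contains no backticks.
    if "`" not in x:
        return x
    # Match a pair of backticks with no backtick in between.
    pttrn = re.compile(r"`([^`]*)`")
    def repl(m):
        # Wrap the quoted name in a Q('...') call.
        return "Q('" + m.group(1) + "')"
    return pttrn.sub(repl, x)

testlist = ["a ~ `50%`",
            "t + `x/2` = `y` + `z`",
            "`x%z` ~ `a.z`",
            "a` ~ 12",
            "y~x-1"]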
for x in testlist:
    result = replace_backticks(x)
    print("="*20)
    print(x)
    print(result)

Returns:

====================
a ~ `50%`
a ~ Q('50%')
====================
t + `x/2` = `y` + `z`
t + Q('x/2') = Q('y') + Q('z')
====================
`x%z` ~ `a.z`
Q('x%z') ~ Q('a.z')
====================
a` ~ 12
a` ~ 12
====================
y~x-1
y~x-1

Note that example 4 is broken: the unmatched backtick is passed through unchanged.

njsmith commented 7 years ago

Other broken cases include things like the (odd but currently valid):

Q("foo`bar")

I guess this isn't too bad because backticks are very rarely used, but... I dunno. I really like the thing where we use a real parser with fully-defined behavior.

njsmith commented 7 years ago

I guess the other option would be some sort of fancy error-recovery support, where if lexing crashes we detect this case (the first unparsed character is backtick) and recover. Sounds messy but potentially doable...

DSLituiev commented 7 years ago

Here is a version that handles backticks within Q('...'):

import re

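def _check_backticks_within_Q_(x):
    # True if any Q('...') or Q("...") call in x contains a backtick.
    # The raw string avoids invalid-escape warnings; \s* tolerates spaces
    # between the parentheses and the quotes, as in Q( 'x`' ).
    pttrn = re.compile(r"Q\(\s*['\"].*`.*['\"]\s*\)")
    return pttrn.search(x) is not None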

def _replace_backticks_(m):
    return "Q('" + m.group(1) + "')"

def replace_backticks(x):
    if "`" not in x:
        return x
    elif _check_backticks_within_Q_(x):
        return x
    pttrn = re.compile("`([^`]*)`")
    return pttrn.sub(_replace_backticks_, x)

testlist = ["a ~ `50%`",
            "t + `x/2` = `y` + `z`",
            "`x%z` ~ `x!#%^`",
            "y~x-1",
            "y ~ Q('x`')",
            "y ~ Q('`x`')",
            "w ~ Q( ' x`!#%^' ) + Q('r1`')",
            'w ~ Q( " x`!#%^" )']

for x in testlist:
    result = replace_backticks(x)
    print("="*20)
    print(x)
    print(result)

Output:

====================
a ~ `50%`
a ~ Q('50%')
====================
t + `x/2` = `y` + `z`
t + Q('x/2') = Q('y') + Q('z')
====================
`x%z` ~ `x!#%^`
Q('x%z') ~ Q('x!#%^')
====================
y~x-1
y~x-1
====================
y ~ Q('x`')
y ~ Q('x`')
====================
y ~ Q('`x`')
y ~ Q('`x`')
====================
w ~ Q( ' x`!#%^' ) + Q('r1`')
w ~ Q( ' x`!#%^' ) + Q('r1`')
====================
w ~ Q( " x`!#%^" )
w ~ Q( " x`!#%^" )
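
The check above is all-or-nothing: as soon as one Q('...') contains a backtick, no backtick-quoted name anywhere in the formula gets rewritten. A possible alternative, sketched here under stated assumptions, is to scan the formula once and skip over string literals, so that only backticks outside quotes are treated as quoting. The function name `replace_backticks_outside_strings` is hypothetical and not part of patsy. The sketch handles plain single- and double-quoted literals with backslash escapes, but not triple-quoted strings or string prefixes.

```python
def replace_backticks_outside_strings(formula):
    """Rewrite `name` as Q('name'), leaving string literals untouched.

    Hypothetical helper for illustration; not part of patsy.
    """
    out = []
    i = 0
    n = len(formula)
    while i < n:
        ch = formula[i]
        if ch in "'\"":
            # Copy a string literal verbatim, stepping over escaped characters.
            j = i + 1
            while j < n and formula[j] != ch:
                j += 2 if formula[j] == "\\" else 1
            out.append(formula[i:j + 1])
            i = j + 1
        elif ch == "`":
            # A backtick outside any string literal opens a quoted name.
            end = formula.find("`", i + 1)
            if end == -1:
                raise ValueError(f"unmatched backtick at position {i}")
            name = formula[i + 1:end]
            # repr() yields a valid Python literal even if the name has quotes.
            out.append(f"Q({name!r})")
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(replace_backticks_outside_strings("`a` ~ Q('x`') + `50%`"))
# prints: Q('a') ~ Q('x`') + Q('50%')
print(replace_backticks_outside_strings('y ~ Q("foo`bar")'))
# prints: y ~ Q("foo`bar")
```

Unlike the regex version, an unmatched backtick such as the one in "a` ~ 12" raises a ValueError here instead of being passed through silently. This is still a preprocessing layer in front of the real tokenizer, so it does not answer the concern about having fully-defined parsing behavior.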