Queries fail for column names with dashes

mindsdb / dfsql

SQL interface to Pandas

GNU General Public License v3.0

51 stars 9 forks source link

Queries fail for column names with dashes #17

Closed Seanny123 closed 3 years ago

Seanny123 commented 3 years ago

Given seq.csv:

sequence,affinity,seq-len
ARKKERLW,0.13,8
WAKWAKA,0.1,7
EEERRRKK,0.2,8

The code:

import pandas as pd
from dfsql import sql_query

df = pd.read_csv("demo/alt.csv")
res = sql_query("SELECT * FROM df WHERE seq-len > 7", df=df)
print(res)

Fails with the error:

dfsql.exceptions.QueryExecutionException: Column seq not found.

If the column is renamed seq-len -> seqlen and the query is adjusted accordingly, then no error occurs.

Seanny123 commented 3 years ago

Upon further investigation, this error seems to occur upstream with mindsdb_sql.parse_sql which converts:

SELECT * FROM df WHERE seq-len > 7

Into:

SELECT * FROM df WHERE seq - len > 7

Thus I've opened an issue in the upstream project https://github.com/mindsdb/mindsdb_sql/issues/4

btseytlin commented 3 years ago

@Seanny123 Thank you, I will investigate shortly!

Indeed, the current regex for identifiers (columns, tables, etc) does not allow dashes

btseytlin commented 3 years ago

@Seanny123 I have investigated the issue. SQL standard does not allow identifiers with dashes to avoid ambiguity (e.g. postgres docs). However you can eclose identifiers in ` brackets:

SELECT * FROM df WHERE `seq-len` > 7

This will work. You can even add spaces and other symbols inside and it will still be a valid identifier