Open lhw362950217 opened 4 years ago
FYI, all parsers support parsing one statement. The Calcite/Hive/ODPS parsers require the input to be exactly one statement. The TiDB parser can be reimplemented to support ParseOneStmt.
So maybe option 2 is doable?
Option 1 may be buggy and option 3 will take much more effort, though option 2 ("per-statement parsing") may require some effort to build the whole AST, say, the connections among statements.
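In what scenarios would someone need to view the TO TRAIN SQL?
If the goal is to help users construct the TO PREDICT SQL, should we also print the data format of COLUMN? That would help users define their input data better.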
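> In what scenarios would someone need to view the TO TRAIN SQL?
For now it is mainly a convenient way for users to review the configuration used at training time, such as which data source was used and how the model parameters were set. Today users still have to note these down somewhere else; with this statement, they can be parsed directly from the model and displayed.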
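> If the goal is to help users construct the TO PREDICT SQL, should we also print the data format of COLUMN? That would help users define their input data better.
Yes. Once more metadata is shown later on, it can also be used when displaying models in the model market. We can output a new result column to describe the COLUMN information.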
> Option 1 may be buggy and option 3 will take much more effort, though option 2 ("per-statement parsing") may require some effort to build the whole AST, say, the connections among statements.
Yes. For now we are only doing syntactic parsing, so it seems OK; I will give it a try. I was also concerned about the semantic correlation among statements, which is why I did not take this approach at first. But after seeing @tonyyang-svail's comment, I'm more confident this is not a big problem, because parsers like Hive's only support parsing one statement at a time.
I want to propose a way to add context information to the third-party parser's result, to ease the parsing done by our extended parser. Here is an example. Before, we only supported syntax rules in one form:
SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...
Now, we support:
1. SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...
2. SHOW TRAIN 'my_model'
Suppose we are parsing this SQL:
SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here
The third-party parser and our extended parser will both report success, but this is obviously an invalid extended SQL statement. So I have to write some Go code like:
if thirdpartyParse(sql) && extendParse(sql) { // both parsers accept
    if firstPartIsSelect(sql) && secondPartIsTo(sql) { // must satisfy hand-coded constraints
        // go on
    }
    // if we have more extended rules, we have to write more constraints here, like:
    if firstPartIsDescribe(sql) && secondPartIsModel(sql) {
        // maybe describing a model
    }
}
Can we put these constraints into our extended syntax rules? I think the answer is yes.
When the third-party parsing finishes, we already know the input is a SELECT statement with an extended part (because there is no ';' or '\n' to indicate an EOL), and the ONLY way for it to be valid is to be followed by a TO. So we can pass this information to our extended parser as context. We do this:
SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here
SELECT * FROM iris.train <select> SHOW TRAIN 'my_table'
                         ^ insert an imaginary token here
                                  ^ error will be reported here without any
                                    hand-coded logic, because SHOW can't
                                    happen in <select> context in extended rules
and rewrite our parser rule as:
sqlflow_stmt
    : <select> TO TRAIN ...     // we know `TO TRAIN` must be in a <select> context
    | <select> TO PREDICT ...
    | <select> TO EXPLAIN ...
    | <show> IDENTIFY           // maybe our show train is in a <show> context
    | <;> pure extended stmt    // a pure extended stmt may be in an 'after-EOL' context
    ;
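The token-insertion step can be sketched in Go as follows. This is only an illustration under stated assumptions: `contextToken`, `injectContext`, `allowedAfter`, and `acceptsExtended` are hypothetical names, the mapping of `<;>` to `SHOW` is a guess, and the lookup table merely stands in for the generated extended parser, which would enforce the same constraint through the grammar rules above.

```go
package main

import (
	"fmt"
	"strings"
)

// contextToken maps the statement accepted by the third-party parser to an
// imaginary token. Only the cases named in the grammar above are covered;
// the mapping itself is an assumption.
func contextToken(firstPart string) (string, bool) {
	fields := strings.Fields(firstPart)
	if len(fields) == 0 {
		return "<;>", true // nothing precedes the extended statement
	}
	if strings.EqualFold(fields[0], "SELECT") {
		return "<select>", true
	}
	return "", false
}

// injectContext inserts the imaginary token at the split position reported
// by the third-party parser.
func injectContext(sql string, split int) (string, error) {
	if split < 0 || split > len(sql) {
		return "", fmt.Errorf("split position %d out of range", split)
	}
	tok, ok := contextToken(sql[:split])
	if !ok {
		return "", fmt.Errorf("no context token for %q", sql[:split])
	}
	return sql[:split] + tok + " " + sql[split:], nil
}

// allowedAfter is a stand-in for the generated extended parser: it lists
// which keyword may open the extended part in each context (assumed).
var allowedAfter = map[string][]string{
	"<select>": {"TO"},
	"<;>":      {"SHOW"},
}

// acceptsExtended reports whether the extended part may start in the given context.
func acceptsExtended(tok, extended string) bool {
	fields := strings.Fields(extended)
	if len(fields) == 0 {
		return false
	}
	for _, kw := range allowedAfter[tok] {
		if strings.EqualFold(fields[0], kw) {
			return true
		}
	}
	return false
}

func main() {
	sql := "SELECT * FROM iris.train SHOW TRAIN 'my_table'"
	split := strings.Index(sql, "SHOW")
	rewritten, err := injectContext(sql, split)
	fmt.Println(rewritten, err)
	// SHOW is rejected in the <select> context without per-rule if statements.
	fmt.Println(acceptsExtended("<select>", sql[split:]))
}
```

With a real generated parser, only `injectContext` would remain in Go; the per-context constraints would live in the syntax rules.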
Give me some hints if you have better ideas!
By the way, DESCRIBE MODEL my_model (as we discussed) is already valid SQL that shows a column's metadata, so we can't use it for now.
After our discussion, we have made three decisions as follows:
SHOW TRAIN 'model_name' as the syntax rule
Problem
As written in design doc #2073, we want to add a new extended statement, SHOW TRAIN, to SQLFlow. When I dug into our code, I found that the way we parse a SQLFlow query is a bit tricky. We first let the third-party parser do the job; if it reports an error, we split the query at the error position and try again with the first part. If that succeeds, we parse the remaining part with our extended parser. This works well as long as all our extended statements are in the SELECT ... TO ... form, like:
However, as we introduce more extended syntax, say SHOW TRAIN or any other future statement not in the SELECT ... TO ... form, the query can't pass the original parsing process. Example here:
Here are things we may consider:
USE statement instead of an empty set
Some ideas
Here we focus on queries that contain an extended statement, which fail in third-party parsing.
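For reference, the current split-and-retry flow described under Problem can be sketched roughly like this. Everything here is an assumption for illustration: `thirdPartyParse` and `extendedParse` are hypothetical stand-ins (the first simply "fails" at the first TO or SHOW keyword), and the real parsers behave differently per dialect.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// thirdPartyParse is a hypothetical stand-in for a dialect parser. It returns
// the byte offset of the first word it cannot accept, or -1 if the whole
// input parses. Here it gives up at the first TO or SHOW keyword.
func thirdPartyParse(sql string) int {
	for i := 0; i < len(sql); {
		for i < len(sql) && sql[i] == ' ' {
			i++ // skip spaces between words
		}
		start := i
		for i < len(sql) && sql[i] != ' ' {
			i++ // consume one word
		}
		word := strings.ToUpper(sql[start:i])
		if word == "TO" || word == "SHOW" {
			return start
		}
	}
	return -1
}

// extendedParse is a stand-in for the extended parser. It only knows the
// "TO TRAIN/PREDICT/EXPLAIN ..." form, which is the limitation discussed here.
func extendedParse(s string) error {
	fields := strings.Fields(s)
	if len(fields) >= 2 && strings.EqualFold(fields[0], "TO") {
		switch strings.ToUpper(fields[1]) {
		case "TRAIN", "PREDICT", "EXPLAIN":
			return nil
		}
	}
	return fmt.Errorf("extended parser rejects %q", s)
}

// parseSQLFlow follows the flow: parse, split at the error position,
// retry the first part, then hand the rest to the extended parser.
func parseSQLFlow(sql string) (standard, extended string, err error) {
	pos := thirdPartyParse(sql)
	if pos < 0 {
		return sql, "", nil // plain SQL, no extended part
	}
	standard, extended = strings.TrimSpace(sql[:pos]), sql[pos:]
	if thirdPartyParse(standard) >= 0 {
		return "", "", errors.New("standard part does not parse")
	}
	if err = extendedParse(extended); err != nil {
		return "", "", err
	}
	return standard, extended, nil
}

func main() {
	fmt.Println(parseSQLFlow("SELECT * FROM iris.train TO TRAIN DNNClassifier"))
	fmt.Println(parseSQLFlow("SHOW TRAIN 'my_model'")) // rejected by this flow
}
```

In this sketch a statement that does not follow the SELECT ... TO ... shape never reaches a rule that could accept it, which is the gap the ideas below try to close.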
Extend our parser to (partially) handle all queries: let our parser extract the SELECT ... TO ... pattern and then pass the SELECT ... part to the third-party parser. Note that we are not implementing a full-featured parser; we only use an error-recovery mechanism to extract the expected part. When we get an error, we should still pass the erroring part to the third-party parser. Some parser generators support this kind of error recovery, so we can write these strategies as syntax rules rather than as hard-coded logic alongside our parser.
pros: a unified way of extending; we can control more aspects of parsing
cons: hard to implement; since dialects may differ in many ways, we need a lot of effort to tune the lexer and parser
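A rough Go sketch of the extraction step in this idea is given below. It uses a hand-written scanner in place of the error-recovery rules a parser generator would provide, so it illustrates the intent rather than the proposed implementation. `splitExtended` and its helpers are hypothetical names, and only single-quoted strings are handled, which already hints at the dialect-tuning cost listed under cons.

```go
package main

import (
	"fmt"
	"strings"
)

func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r'
}

func isExtendedVerb(w string) bool {
	switch strings.ToUpper(w) {
	case "TRAIN", "PREDICT", "EXPLAIN":
		return true
	}
	return false
}

// splitExtended returns the standard part and the extended part of sql,
// splitting at the first TO TRAIN/PREDICT/EXPLAIN found outside
// single-quoted strings. If none is found, extended is empty.
func splitExtended(sql string) (standard, extended string) {
	inQuote := false
	for i := 0; i < len(sql); i++ {
		if sql[i] == '\'' {
			inQuote = !inQuote // a doubled '' toggles twice, leaving the state unchanged
			continue
		}
		atWordStart := i == 0 || isSpace(sql[i-1])
		if inQuote || !atWordStart || i+2 >= len(sql) {
			continue
		}
		if !strings.EqualFold(sql[i:i+2], "TO") || !isSpace(sql[i+2]) {
			continue
		}
		if rest := strings.Fields(sql[i+2:]); len(rest) > 0 && isExtendedVerb(rest[0]) {
			return strings.TrimSpace(sql[:i]), sql[i:]
		}
	}
	return sql, ""
}

func main() {
	standard, extended := splitExtended("SELECT * FROM iris.train TO TRAIN DNNClassifier")
	// standard would go to the third-party parser, extended to our own rules.
	fmt.Printf("standard=%q extended=%q\n", standard, extended)
}
```

Comments, double-quoted identifiers, and backticks are not handled here; each dialect would need its own lexer rules for those.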
Action
For now, I would prefer the first one because of its simplicity: it only needs minor modifications to the code. But the other ideas also seem to make sense in some ways. I noticed we had some discussion about this before, so please help me with more context and suggestions!