sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

Quick discussion: Find a reasonable way to implement the parsing of SHOW TRAIN stmt #2094

Open lhw362950217 opened 4 years ago

lhw362950217 commented 4 years ago

Problem

As described in the design doc #2073, we want to add a new extended statement SHOW TRAIN to SQLFlow. When I dug into our code, I found that the way we parse a SQLFlow query is a bit tricky. That is to say, we first let the third-party parser do the job; if it reports an error, we split the query at the error point and try again with the first part. If that succeeds, we parse the remaining part with our extended parser. This works well as long as all our extended statements are in the SELECT ... TO ... form, like:

USE iris; SELECT * FROM train_table TO TRAIN xgboost.gbtree ...

However, as we introduce more extended syntaxes, say SHOW TRAIN or any other future statement not in the SELECT ... TO ... form, this parsing process no longer works. For example:

USE iris; SHOW TRAIN my_model;
               ^ error occurs, try to parse USE iris; SHOW;
                                                          ^ fail again, return nil
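
To make the current flow concrete, here is a minimal Go sketch of the split-and-retry strategy described above. The names Statement, thirdPartyParse, and extendedParse are hypothetical placeholders, not SQLFlow's actual types or functions.

```go
package parser

// Statement is a placeholder type for a parsed statement; SQLFlow's real
// intermediate representation differs.
type Statement struct{ Text string }

// Hypothetical stand-ins for the third-party parser (which also reports the
// byte offset where parsing failed) and for SQLFlow's extended parser.
var (
	thirdPartyParse func(sql string) (stmts []Statement, errPos int, err error)
	extendedParse   func(sql string) (Statement, error)
)

// parseSQLFlow sketches the current split-and-retry strategy: parse the whole
// input with the third-party parser; on error, retry with the prefix before
// the error position and hand the rest to the extended parser.
func parseSQLFlow(sql string) ([]Statement, error) {
	stmts, errPos, err := thirdPartyParse(sql)
	if err == nil {
		return stmts, nil // plain SQL, no extended part
	}
	head, tail := sql[:errPos], sql[errPos:]
	if stmts, _, err = thirdPartyParse(head); err != nil {
		return nil, err // the prefix itself is invalid
	}
	extended, err := extendedParse(tail) // works for "TO TRAIN ...", fails for "SHOW TRAIN ..."
	if err != nil {
		return nil, err
	}
	return append(stmts, extended), nil
}
```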

Here are some things we may consider:

Some ideas

Here we focus on queries with extended statements, which will fail third-party parsing.

Action

For now, I would prefer the first option because of its simplicity: it only needs minor modifications to the code. But the other ideas also make sense in some ways. I noticed we had some discussion before, so please help me with more context and suggestions!

tonyyang-svail commented 4 years ago

FYI, all parsers support parsing a single statement. The Calcite/Hive/ODPS parsers require the input to be one statement, and the TiDB parser can be reimplemented to support ParseOneStmt.

So maybe option 2 is doable?
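
If option 2 is adopted, the driver loop might look roughly like the sketch below, assuming a hypothetical ParseOneStmt-style wrapper that returns one parsed statement plus the number of bytes it consumed; none of these names are SQLFlow's real API.

```go
package parser

import "strings"

// Statement is a placeholder type for a parsed statement.
type Statement struct{ Text string }

// Hypothetical per-statement parsers: each returns the parsed statement and
// the number of bytes it consumed from the input.
var (
	parseOneStmt         func(sql string) (Statement, int, error)
	parseOneExtendedStmt func(sql string) (Statement, int, error)
)

// parseProgram consumes the input one statement at a time. Each statement is
// tried against the standard parser first and falls back to the extended
// grammar (e.g. SHOW TRAIN) when the standard parser rejects it.
func parseProgram(sql string) ([]Statement, error) {
	var stmts []Statement
	for rest := sql; strings.TrimSpace(rest) != ""; {
		stmt, n, err := parseOneStmt(rest)
		if err != nil {
			if stmt, n, err = parseOneExtendedStmt(rest); err != nil {
				return nil, err
			}
		}
		stmts = append(stmts, stmt)
		rest = rest[n:]
	}
	return stmts, nil
}
```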

linkerzhang commented 4 years ago

Option 1 may be buggy and option 3 will take much more effort. Option 2, "per-statement parsing", may require some extra effort to build the whole AST, e.g. the connections among statements.

llxxxll commented 4 years ago

In what scenarios would a user need to view the TO TRAIN SQL? If it is meant to help users build a TO PREDICT SQL, should we also print the data format of the COLUMNs, which would help users better define the input data?

lhw362950217 commented 4 years ago

In what scenarios would a user need to view the TO TRAIN SQL?

Right now it is mainly to let users review the configuration used at training time, for example which data source was used and how the model parameters were set. Today users have to record this information somewhere else as a reminder; with this statement, it can be parsed out of the model and displayed directly.

If it is meant to help users build a TO PREDICT SQL, should we also print the data format of the COLUMNs, which would help users better define the input data?

Yes. Once we add more metadata display later on, it can also be used when the model market shows models. We could output a new result column to describe the COLUMN information.

lhw362950217 commented 4 years ago

Option 1 may be buggy and option 3 will take much more effort. Option 2, "per-statement parsing", may require some extra effort to build the whole AST, e.g. the connections among statements.

Yes. For now we are doing only syntactic parsing, so it seems OK; I will give it a try. I was also concerned about the semantic correlation among statements, which is why I did not take this path at first. But after seeing @tonyyang-svail 's comment, I am more confident this is not a big problem, because parsers like Hive only support parsing one statement at a time.
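
One concrete case of cross-statement correlation is USE: later statements depend on the database it selects. Under per-statement parsing this can be handled with a little driver-side state, as in the hypothetical sketch below (parseUse and the Statement fields are illustrative only, not SQLFlow code).

```go
package parser

// Statement is a placeholder type for a parsed statement.
type Statement struct {
	Text string
	DB   string // database in effect when this statement runs
}

// parseUse is a hypothetical helper that reports whether a statement is
// `USE <db>` and, if so, which database it selects.
var parseUse func(stmt string) (db string, ok bool)

// attachCurrentDB shows one way to keep the cross-statement connection under
// per-statement parsing: a USE statement updates the current database, and
// every following statement is annotated with it.
func attachCurrentDB(stmts []Statement) []Statement {
	currentDB := ""
	for i := range stmts {
		if db, ok := parseUse(stmts[i].Text); ok {
			currentDB = db
		}
		stmts[i].DB = currentDB
	}
	return stmts
}
```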

lhw362950217 commented 4 years ago

Add context to our extended parser

I want to propose a way to add context info from the third-party parser's result to ease the parsing process of our extended parser. See the example: before, we only supported syntax rules in one form:

SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...

Now, we support:

1. SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...
2. SHOW TRAIN 'my_model'

Suppose we are parsing a SQL statement:

SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here

The third-party parser and our extended parser will both succeed, but this is obviously an invalid extended SQL statement. So I have to write some Go code like:

if thirdpartyParse(sql) && extendParse(sql) { // both parser accept
  if firstPartIsSelect(sql) && secondPartIsTo(sql) { // must satisfy hand-coded constraints
    // go on
  }
  // if we have more extended rules, we have to write more constraints here, like:
  if firstPartIsDescribe(sql) && secondPartIsModel(sql) {
    // maybe describing a model
  }
}

Can we put these constraints into our extended syntax rules? I think the answer is yes. When the third-party parsing finishes, we already know the input is a SELECT statement with an extended part (because there is no ';' or '\n' to indicate the end of the statement), and the ONLY way for it to be valid is for it to be followed by a TO. So we can pass this information as a context to our extended parser. We do this:

SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here
SELECT * FROM iris.train <select> SHOW TRAIN 'my_table'
                         ^  insert an imaginary token here
                                  ^ error will be reported here without any
                                    hand-coded logic, because SHOW can't 
                                    happen in <select> context in extended rules

and rewrite our parser rule as:

sqlflow_stmt
: <select> TO TRAIN ...   // we know `TO TRAIN` must in a <select> context
| <select> TO PREDICT ...
| <select> TO EXPLAIN ...
| <show> IDENTIFY // maybe our show train is in a <show> context
| <;> pure extended stmt // pure extended stmt maybe in a 'after-EOL' context
;

Here <select>, <show>, and <;> are just terminals that represent contexts. In the implementation, we may prefix the second half with these literals, and they will be extracted as new terminals by our lexer. This is roughly equivalent to parsing our rules on top of the third-party parser's stack.
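
As a rough illustration of the "imaginary token" idea, the sketch below prepends a context literal to the extended part before invoking the extended parser. The token spellings and helper names are made up for illustration and are not SQLFlow code.

```go
package parser

// Statement is a placeholder type for a parsed statement.
type Statement struct{ Text string }

// extendedParse is a hypothetical extended parser whose lexer turns the
// context literals below into the <select> / <;> terminals of the grammar.
var extendedParse func(sql string) (Statement, error)

// Context literals prepended to the extended part; the spellings are made up.
const (
	ctxSelect = " __SQLFLOW_CTX_SELECT__ " // split happened inside a SELECT
	ctxEOS    = " __SQLFLOW_CTX_EOS__ "    // split happened at a statement boundary
)

// parseExtendedWithContext injects the imaginary context token before calling
// the extended parser, so that e.g. SHOW TRAIN after a SELECT is rejected by
// the grammar itself rather than by hand-coded checks.
func parseExtendedWithContext(rest string, afterSelect bool) (Statement, error) {
	ctx := ctxEOS
	if afterSelect {
		ctx = ctxSelect
	}
	return extendedParse(ctx + rest)
}
```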

Give me some hints if you have better ideas!

About the syntax

By the way, DESCRIBE MODEL my_model (as we discussed) is already valid SQL that shows a column's metadata, so we can't use it for now.

lhw362950217 commented 4 years ago

Conclusion

After our discussion, we have made the following decisions:

  1. use the per-statement parsing strategy (proposal 2 in the first section)
  2. use SHOW TRAIN 'model_name' as the syntax rule
  3. for now, only return the original training statement as a column; in the future, we may add other metadata as extra columns of the result
  4. do not add context to the extended syntax, since we are not yet sure what form future syntaxes will take; instead, use hand-coded checks to make sure the parsed statement is valid
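
For decision 4, the hand-coded check could look roughly like the sketch below: accept the split only if a SELECT prefix is followed by a TO clause, or if the extended statement stands alone at a statement boundary. The helper isSelect and the function itself are hypothetical, not SQLFlow's actual implementation.

```go
package parser

import "strings"

// isSelect is a hypothetical helper that reports whether the third-party
// parser recognized the prefix as a SELECT statement.
var isSelect func(head string) bool

// validSplit is the kind of hand-coded check decision 4 refers to: accept a
// combined parse only when a SELECT prefix is followed by a TO clause, or
// when the extended statement (currently only SHOW TRAIN) stands on its own.
func validSplit(head, extended string) bool {
	ext := strings.ToUpper(strings.TrimSpace(extended))
	if strings.TrimSpace(head) == "" {
		return strings.HasPrefix(ext, "SHOW TRAIN")
	}
	if isSelect(head) {
		return strings.HasPrefix(ext, "TO ")
	}
	return false
}
```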