sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

Quick discussion: Find a reasonable way to implement the parsing of SHOW TRAIN stmt #2094

Open lhw362950217 opened 4 years ago

lhw362950217 commented 4 years ago

Problem

As described in the design doc #2073, we want to add a new extended statement SHOW TRAIN to SQLFlow. When I dug into our code, I found that the way we parse a SQLFlow query is a bit tricky. That is to say, we first let the third-party parser do the job; if it reports an error, we split the query at the error point and try again with the first part. If that succeeds, we parse the remaining part with our extended parser. This works well as long as all our extended statements are in the SELECT ... TO ... form, like:

USE iris; SELECT * FROM train_table TO TRAIN xgboost.gbtree ...

However, as we introduce more extended syntaxes, say SHOW TRAIN or any other future statement not in the SELECT ... TO ... form, this parsing process no longer works. For example:

USE iris; SHOW TRAIN my_model;
               ^ error occurs, try to parse USE iris; SHOW;
                                                          ^ fail again, return nil
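
To make the current flow concrete, here is a minimal Go sketch of the split-and-retry strategy described above. The names Statement, thirdPartyParse, and extendedParse are hypothetical placeholders, not SQLFlow's actual types or functions.

```go
package parser

// Statement is a placeholder type for a parsed statement; SQLFlow's real
// intermediate representation differs.
type Statement struct{ Text string }

// Hypothetical stand-ins for the third-party parser (which also reports the
// byte offset where parsing failed) and for SQLFlow's extended parser.
var (
	thirdPartyParse func(sql string) (stmts []Statement, errPos int, err error)
	extendedParse   func(sql string) (Statement, error)
)

// parseSQLFlow sketches the current split-and-retry strategy: parse the whole
// input with the third-party parser; on error, retry with the prefix before
// the error position and hand the rest to the extended parser.
func parseSQLFlow(sql string) ([]Statement, error) {
	stmts, errPos, err := thirdPartyParse(sql)
	if err == nil {
		return stmts, nil // plain SQL, no extended part
	}
	head, tail := sql[:errPos], sql[errPos:]
	if stmts, _, err = thirdPartyParse(head); err != nil {
		return nil, err // the prefix itself is invalid
	}
	extended, err := extendedParse(tail) // works for "TO TRAIN ...", fails for "SHOW TRAIN ..."
	if err != nil {
		return nil, err
	}
	return append(stmts, extended), nil
}
```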

Here are some things we may consider:

Some ideas

Here we focus on queries with extended statements, which will fail third-party parsing.

Action

For now, I would prefer the first option because of its simplicity: it only needs minor modifications to the code. But the other ideas also make sense in some ways. I noticed we had some discussion before, so please help me with more context and suggestions!

tonyyang-svail commented 4 years ago

FYI, all parsers support parsing a single statement. The Calcite/Hive/ODPS parsers require the input to be one statement, and the TiDB parser can be reimplemented to support ParseOneStmt.

So maybe option 2 is doable?
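
If option 2 is adopted, the driver loop might look roughly like the sketch below, assuming a hypothetical ParseOneStmt-style wrapper that returns one parsed statement plus the number of bytes it consumed; none of these names are SQLFlow's real API.

```go
package parser

import "strings"

// Statement is a placeholder type for a parsed statement.
type Statement struct{ Text string }

// Hypothetical per-statement parsers: each returns the parsed statement and
// the number of bytes it consumed from the input.
var (
	parseOneStmt         func(sql string) (Statement, int, error)
	parseOneExtendedStmt func(sql string) (Statement, int, error)
)

// parseProgram consumes the input one statement at a time. Each statement is
// tried against the standard parser first and falls back to the extended
// grammar (e.g. SHOW TRAIN) when the standard parser rejects it.
func parseProgram(sql string) ([]Statement, error) {
	var stmts []Statement
	for rest := sql; strings.TrimSpace(rest) != ""; {
		stmt, n, err := parseOneStmt(rest)
		if err != nil {
			if stmt, n, err = parseOneExtendedStmt(rest); err != nil {
				return nil, err
			}
		}
		stmts = append(stmts, stmt)
		rest = rest[n:]
	}
	return stmts, nil
}
```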

linkerzhang commented 4 years ago

Option 1 may be buggy and option 3 will take much more effort. Option 2, "per-statement parsing", may require some extra effort to build the whole AST, e.g. the connections among statements.

llxxxll commented 4 years ago

In what scenarios would a user need to view the TO TRAIN SQL? If it is meant to help users build a TO PREDICT SQL, should we also print the data format of the COLUMNs, which would help users better define the input data?

lhw362950217 commented 4 years ago

In what scenarios would a user need to view the TO TRAIN SQL?

Right now it is mainly to let users review the configuration used at training time, for example which data source was used and how the model parameters were set. Today users have to record this information somewhere else as a reminder; with this statement, it can be parsed out of the model and displayed directly.

If it is meant to help users build a TO PREDICT SQL, should we also print the data format of the COLUMNs, which would help users better define the input data?

Yes. Once we add more metadata display later on, it can also be used when the model market shows models. We could output a new result column to describe the COLUMN information.

lhw362950217 commented 4 years ago

Option 1 may be buggy and option 3 will take much more effort. Option 2, "per-statement parsing", may require some extra effort to build the whole AST, e.g. the connections among statements.

Yes. For now we are doing only syntactic parsing, so it seems OK; I will give it a try. I was also concerned about the semantic correlation among statements, which is why I did not take this path at first. But after seeing @tonyyang-svail 's comment, I am more confident this is not a big problem, because parsers like Hive only support parsing one statement at a time.
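
One concrete case of cross-statement correlation is USE: later statements depend on the database it selects. Under per-statement parsing this can be handled with a little driver-side state, as in the hypothetical sketch below (parseUse and the Statement fields are illustrative only, not SQLFlow code).

```go
package parser

// Statement is a placeholder type for a parsed statement.
type Statement struct {
	Text string
	DB   string // database in effect when this statement runs
}

// parseUse is a hypothetical helper that reports whether a statement is
// `USE <db>` and, if so, which database it selects.
var parseUse func(stmt string) (db string, ok bool)

// attachCurrentDB shows one way to keep the cross-statement connection under
// per-statement parsing: a USE statement updates the current database, and
// every following statement is annotated with it.
func attachCurrentDB(stmts []Statement) []Statement {
	currentDB := ""
	for i := range stmts {
		if db, ok := parseUse(stmts[i].Text); ok {
			currentDB = db
		}
		stmts[i].DB = currentDB
	}
	return stmts
}
```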

lhw362950217 commented 4 years ago

Add context to our extended parser

I want to propose a way to add context info from the third-party parser's result to ease the parsing process of our extended parser. See the example: before, we only supported syntax rules in one form:

SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...

Now, we support:

1. SELECT * FROM iris.train TO TRAIN/PREDICT/EXPLAIN ...
2. SHOW TRAIN 'my_model'

Suppose we are parsing a SQL statement:

SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here

The third-party parser and our extended parser will both succeed, but this is obviously an invalid extended SQL statement. So I have to write some Go code like:

if thirdpartyParse(sql) && extendParse(sql) { // both parser accept
  if firstPartIsSelect(sql) && secondPartIsTo(sql) { // must satisfy hand-coded constraints
    // go on
  }
  // if we have more extended rules, we have to write more constraints here, like:
  if firstPartIsDescribe(sql) && secondPartIsModel(sql) {
    // maybe describing a model
  }
}

Can we put these constraints into our extended syntax rules? I think the answer is yes. When the third-party parsing finishes, we already know the input is a SELECT statement with an extended part (because there is no ';' or '\n' to indicate the end of the statement), and the ONLY way for it to be valid is for it to be followed by a TO. So we can pass this information as a context to our extended parser. We do this:

SELECT * FROM iris.train SHOW TRAIN 'my_table'
                         ^ split here
SELECT * FROM iris.train <select> SHOW TRAIN 'my_table'
                         ^  insert an imaginary token here
                                  ^ error will be reported here without any
                                    hand-coded logic, because SHOW can't 
                                    happen in <select> context in extended rules

and rewrite our parser rule as:

sqlflow_stmt
: <select> TO TRAIN ...   // we know `TO TRAIN` must in a <select> context
| <select> TO PREDICT ...
| <select> TO EXPLAIN ...
| <show> IDENTIFY // maybe our show train is in a <show> context
| <;> pure extended stmt // pure extended stmt maybe in a 'after-EOL' context
;

Here <select>, <show>, and <;> are just terminals that represent contexts. In the implementation, we may prefix the second half with these literals, and they will be extracted as new terminals by our lexer. This is roughly equivalent to parsing our rules on top of the third-party parser's stack.
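
As a rough illustration of the "imaginary token" idea, the sketch below prepends a context literal to the extended part before invoking the extended parser. The token spellings and helper names are made up for illustration and are not SQLFlow code.

```go
package parser

// Statement is a placeholder type for a parsed statement.
type Statement struct{ Text string }

// extendedParse is a hypothetical extended parser whose lexer turns the
// context literals below into the <select> / <;> terminals of the grammar.
var extendedParse func(sql string) (Statement, error)

// Context literals prepended to the extended part; the spellings are made up.
const (
	ctxSelect = " __SQLFLOW_CTX_SELECT__ " // split happened inside a SELECT
	ctxEOS    = " __SQLFLOW_CTX_EOS__ "    // split happened at a statement boundary
)

// parseExtendedWithContext injects the imaginary context token before calling
// the extended parser, so that e.g. SHOW TRAIN after a SELECT is rejected by
// the grammar itself rather than by hand-coded checks.
func parseExtendedWithContext(rest string, afterSelect bool) (Statement, error) {
	ctx := ctxEOS
	if afterSelect {
		ctx = ctxSelect
	}
	return extendedParse(ctx + rest)
}
```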

Give me some hints if you have better ideas!

About the syntax

By the way, DESCRIBE MODEL my_model (as we discussed) is already valid SQL that shows a column's metadata, so we can't use it for now.

lhw362950217 commented 4 years ago

Conclusion

After our discussion, we have made the following decisions:

  1. use the per-statement parsing strategy (proposal 2 in the first section)
  2. use SHOW TRAIN 'model_name' as the syntax rule
  3. for now, only return the original training statement as a column; in the future, we may add other metadata as extra columns of the result
  4. do not add context to the extended syntax, since we are not yet sure what form future syntaxes will take; instead, use hand-coded checks to make sure the parsed statement is valid
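
For decision 4, the hand-coded check could look roughly like the sketch below: accept the split only if a SELECT prefix is followed by a TO clause, or if the extended statement stands alone at a statement boundary. The helper isSelect and the function itself are hypothetical, not SQLFlow's actual implementation.

```go
package parser

import "strings"

// isSelect is a hypothetical helper that reports whether the third-party
// parser recognized the prefix as a SELECT statement.
var isSelect func(head string) bool

// validSplit is the kind of hand-coded check decision 4 refers to: accept a
// combined parse only when a SELECT prefix is followed by a TO clause, or
// when the extended statement (currently only SHOW TRAIN) stands on its own.
func validSplit(head, extended string) bool {
	ext := strings.ToUpper(strings.TrimSpace(extended))
	if strings.TrimSpace(head) == "" {
		return strings.HasPrefix(ext, "SHOW TRAIN")
	}
	if isSelect(head) {
		return strings.HasPrefix(ext, "TO ")
	}
	return false
}
```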