sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0
5.09k stars 698 forks source link

Proposal: Managing external & internal parsers #1904

Closed tonyyang-svail closed 4 years ago

tonyyang-svail commented 4 years ago

Problem

SQLFlow calls Java HiveQL/ODPS parser to help to parse the SQL program. As shown in the following dependency graph, each parser is wrapped as a gRPC server which takes a string and returns the parsed result.

test3

However, some parsers like ODPS parser can't be open-sourced. This leads to the circular dependencies between the internal code and open-sourced code, i.e. the open-sourced Java gRPC server needs to call ODPS parser while the close-sourced ODPS parser needs to return ParseResult.

test

Solution

We remove the dependencies from gRPC server to ODPS parser via dynamic loading. To be specific, the open-sourced code creates ODPS parser instance by the following.

Object parser = Class.forName("org.sqlflow.parser.internal.odpsParser....").newInstance();  
((ParserInterface)parser).parse(sql);  

By doing so, we only have a one-way dependency from internal repo to the GitHub repo.

test2

Implementation Details

GitHub repo:

  1. Each parser should be an extension of a ParserInterface.
    interface ParseInterface {
    ParseResult parse(String sql);
    }

Internal repo:

  1. The building process starts from building the GitHub repo into a .jar file then add it to the Maven project.
  2. The deployment of the internal .jar file should be the same as the open-sourced .jar file, since they share the same entry point.

cc @typhoonzero @weiguoz

typhoonzero commented 4 years ago

This is definitely needed. Each parser would be one independent jar file to the gRPC service.

weiguoz commented 4 years ago

Totally agree. Also, we need to deploy the interface to a maven repository, such as: mvnrepository.com or some private repository.

tonyyang-svail commented 4 years ago

Following the comments, I updated the design as the following.

test4

  1. The parserInterface is used (required) by gRPC parser server.
  2. The parserInterface is realized (implemented) by each SQL dialect parser.
  3. The parserInterface is released to Maven central repository. guide
  4. Each SQL dialect parser imports the parserInterface and forms a standalone .jar package.
  5. The gRPC server imports the parserInterface and forms a standalone .jar package.
  6. During deployment, the gRPC server package creates parser instances via dynamic loading. This can be done by setting parser's classpath (e.g. org.sqlflow.parser.internal.odpsParser) in the command line argument and including the parser .jar file in the CLASSPATH. We can also make the server takes multiple parser classpaths to deploy multiple parsers in one server.

TODO list:

tonyyang-svail commented 4 years ago

I am following this guide to release SQLFlow Java code to Central Maven Repository:

tonyyang-svail commented 4 years ago

The Java gRPC server loads the parser via dynamic loading (implementation ref). To make the loading configurable, we need to pass it as a parameter, which requires the following fields.

I am proposing configuring it as a -l flag (l for loading). And the format is

{jar_file_path}/{dialect}/{class_name}

For example, /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar/calcite/org.sqlflow.parser.calcite.CalciteParserAdaptor.

Also, we can support configuring on loading multiple .jar files by separating each config by :, i.e.

{config1}:{config2}:...

@typhoonzero @weiguoz @Yancey1989 Please let me know if you are happy with this format. Any suggestions are welcomed.

typhoonzero commented 4 years ago

Well, I suggest use a more straight forward method:

  1. Add one -l flag configuring the directory of jar files, like -l /opt/sqlflow/parser/adapters, we can load all the jar files under this directory
  2. Add one -p flag configuring all parser classes to load, like org.sqlflow.parser.calcite.CalciteParserAdaptor, we can get the dialect name from the class name then.
weiguoz commented 4 years ago

The Java gRPC server loads the parser via dynamic loading (implementation ref). To make the loading configurable, we need to pass it as a parameter, which requires the following fields.

  • The jar file path, e.g. /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar.
  • The class name within the jar file, e.g. org.sqlflow.parser.calcite.CalciteParserAdaptor
  • The dialect name associated with the parser, e.g. calcite.

I am proposing configuring it as a -l flag (l for loading). And the format is

{jar_file_path}/{dialect}/{class_name}

For example, /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar/calcite/org.sqlflow.parser.calcite.CalciteParserAdaptor.

Also, we can support configuring on loading multiple .jar files by separating each config by :, i.e.

{config1}:{config2}:...

@typhoonzero @weiguoz @Yancey1989 Please let me know if you are happy with this format. Any suggestions are welcomed.

It looks good. Meanwhile, I think @typhoonzero 's straight forward method is simpler.

weiguoz commented 4 years ago

Shall we separate the ClassLoader(..)&loadClass(..) from parse()? https://github.com/sql-machine-learning/sqlflow/blob/dbfbf655bfd63cfed44ae4b4559fe21529efb015/java/parser/src/main/java/org/sqlflow/parser/ParserGrpcServer.java#L95-L100

tonyyang-svail commented 4 years ago

Shall we separate the ClassLoader(..)&loadClass(..) from parse()?

Agree. ClassLoader should be called only once during the initialization of the server.

tonyyang-svail commented 4 years ago

Instead of uploading the interface to the remote, we can simply install it locally.