Proposal: Managing external & internal parsers

tonyyang-svail commented 4 years ago

Problem

SQLFlow calls Java HiveQL/ODPS parser to help to parse the SQL program. As shown in the following dependency graph, each parser is wrapped as a gRPC server which takes a string and returns the parsed result.

test3

However, some parsers like ODPS parser can't be open-sourced. This leads to the circular dependencies between the internal code and open-sourced code, i.e. the open-sourced Java gRPC server needs to call ODPS parser while the close-sourced ODPS parser needs to return ParseResult.

test

Solution

We remove the dependencies from gRPC server to ODPS parser via dynamic loading. To be specific, the open-sourced code creates ODPS parser instance by the following.

Object parser = Class.forName("org.sqlflow.parser.internal.odpsParser....").newInstance();  
((ParserInterface)parser).parse(sql);

By doing so, we only have a one-way dependency from internal repo to the GitHub repo.

test2

Implementation Details

GitHub repo:

Each parser should be an extension of a ParserInterface.

interface ParseInterface {
ParseResult parse(String sql);
}

Internal repo:

The building process starts from building the GitHub repo into a .jar file then add it to the Maven project.
The deployment of the internal .jar file should be the same as the open-sourced .jar file, since they share the same entry point.

cc @typhoonzero @weiguoz

typhoonzero commented 4 years ago

This is definitely needed. Each parser would be one independent jar file to the gRPC service.

weiguoz commented 4 years ago

Totally agree. Also, we need to deploy the interface to a maven repository, such as: mvnrepository.com or some private repository.

tonyyang-svail commented 4 years ago

Following the comments, I updated the design as the following.

test4

The parserInterface is used (required) by gRPC parser server.
The parserInterface is realized (implemented) by each SQL dialect parser.
The parserInterface is released to Maven central repository. guide
Each SQL dialect parser imports the parserInterface and forms a standalone .jar package.
The gRPC server imports the parserInterface and forms a standalone .jar package.
During deployment, the gRPC server package creates parser instances via dynamic loading. This can be done by setting parser's classpath (e.g. org.sqlflow.parser.internal.odpsParser) in the command line argument and including the parser .jar file in the CLASSPATH. We can also make the server takes multiple parser classpaths to deploy multiple parsers in one server.

TODO list:

[x] Refactor codebase to use parserInterface.
[x] Package the parser code and gRPC code into separate .jar file, and enable dynamic loading.
[x] Release parserInterface package to remote repository and make code base to depend on the remote package.

tonyyang-svail commented 4 years ago

I am following this guide to release SQLFlow Java code to Central Maven Repository:

[x] Create a Jira Account and a Ticket for your Group Id. Ticket: https://issues.sonatype.org/browse/OSSRH-55437
[x] Create a GPG Key and Sign your Artifacts
[x] Update your Pom and Publish your Artifacts

tonyyang-svail commented 4 years ago

The Java gRPC server loads the parser via dynamic loading (implementation ref). To make the loading configurable, we need to pass it as a parameter, which requires the following fields.

The jar file path, e.g. /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar.
The class name within the jar file, e.g. org.sqlflow.parser.calcite.CalciteParserAdaptor
The dialect name associated with the parser, e.g. calcite.

I am proposing configuring it as a -l flag (l for loading). And the format is

{jar_file_path}/{dialect}/{class_name}

For example, /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar/calcite/org.sqlflow.parser.calcite.CalciteParserAdaptor.

Also, we can support configuring on loading multiple .jar files by separating each config by :, i.e.

{config1}:{config2}:...

@typhoonzero @weiguoz @Yancey1989 Please let me know if you are happy with this format. Any suggestions are welcomed.

typhoonzero commented 4 years ago

Well, I suggest use a more straight forward method:

Add one -l flag configuring the directory of jar files, like -l /opt/sqlflow/parser/adapters, we can load all the jar files under this directory
Add one -p flag configuring all parser classes to load, like org.sqlflow.parser.calcite.CalciteParserAdaptor, we can get the dialect name from the class name then.

weiguoz commented 4 years ago

The Java gRPC server loads the parser via dynamic loading (implementation ref). To make the loading configurable, we need to pass it as a parameter, which requires the following fields.

The jar file path, e.g. /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar.

The class name within the jar file, e.g. org.sqlflow.parser.calcite.CalciteParserAdaptor

The dialect name associated with the parser, e.g. calcite.

I am proposing configuring it as a -l flag (l for loading). And the format is
{jar_file_path}/{dialect}/{class_name}
For example, /opt/sqlflow/parser/parser-calcite-0.0.1-dev-jar-with-dependencies.jar/calcite/org.sqlflow.parser.calcite.CalciteParserAdaptor.

Also, we can support configuring on loading multiple .jar files by separating each config by :, i.e.
{config1}:{config2}:...
@typhoonzero @weiguoz @Yancey1989 Please let me know if you are happy with this format. Any suggestions are welcomed.

It looks good. Meanwhile, I think @typhoonzero 's straight forward method is simpler.

weiguoz commented 4 years ago

Shall we separate the ClassLoader(..)&loadClass(..) from parse()? https://github.com/sql-machine-learning/sqlflow/blob/dbfbf655bfd63cfed44ae4b4559fe21529efb015/java/parser/src/main/java/org/sqlflow/parser/ParserGrpcServer.java#L95-L100

tonyyang-svail commented 4 years ago

Shall we separate the ClassLoader(..)&loadClass(..) from parse()?

Agree. ClassLoader should be called only once during the initialization of the server.

tonyyang-svail commented 4 years ago

Instead of uploading the interface to the remote, we can simply install it locally.

sql-machine-learning / sqlflow