spaceandtimelabs / sxt-proof-of-sql

Space and Time | Proof of SQL
Other
2.56k stars 91 forks source link

Replace `proof-of-sql-parser` with `sqlparser`. #235

Open JayWhite2357 opened 1 month ago

JayWhite2357 commented 1 month ago

Background and Motivation

Currently, we have an in-house parser that is built on the lalrpop parser-generator. This has been good while the supported syntax has been simple. However, as the supported syntax has grown, we need a more comprehensive parser.

The sqlparser crate is the parser used by DataFusion, which is part of the Arrow ecosystem. It is a feature-rich parser that ultimately will require less code maintenance. It is no_std compatible, so there should be no issues integrating it.

Changes Required

JayWhite2357 commented 1 month ago

@iajoiner may be able to provide a bit of guidance on this issue, however, there is no concrete plan for this. It would be wise to discuss this plan with @iajoiner and myself to ensure that things are heading in the right direction before building in earnest.

JayWhite2357 commented 1 month ago

/bounty $10000

algora-pbc[bot] commented 1 month ago

šŸ’Ž $10,000 bounty ā€¢ Space and Time

Steps to solve:

  1. Start working: (Optional) Comment /attempt #235 with your implementation plan. Note: we will only assign an issue if you include an implementation plan with a time estimate. Additionally, to be assigned an issue, you must have previously contributed to the project. You can still work on an issue and submit a PR without being assigned.
  2. Submit work: Create a pull request including /claim #235 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to spaceandtimelabs/sxt-proof-of-sql!

Add a bounty ā€¢ Share on socials

Attempt Started (GMT+0) Solution
šŸ”“ @varshith257 Oct 9, 2024, 1:50:03 PM WIP
šŸ”“ @animeshd9 Oct 9, 2024, 8:21:53 PM WIP
šŸŸ¢ @TomBebb Oct 15, 2024, 10:39:12 PM WIP
šŸŸ¢ @deependujha Oct 16, 2024, 12:17:06 PM WIP
varshith257 commented 1 month ago

/attempt #235

Algora profile Completed bounties Tech Active attempts Options
@varshith257 15 bounties from 7 projects
Go, Scala,
TypeScript & more
ļ¹Ÿ239
Cancel attempt
varshith257 commented 1 month ago

@JayWhite2357 I will connect on discord for the same

animeshd9 commented 1 month ago

/attempt #235

varshith257 commented 1 month ago

Hi @JayWhite2357 and @iajoiner,I'd like to proceed with the transition from the LALRPOP-based parser to sqlparser for the proof-of-sql-parser crate. I have gone through current implementation and working of proof-of-sql parser. Here's the approach I have in mind:

  • [ ] I will add the sqlparser crate to handle the parsing of SQL queries.

  • [ ] Then I move ahead to the replace the LALRPOP Parser

    • [ ] The LALRPOP-based parser defined in sql.lalrpop will be replaced by a new parsing function that utilizes sqlparser to parse
    • [ ] This will specifically focus on the SELECT statements following PostgreSQL syntax as it currently does(Select Statement Parsing, Expressions and Operators, Lexer Rules, ORDER/GROUP BY, Literals and Identifiers etc...)
  • [ ] ~Next I step to map the sqlparser AST to existing intermediate AST (intermediate_ast::SelectStatement and other structures) to minimize changes to the rest of the crate.This will ensure the current logic built on top of the AST stays intact~ Next use the sqlparser::ast structures entirely, phasing out the current intermediate AST.

  • [ ] I will thoroughly test the changes incorporate at every stage to ensure all current functionality is maintained.

  • [ ] I think we can also adapt the existing ParseError and ParseResult types to handle errors from sqlparser

Before moving forward with this plan in my mind, I wanted to check in and ensure this approach aligns with the migration goals. If there are any specific concerns or alternative suggestions, Iā€™d be happy to adjust the plan.

I am looking forward to your feedback!

JayWhite2357 commented 1 month ago

I'm looking into it a bit, and I feel like sqlparser::ast::Query is thing that would be analogous to intermediate_ast::SelectStatement.

My initial intent was for the sqlparser AST to replace the intermediate AST. This would mean that QueryExpr::try_new would accept a sqlparser::ast::Query (or similar) instead of a intermediate_ast::SelectStatement. In this situation, most (if not all) of the code changes would be in the proof_of_sql::sql::parse module.

Only replacing lalrpop with sqlparser, but not replacing intermediate AST is an interesting idea that I hadn't thought of. Perhaps it makes sense as a stepping stone, but I feel like it can't be the end goal here.

@iajoiner might have some feedback on this.

varshith257 commented 1 month ago

Thanks @JayWhite2357 for your view on this. @iajoiner Any insights on this

JayWhite2357 commented 1 month ago

Thanks @JayWhite2357 for your view on this. @iajoiner Any insights on this

I chatted with him. He's on the same page. The goal here should be to remove the intermediate AST altogether.

varshith257 commented 1 month ago

Got it! I just started with basic SELECT parsing logic to see how sqlparser works with an example.

@JayWhite2357 I have joined Discord. If we have a thread in a related channel on discord, we can easily track all progress and discuss more about it in moving forward.

iajoiner commented 1 month ago

@varshith257 I can chat with you on Discord. What's your handle there?

varshith257 commented 1 month ago

@varshith257 I can chat with you on Discord. What's your handle there?

@iajoiner I'd: vamshi_257

iajoiner commented 1 month ago

Cool! Just sent you a message there.

TomBebb commented 1 month ago

/attempt #235

deependujha commented 1 month ago

/attempt #235

varshith257 commented 1 month ago

@JayWhite2357 Is @iajoiner is on holiday? I am also thinking of connecting with you on discord :

Here's my ID: : vamshi_257

JayWhite2357 commented 1 month ago

@varshith257 we're all a bit swamped at the moment. I connected with you on discord as well.

JayWhite2357 commented 1 month ago

The main effort here should be adding a function here: https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/fd3a339fa97ae0e91f4a0c9b65c46924fcc07f7e/crates/proof-of-sql/src/sql/parse/query_expr.rs#L45 The function should look like this:

pub fn try_new_from_sqlparser(
    ast: sqlparser::ast::Query,
    default_schema: Identifier,
    schema_accessor: &dyn SchemaAccessor,
) -> ConversionResult<Self>;

This should be able to be added without changing any existing code. Once it is in place and well tested, we can rename it to try_new and drop the existing try_new method. After that, the proof-of-sql-parser crate will be unused and we can drop it along with related code.

There are probably some complexities here that I'm missing. It may not be as straightforward as this. (For example, since QueryExpr depends on Identifier, which is a proof-of-sql-parser type, we will need to move or replace it before dropping proof-of-sql-parser completely.)

iajoiner commented 1 month ago

@varshith257 @TomBebb @deependujha Some possible replacements

  • proof_of_sql_parser::Identifier -> sqlparser::ast::Ident
  • proof_of_sql_parser::ResourceId -> Vec<Ident> (we can move ResourceId into proof-of-sql and redefine it as [Ident; 2] if necessary)
  • proof_of_sql_parser::intermediate_ast::BinaryOperator -> sqlparser::ast::BinaryOperator
  • proof_of_sql_parser::intermediate_ast::UnaryOperator -> sqlparser::ast::UnaryOperator
  • proof_of_sql_parser::intermediate_ast::Expression -> sqlparser::ast::Expr
  • proof_of_sql_parser::SelectStatement -> sqlparser::ast::Query
iajoiner commented 1 month ago

A few more ideas:

  • proof_of_sql_parser::utility needs to be modified to use sqlparser structs instead.
  • For timestamp it requires a bit more work. What we need to aim at is most likely Timestamp(Option<u64>, TimezoneInfo). (@Dustin-Ray any insights?)
iajoiner commented 1 month ago

@varshith257 @TomBebb @deependujha Let's get this done ASAP and we will be more than willing to help unblock you guys if necessary. Please feel free to shoot me a message either here or on Discord (username: chloemargaret) if you guys have any questions.

varshith257 commented 1 month ago

Sure @iajoiner. Lets aim to complete this ASAP within a few weeks

JayWhite2357 commented 1 month ago

Just discussed this with @iajoiner some more. The plan should be to focus on adding the pub fn try_new_from_sqlparser function. Once this is added, we can make the appropriate replacements. This has the benefit of not being a breaking change and not requiring large amounts of refactorization at any point. Instead the swap can be done after the new sqlparser functionality is well-tested.

This should be able to be added without changing any existing code: specifically, the proof-of-sql-parser crate should not need to be modified until after the try_new_from_sqlparser function is completed.

varshith257 commented 1 month ago

@JayWhite2357 @iajoiner, I have DM you on Discord about a blocker. Have a look

iajoiner commented 2 weeks ago

@JayWhite2357 @animeshd9 @deependujha @TomBebb @varshith257 Proposed steps

  • [x] Add sqlparser.rs to the proof-of-sql-parser crate and adapt a SelectStatement to sqlparser::ast::Query. #338 The subtask has been taken by me.
  • [ ] Replace any reference to proof-of-sql-parser constructs in proof-of-sql crate with corresponding sqlparser constructs. Your work in the previous step should make conversion very easy. Note that sqlparser construct are mostly strict supersets of their corresponding proof-of-sql-parser constructs you need to add error handling for what PoSQL doesn't currently support. Here are some suggestions:
    • [x] proof_of_sql_parser::intermediate_ast::UnaryOperator -> sqlparser::ast::UnaryOperator #348
    • [ ] proof_of_sql_parser::intermediate_ast::BinaryOperator -> sqlparser::ast::BinaryOperator #349
    • [ ] proof_of_sql_parser::Identifier -> sqlparser::ast::Ident #350
    • [ ] proof_of_sql_parser::ResourceId -> Vec<Ident> / sqlparser::ast::ObjectName #352
    • [ ] proof_of_sql_parser::intermediate_ast::OrderBy -> sqlparser::ast::OrderByExpr #354
    • [ ] adapt proof_of_sql_parser::posql_time for sqlparser #351
    • [ ] proof_of_sql_parser::intermediate_ast::{Literal, Expression} -> sqlparser::ast::Expr #353
    • [ ] proof_of_sql_parser::SelectStatement -> sqlparser::ast::Query #355 When this process is complete you will naturally have this as QueryExpr::try_new:
      pub fn try_new(
      ast: sqlparser::ast::Query,
      default_schema: sqlparser::ast::Ident,
      schema_accessor: &dyn SchemaAccessor,
      ) -> ConversionResult<Self>; 

      There should be no construct of proof-of-sql-parser still used in the proof-of-sql crate any more.

  • [ ] Remove the workspace, the proof-of-sql-parser crate and simplify the repo to be a single crate: the proof-of-sql crate. #356
iajoiner commented 2 weeks ago

@animeshd9 @TomBebb @varshith257 Since we need proof-of-sql-parser -> sqlparser adaptation to make sure we don't add additional use cases of proof-of-sql-parser constructs in proof-of-sql crate (in fact #334 already adds additional use cases which complicates this issue) I decided to do the first step myself.

The rest of the subtasks will be added as separated issues linked back here with separate bounties. We will set up deadlines so that you guys know when each task has to be completed before the bounty period ends and we start to work on the task internally. We will usually give a heads up at least a week prior to the deadline for each subtask.