pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0

Classic TiSpark Sunset Project #2779

Open sunxiaoguang opened 7 months ago

sunxiaoguang commented 7 months ago

Problem:

In recent years, TiSpark has served as an indispensable component of the TiDB ecosystem, enabling seamless integration between Spark and TiDB. However, after careful consideration and evaluation, we have come to the difficult decision to deprecate the TiSpark Classic Architecture. This decision is driven by several important factors that we believe will ultimately benefit our users and streamline our development efforts. Let us delve into the reasons behind this initiative.

By deprecating the TiSpark Classic Architecture, we can refocus our development efforts on a more robust and efficient framework. This strategic shift will allow us to allocate resources to projects that offer greater value and enhanced performance, benefiting the entire TiDB community.

In conclusion, while we acknowledge the significant impact of deprecating the TiSpark Classic Architecture, this decision is driven by our commitment to delivering the best possible experience to TiDB users. We remain dedicated to maintaining forward compatibility, promoting innovation, and enhancing the efficiency of our products and services. Together, we believe we can pave the way for future advancements, enabling users to leverage the true power of the TiDB ecosystem.

Goals

By achieving these objectives, we aim to create a more reliable, user-friendly, and maintainable TiSpark product, ultimately enhancing the overall experience for both new and existing users.

Architecture

This section discusses the new architectural design for the next generation TiSpark. The focus of this design is to improve performance, overcome data skew issues, handle large transactions effectively, and enable easy integration with fully managed TiDB services in the cloud.

The new architecture will be based on generic Spark JDBC support. By leveraging Spark's JDBC capabilities, we can tap into its powerful data processing and analytical functionalities.

To address the data skew issue commonly observed in Spark JDBC data sources, we introduce a new SQL statement, SHOW TABLE SPLITS, which retrieves the primary key ranges for a table. Using these ranges as task boundaries ensures that the data is evenly partitioned at the region level, typically 128MB each. With this fine-grained and balanced task definition, Spark can offer higher concurrency based on the user's preferences while avoiding data skew.
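
The range-based partitioning described above can be sketched as follows. This is a minimal illustration, not the TiSpark implementation: the output format of SHOW TABLE SPLITS is not specified in this issue, so the sketch assumes it can be reduced to a sorted list of integer primary-key boundary values. The function name `split_predicates` and the column name `id` are hypothetical.

```python
from typing import List, Optional


def split_predicates(pk_column: str, boundaries: List[int]) -> List[str]:
    """Turn sorted split boundaries into non-overlapping WHERE predicates.

    n boundaries yield n + 1 half-open ranges that together cover every key.
    Assumption: the primary key is a single non-nullable integer column.
    """
    if not boundaries:
        return ["1 = 1"]  # no splits known: one task scans the whole table
    if any(a >= b for a, b in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be strictly increasing")

    # None marks an open end: the first range has no lower bound,
    # the last range has no upper bound.
    edges: List[Optional[int]] = [None, *boundaries, None]
    predicates = []
    for lower, upper in zip(edges, edges[1:]):
        parts = []
        if lower is not None:
            parts.append(f"{pk_column} >= {lower}")
        if upper is not None:
            parts.append(f"{pk_column} < {upper}")
        predicates.append(" AND ".join(parts))
    return predicates


if __name__ == "__main__":
    # Hypothetical boundaries, as if derived from SHOW TABLE SPLITS.
    for predicate in split_predicates("id", [1000, 2000, 3000]):
        print(predicate)
```

A list like this could be passed as the `predicates` argument of PySpark's `spark.read.jdbc(url, table, predicates=..., properties=...)`, where each predicate becomes one Spark partition. Whether the next generation TiSpark wires the splits in this way is not stated in this issue.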

To overcome the limitations on large transactions when accessing TiDB through JDBC, we intentionally break large transactions down into smaller units of work and commit them individually. Although this approach is slower than the Classic TiSpark architecture, which bypasses TiDB, ensuring data integrity and correctness is more important. TiSpark has faced issues in the past that caused data corruption, so prioritizing data integrity over risky large transaction support is crucial.
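
The idea of committing a large write as many small transactions can be sketched as follows. This is an illustration under stated assumptions rather than the TiSpark write path: Python's built-in `sqlite3` module stands in for a TiDB connection so the example runs anywhere, and the table `t`, the batch size, and the helper names are hypothetical. A MySQL-protocol driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_in_batches(conn: sqlite3.Connection, rows: Iterable[Row], size: int) -> int:
    """Insert rows, committing after every batch. Returns the commit count."""
    commits = 0
    for batch in batches(rows, size):
        conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)", batch)
        conn.commit()  # each batch is its own small transaction
        commits += 1
    return commits


def demo() -> Tuple[int, int]:
    conn = sqlite3.connect(":memory:")  # stand-in for a TiDB connection
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    rows = ((i, f"row-{i}") for i in range(2500))
    commits = write_in_batches(conn, rows, size=1000)
    (count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    conn.close()
    return commits, count


if __name__ == "__main__":
    print(demo())
```

One consequence of this design is that the job as a whole is no longer atomic: if a later batch fails, earlier batches stay committed. Writers built this way generally need to be safe to retry, for example by making each batch idempotent.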

Furthermore, with this new architecture design, projections and predicates can be easily pushed down to the JDBC layer. Filtering and column pruning then run in TiDB, close to the data, so each task transfers only the rows and columns the query needs.
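
To make the effect of pushdown concrete, the sketch below composes the kind of SELECT statement a single scan task could send to the database once the required columns and filters are known. It is an illustration only and does not reproduce Spark's internal query generation. The function `build_scan_sql`, the table `orders`, and its columns are hypothetical.

```python
from typing import List, Optional, Sequence


def build_scan_sql(
    table: str,
    columns: Sequence[str],
    filters: Sequence[str] = (),
    partition_predicate: Optional[str] = None,
) -> str:
    """Compose the SELECT a single scan task could send to the database."""
    # Projection pushdown: request only the columns the query needs.
    # An empty projection (e.g. a row count) still needs something to select.
    column_list = ", ".join(columns) if columns else "1"

    # Predicate pushdown: filters and the task's key range are combined,
    # each wrapped in parentheses so operator precedence stays intact.
    conditions: List[str] = [f"({f})" for f in filters]
    if partition_predicate:
        conditions.append(f"({partition_predicate})")

    where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
    return f"SELECT {column_list} FROM {table}{where}"


if __name__ == "__main__":
    print(
        build_scan_sql(
            "orders",
            ["id", "amount"],
            ["status = 'paid'"],
            "id >= 1000 AND id < 2000",
        )
    )
```

With pushdown, the database returns only the `id` and `amount` values of matching rows in the task's key range, instead of shipping whole rows to Spark for filtering.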

Lastly, accessing TiDB directly through this new design makes it possible to seamlessly work with fully managed TiDB services in the cloud. Users can take advantage of TiDB's scalability, reliability, and automated management features, making it an ideal choice for cloud-based deployments.

Overall, the new architecture design for TiSpark aims to improve performance, overcome data skew issues, handle large transactions effectively, and provide easy integration with fully managed TiDB services in the cloud.

Migration checklist

This section provides a migration checklist for end users who are transitioning from the Classic TiSpark architecture to the new JDBC-based TiSpark architecture. Several factors need to be considered to make the transition smooth and successful.

  1. Review System Requirements: Assess the system requirements for the new JDBC-based TiSpark architecture. Since TiSpark now loads large volumes of data through TiDB servers, users should either add more TiDB servers or ensure that the current setup has sufficient reserved capacity. Alternatively, users can set up separate TiDB servers exclusively for TiSpark, which isolates the ELT workload from online transactional workloads and gives the transactional workloads better protection.

  2. Perform Compatibility Testing: The JDBC data source is a core component that ships with the Spark distribution, which improves compatibility. Nevertheless, it is strongly advised to perform comprehensive compatibility testing to ensure that existing applications and components integrate cleanly with the new JDBC-based TiSpark architecture. It is crucial to verify that existing Spark jobs, SQL queries, and data processing pipelines operate as expected in the new environment.

  3. Performance Testing and Optimization: Conduct thorough performance testing and optimization of the new JDBC-based TiSpark. Compare the performance metrics with the Classic TiSpark to identify significant performance improvements or regressions. Fine-tune the new architecture to optimize resource utilization, data processing efficiency, and query performance.

By following this migration checklist, users can transition from the Classic TiSpark to the new architecture smoothly, minimizing any potential disruptions while simultaneously enjoying improved performance, reliability, and overall system efficiency.

Conclusion

The TiSpark sunset project serves as a pivotal step towards enhancing the TiSpark product's maintainability, reliability, and user experience. By adhering to the objectives, scope, timeline, and key responsibilities mentioned in this document, we aim to transition seamlessly from the current generation TiSpark to a more efficient and feature-rich successor product.

FAQ

References

Pros and Cons of Spark with Generic JDBC

Pros

Pros and Cons of Classic TiSpark

Pros

Pros and Cons of Next Generation TiSpark powered by JDBC

Pros

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.