Classic TiSpark Sunset Project

Problem

In recent years, TiSpark has been an important component of the TiDB ecosystem, enabling integration between Spark and TiDB. However, after careful evaluation, we have made the difficult decision to deprecate the TiSpark Classic Architecture. Several factors drive this decision, and we believe it will ultimately benefit our users and streamline our development efforts. The reasons are as follows.

Complexity of Interfacing with TiDB Components: TiSpark interacts with various core components of the TiDB ecosystem, including PD (Placement Driver), TiKV (Key-Value Store), and TiFlash (Columnar Storage Engine). These components employ complex protocols that demand tight client-server coordination and intricate client-side logic to ensure correctness. While TiSpark as a standalone component has not seen active development, keeping pace with newer versions of PD, TiKV, and TiFlash introduces fragility and makes it challenging to maintain compatibility and reliability.
Challenges in Managed TiDB Services: The decision to bypass TiDB and establish direct connections with PD, TiKV, and TiFlash has unintentionally created difficulties when integrating TiSpark with managed TiDB services. By deprecating the classic architecture, we aim to simplify the integration process and enhance compatibility with managed TiDB services, thereby enabling a more seamless experience for our users.

By deprecating the TiSpark Classic Architecture, we can refocus our development efforts on a more robust and efficient framework. This strategic shift will allow us to allocate resources to projects that offer greater value and enhanced performance, benefiting the entire TiDB community.

In conclusion, while we acknowledge the significant impact of deprecating the TiSpark Classic Architecture, this decision is driven by our commitment to delivering the best possible experience to TiDB users. We remain dedicated to maintaining forward compatibility, promoting innovation, and enhancing the efficiency of our products and services. Together, we believe we can pave the way for future advancements, enabling users to leverage the true power of the TiDB ecosystem.

Goals

Improve Reliability and Maintainability: The primary objective of deprecating the TiSpark classic architecture is to enhance the overall reliability and maintainability of the TiSpark product. By streamlining the architecture and removing the logic that bypasses TiDB, we can take advantage of the mature distributed TiDB service. This move aims to identify and eliminate potential failure points, reducing the occurrence of system crashes and improving the platform's stability. As a result, we will deliver a more reliable and efficient user experience, ensuring that TiSpark performs well in demanding production environments.
Improve User Experience: One of the main goals in deprecating the TiSpark classic architecture is to enhance the overall user experience. We plan to achieve this by simplifying the configuration process and eliminating complicated network topology prerequisites. Our aim is to let users set up and deploy TiSpark just like any other Spark data source, without additional hassle. By removing unnecessary complexity, we can make it easier for newcomers to get started with TiSpark and promote its adoption.
Streamline Architecture and Reduce Maintenance Efforts: The deprecation of the classic architecture also aims to streamline the overall architecture of TiSpark. By removing outdated components and simplifying the system design, we can reduce the maintenance efforts required to keep TiSpark up and running. This will free up valuable resources and allow the team to focus on the most important enhancements, ultimately benefiting both the product and the end-users.
Ensure Seamless Transition: Ensuring a seamless transition from the current generation of TiSpark to the successor product is a crucial objective. We understand the importance of minimizing disruptions and inconveniences for our users during this migration process. By providing clear documentation, proactive support, and backward compatibility where possible, we aim to make the transition as smooth as possible for our existing TiSpark users, so that they can adopt the new architecture without major hassle.

By achieving these objectives, we aim to create a more reliable, user-friendly, and maintainable TiSpark product, ultimately enhancing the overall experience for both new and existing users.

Architecture

This section discusses the new architectural design for the next generation TiSpark. The focus of this design is to improve performance, overcome data skew issues, handle large transactions effectively, and enable easy integration with fully managed TiDB services in the cloud.

The new architecture will be based on generic Spark JDBC support. By leveraging Spark's JDBC capabilities, we can tap into its powerful data processing and analytical functionalities.
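
As an illustration of what "generic Spark JDBC support" means in practice, the following minimal sketch builds the option map that Spark's built-in JDBC data source accepts. It assumes the MySQL Connector/J driver is used, since TiDB speaks the MySQL wire protocol; the host name, port, database, table, and credentials are hypothetical placeholders, and the next generation TiSpark may package this differently.

```python
def tidb_jdbc_options(host, port, database, table, user, password):
    """Build options for Spark's built-in JDBC data source pointing at a TiDB server."""
    return {
        # TiDB is MySQL-protocol compatible, so a MySQL JDBC URL is assumed here.
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        # Assumption: MySQL Connector/J is on the Spark classpath.
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Hypothetical connection details, for illustration only.
options = tidb_jdbc_options(
    "tidb.example.internal", 4000, "test", "orders", "spark_reader", "secret"
)
print(options["url"])

# With PySpark installed, the options could be passed to the reader as:
#   df = spark.read.format("jdbc").options(**options).load()
```

Because only a TiDB SQL endpoint is needed, the Spark cluster requires no network access to PD or TiKV.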

To address the data skew issue commonly observed in Spark JDBC data sources, we introduce a new SQL statement, SHOW TABLE SPLITS, to retrieve the primary key ranges for the table. This ensures that the data is evenly partitioned at the region level, typically 128MB each. With this fine-grained and balanced task definition, Spark can offer higher concurrency based on the user's preferences, while avoiding data skew issues.
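
The sketch below shows one way primary key ranges could be turned into Spark partition predicates. The output format of SHOW TABLE SPLITS is not specified here, so the example assumes each split is a (start, end) pair on an integer primary key, with None marking an unbounded side; the function name and the sample boundaries are hypothetical.

```python
def splits_to_predicates(pk_column, splits):
    """Turn (start, end) primary-key ranges into non-overlapping WHERE predicates.

    Each range is half-open: start is inclusive, end is exclusive.
    None means the range is unbounded on that side.
    """
    predicates = []
    for start, end in splits:
        parts = []
        if start is not None:
            parts.append(f"{pk_column} >= {int(start)}")
        if end is not None:
            parts.append(f"{pk_column} < {int(end)}")
        # A single split that is unbounded on both sides covers the whole table.
        predicates.append(" AND ".join(parts) if parts else "1 = 1")
    return predicates


# Hypothetical split boundaries, standing in for the result of SHOW TABLE SPLITS.
example_splits = [(None, 1000), (1000, 2000), (2000, None)]
predicates = splits_to_predicates("id", example_splits)
for predicate in predicates:
    print(predicate)
```

In PySpark, a list like this can be passed as the predicates argument of spark.read.jdbc, where each predicate defines one partition of the resulting DataFrame. Half-open ranges keep the partitions disjoint, so no row is read twice.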

To overcome the limitations of large transactions when accessing TiDB with JDBC, we intentionally break down large transactions into smaller units of work and commit them individually. Although this approach is slower than the Classic TiSpark architecture, which bypasses TiDB, ensuring data integrity and correctness is more important. In the past, TiSpark faced issues that caused data corruption, so prioritizing data integrity over risky large transaction support is crucial.
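
A minimal sketch of the batch-and-commit idea follows. It uses Python's built-in sqlite3 module purely as a stand-in for a database connection so that the example is self-contained; the table name, the batch size, and the write_in_batches helper are hypothetical and do not reflect the actual TiSpark implementation.

```python
import sqlite3
from itertools import islice


def write_in_batches(conn, rows, batch_size):
    """Insert rows in fixed-size batches, committing each batch as its own transaction."""
    iterator = iter(rows)
    commits = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO t (id, v) VALUES (?, ?)", batch)
        conn.commit()  # each batch becomes a small, independent transaction
        commits += 1
    return commits


# sqlite3 stands in for the target database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.commit()

commits = write_in_batches(conn, ((i, f"row-{i}") for i in range(10)), batch_size=4)
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(commits, count)
```

One consequence of committing each batch separately is that the write as a whole is no longer atomic: if a job fails midway, the batches committed before the failure remain in the table.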

Furthermore, with this new architecture design, projections and predicates can be easily pushed down to the JDBC layer. This allows for more efficient query execution by pushing down filtering and column projection operations closer to the data source.
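
To make the pushdown concrete, the sketch below composes the kind of SQL statement a JDBC source could send once column pruning and filters have been pushed down. The build_pushdown_query helper, the table, and the filter expressions are hypothetical; Spark generates such queries internally, and its exact output may differ.

```python
def build_pushdown_query(table, columns, filters):
    """Compose a SELECT statement with the projection and filters pushed to the database."""
    select_list = ", ".join(columns) if columns else "*"
    sql = f"SELECT {select_list} FROM {table}"
    if filters:
        # Parenthesize each filter so that combining them with AND is unambiguous.
        sql += " WHERE " + " AND ".join(f"({f})" for f in filters)
    return sql


query = build_pushdown_query(
    "orders", ["id", "amount"], ["amount > 100", "status = 'paid'"]
)
print(query)
```

With a query of this shape, TiDB evaluates the filters and returns only the requested columns, so less data crosses the network to Spark.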

Lastly, accessing TiDB directly through this new design makes it possible to seamlessly work with fully managed TiDB services in the cloud. Users can take advantage of TiDB's scalability, reliability, and automated management features, making it an ideal choice for cloud-based deployments.

Overall, the new architecture design for TiSpark aims to improve performance, overcome data skew issues, handle large transactions effectively, and provide easy integration with fully managed TiDB services in the cloud.

Migration checklist

This section provides a migration checklist for end users who are transitioning from the Classic TiSpark architecture to the new JDBC-based TiSpark architecture. The migration involves several considerations that help ensure a seamless and successful transition.

Review System Requirements: Assess the system requirements for the new JDBC-based TiSpark architecture. Since TiSpark uses TiDB servers to load large volumes of data, users should either add more TiDB servers or ensure that the current setup has sufficient reserved capacity. Alternatively, users could set up separate TiDB servers exclusively for TiSpark to isolate the ELT workload from online transactional workloads, providing better protection for transactional workloads.
Perform Compatibility Testing: The JDBC data source is built into the Spark distribution, which improves compatibility. Nevertheless, it is strongly advised to perform comprehensive compatibility testing to ensure that existing applications and components integrate with the new JDBC-based TiSpark architecture. It is crucial to verify that existing Spark jobs, SQL queries, and data processing pipelines operate as expected in this new environment.
Performance Testing and Optimization: Conduct thorough performance testing and optimization of the new JDBC-based TiSpark. Compare the performance metrics with the Classic TiSpark to identify significant performance improvements or regressions. Fine-tune the new architecture to optimize resource utilization, data processing efficiency, and query performance.
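
As one possible aid for the compatibility testing step, the sketch below compares the rows returned by the same query on both architectures, ignoring row order. The diff_result_sets helper and the sample rows are hypothetical; in practice the rows would be collected from Classic TiSpark and from the JDBC-based TiSpark.

```python
from collections import Counter


def diff_result_sets(classic_rows, jdbc_rows):
    """Compare two query results as multisets, ignoring row order."""
    classic, jdbc = Counter(classic_rows), Counter(jdbc_rows)
    return {
        "missing_in_jdbc": sorted((classic - jdbc).elements()),
        "extra_in_jdbc": sorted((jdbc - classic).elements()),
    }


# Hypothetical results of the same query on each architecture.
classic_rows = [(1, "a"), (2, "b"), (3, "c")]
jdbc_rows = [(3, "c"), (1, "a"), (2, "x")]

report = diff_result_sets(classic_rows, jdbc_rows)
print(report)
```

Comparing results as multisets rather than sets also catches duplicated or dropped rows, which a plain set comparison would miss.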

By following this migration checklist, users can transition from the Classic TiSpark to the new architecture smoothly, minimizing any potential disruptions while simultaneously enjoying improved performance, reliability, and overall system efficiency.

Conclusion

The TiSpark sunset project serves as a pivotal step towards enhancing the TiSpark product's maintainability, reliability, and user experience. By adhering to the objectives, scope, timeline, and key responsibilities mentioned in this document, we aim to seamlessly transition from the current generation of TiSpark to a more efficient and feature-rich successor product.

FAQ

What level of support do we have for Classic TiSpark? Existing releases of Classic TiSpark are supported until the EOL of the corresponding TiDB releases. Existing users are not affected by product evolution. However, support for newer TiDB versions will only be available on the new architecture.
How does the next generation TiSpark utilize resources with the existing resource manager? When the network policy still allows TiSpark to communicate directly with PD and TiKV servers, TiDB is configured in embedded mode, which runs as a subprocess of TiSpark executors. There is no need to scale out the existing TiDB cluster.
How many additional resources are required? The estimated additional cost would be less than 20%. The final report on resource consumption will be included in the release notes for the relevant TiSpark release.
How to use PD follower read with the next generation TiSpark? To use PD follower read with TiSpark, you must first upgrade the source TiDB cluster to the earliest LTS version that has PD follower read support. If the production cluster is experiencing PD scalability issues, upgrading is necessary to mitigate the issue.

References

Pros and Cons of Spark with Generic JDBC

Pros

Universal, built-in support for virtually all database products
Nearly all databases have a Type 4 JDBC driver, which is implemented entirely in Java
Cons
ELT traffic may affect the performance of online transactional workload
Data skew is inevitable without tuning code based on the table schema and data distribution.

Pros and Cons of Classic TiSpark

Pros

The resource manager allocates all the necessary compute resources which are best suited for the big data scenarios.
There is no data skew under any circumstances.
Cons
TiSpark needs to be deployed within the same network where TiKV and PD reside and should be granted access to the entire cluster in terms of access control.
The implementation of TiSpark's transactional logic is closely coupled with TiDB. A version mismatch between TiSpark and TiDB could result in serious issues and must be avoided at all costs.

Pros and Cons of Next Generation TiSpark powered by JDBC

Pros

The resource manager allocates all the necessary compute resources which are best suited for the big data scenarios.
There is no data skew under any circumstances.
TiSpark is a client of TiDB that works like any regular TiDB application. PD and TiKV can be protected by VPCs or network policies, making them safer and more compliant with security policies.
TiDB handles the correctness of transaction semantics. Any enhancement that becomes available to TiDB is immediately available to TiSpark.
Cons
The process of transferring data between TiDB and TiSpark consumes extra compute resources.
