vertica / spark-connector

This component acts as a bridge between Spark and Vertica, allowing the user to either retrieve data from Vertica for processing in Spark, or store processed data from Spark into Vertica.

[ENHANCEMENT] Support for loading multiple tables #557

Open williammatherjones opened 7 months ago

williammatherjones commented 7 months ago

Environment


Problem Description


The Spark connector instantiates only one JDBC connection. When one table completes its data load, the connector closes that connection. Because the JDBC connection is defined as a singleton in the code, this also cuts off other connections needed for clerical tasks such as table/column definition checks. To support this workload, the connector needs to be enhanced to handle multiple threads; a sketch of the problematic pattern and one possible alternative follows.
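For context, the connector is written in Scala. The following is a minimal, hypothetical sketch, not the connector's actual internals (the names `SingletonJdbc` and `PerThreadJdbc` are made up), contrasting a process-wide singleton connection, which the first completed load closes for everyone, with a per-thread alternative that would tolerate concurrent table loads:

```scala
import java.sql.{Connection, DriverManager}

// Hypothetical sketch of the problematic pattern: one shared JDBC
// connection for the whole JVM. When the first table load finishes
// and closes it, any concurrent "clerical" query (e.g. a table or
// column definition check) fails on the now-closed connection.
object SingletonJdbc {
  private var conn: Connection = _

  def get(url: String, user: String, pass: String): Connection = synchronized {
    if (conn == null) conn = DriverManager.getConnection(url, user, pass)
    conn
  }

  def close(): Unit = synchronized {
    if (conn != null) { conn.close(); conn = null }
  }
}

// One possible thread-safe alternative: a connection per load thread,
// so closing one table's connection cannot affect another table's load.
object PerThreadJdbc {
  private val conn = new ThreadLocal[Connection]

  def get(url: String, user: String, pass: String): Connection = {
    if (conn.get() == null || conn.get().isClosed)
      conn.set(DriverManager.getConnection(url, user, pass))
    conn.get()
  }

  def close(): Unit = {
    val c = conn.get()
    if (c != null) { c.close(); conn.remove() }
  }
}
```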

Steps to reproduce:

Here is what we understood about how the customer's job runs:

  1. Kafka writes data into files in AWS S3.
  2. The customer's code is submitted to a Spark shell, which reads these files from S3 and performs a few transformations.
  3. The transformed data is then written to Vertica using the Vertica Spark connector.

The customer's code can run the load and transform for multiple tables, and they report that they did not hit this issue with Vertica's legacy Spark connector (on Vertica 9.1.x). An illustrative reproduction sketch is included under "Code sample" below.

Expected behaviour:

In our tests, we made the following observations:

  1. The customer never hits the issue when running the code for a single table.
  2. The Spark job fails when the code is submitted for multiple tables.

Actual behaviour:

Error message/stack trace:

Code sample or example on how to reproduce the issue:
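The original report leaves this section empty; here is an illustrative sketch of a job of this shape, assuming the customer writes several tables concurrently. The write options (host, db, user, password, staging_fs_url, table) follow the connector's documented V2 API, but the table names, S3 paths, placeholder transformation, and use of Futures are assumptions for illustration only:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object MultiTableLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-table-load").getOrCreate()

    // Hypothetical table names, standing in for the Kafka-produced
    // files described in the steps above.
    val tables = Seq("events", "sessions", "users")

    // Launch one write per table concurrently. With a singleton JDBC
    // connection inside the connector, the first load to finish closes
    // the connection out from under the other in-flight loads.
    val loads = tables.map { t =>
      Future {
        val df = spark.read.parquet(s"s3a://example-bucket/input/$t/")
          .filter("some_column IS NOT NULL") // placeholder transformation

        df.write.format("com.vertica.spark.datasource.VerticaSource")
          .options(Map(
            "host"           -> "vertica.example.com",
            "db"             -> "exampledb",
            "user"           -> "dbadmin",
            "password"       -> "******",
            "staging_fs_url" -> "s3a://example-bucket/staging/",
            "table"          -> t
          ))
          .mode(SaveMode.Append)
          .save()
      }
    }

    Await.result(Future.sequence(loads), Duration.Inf)
    spark.stop()
  }
}
```

Under the singleton-connection behaviour described above, whichever load finishes first closes the shared connection and the remaining loads fail, whereas running the same code for a single table succeeds.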

Spark Connector Logs