maxim-lixakov opened this issue 2 months ago
Hello!
The reason is probably that your Greenplum instance, running inside a Docker container, is not able to reach the gpfdist server which the connector starts in the context of your Spark session.
For example, as can be seen from the log, at this moment the gpfdist server is listening on the following address:
gpfdist://10.195.113.139:56701/output.pipe
where port 56701 is a dynamic random port that changes on every operation.
So you could try adjusting your network routing rules to let the containerized application reach arbitrary TCP ports
on the 10.195.113.139 address.
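To verify reachability, a quick TCP probe can be run from inside the Greenplum container (a minimal sketch; the address and port below are taken from the log line above and change on every run, and any equivalent tool such as nc works just as well):

```scala
import java.net.{InetSocketAddress, Socket}

// Quick reachability probe for the gpfdist endpoint reported by the connector.
// The host/port are taken from the log above and differ on every operation,
// so substitute the values printed by your own Spark session.
object GpfdistProbe {
  def main(args: Array[String]): Unit = {
    val host = "10.195.113.139"
    val port = 56701
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), 5000) // 5 s timeout
      println(s"Reached gpfdist at $host:$port")
    } catch {
      case e: Exception => println(s"Cannot reach $host:$port: ${e.getMessage}")
    } finally {
      socket.close()
    }
  }
}
```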
However, we don't recommend running Greenplum or Spark in a container, because we haven't tested such a scenario and believe it makes little sense.
Also, there could be yet another network-related problem in your setup:
the second log reveals the address 192.168.1.69: how is it related to 10.195.113.139?
Do you have several network cards?
Or did the network configuration change between passes?
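To see which addresses the host actually exposes (and therefore which one the gpfdist server may bind to), something like this sketch can be run on the Spark host; it only uses the standard JDK NetworkInterface API:

```scala
import java.net.NetworkInterface
import scala.collection.JavaConverters._

// Lists every non-loopback address on each active interface; several entries
// here would explain why the logs show both 10.195.113.139 and 192.168.1.69.
object ListHostAddresses {
  def main(args: Array[String]): Unit = {
    NetworkInterface.getNetworkInterfaces.asScala
      .filter(_.isUp)
      .foreach { nic =>
        nic.getInetAddresses.asScala
          .filterNot(_.isLoopbackAddress)
          .foreach(addr => println(s"${nic.getName}: ${addr.getHostAddress}"))
      }
  }
}
```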
> The reason is probably that your Greenplum instance, running inside a Docker container, is not able to reach the gpfdist server which the connector starts in the context of your Spark session.
Reading data from the Greenplum container to the Spark executor works, but for some reason it takes a minute to read a table with just 4 rows.
Also, writing from the Spark executor to the same Greenplum container fails with a timeout. Having network access for INSERT INTO WRITABLE EXTERNAL TABLE -> Spark executor gpfdist server,
but not for SELECT FROM READABLE EXTERNAL TABLE -> Spark executor gpfdist server,
does not sound plausible to me.
Oh, I see, you are right - reading "somehow" works, and writing doesn't at all.
By the way, I doubt that applying the guessMaxParallelTasks patch you mentioned is a good idea.
The purpose of guessMaxParallelTasks is to find the number of executor instances, and it doesn't necessarily correlate with the number of partitions in the DataFrame (spark.default.parallelism).
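To illustrate the distinction with plain Spark API (nothing connector-specific): the partition count of a DataFrame and the default parallelism are independent knobs, so deriving one from the other is not generally safe.

```scala
import org.apache.spark.sql.SparkSession

// Shows that partition count and parallelism are independent settings.
object PartitionsVsParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")          // 2 local worker threads
      .appName("partitions-vs-parallelism")
      .getOrCreate()

    val df = spark.range(0, 1000).repartition(8) // 8 partitions regardless of parallelism

    println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}") // 2
    println(s"df partitions      = ${df.rdd.getNumPartitions}")               // 8

    spark.stop()
  }
}
```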
I will try to reproduce your case, but it can take some time.
For now, I'd play with the number of threads, as in this post.
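If the thread count in question is the number of local worker threads (an assumption on my part, since the referenced post is not shown here), it is controlled through the master URL of a standalone application:

```scala
import org.apache.spark.sql.SparkSession

// local[N] sets how many worker threads (and hence concurrent tasks) the
// local master uses; local[*] means one thread per available CPU core.
object LocalThreadsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")                // try different values of N here
      .appName("greenplum-connector-test")
      .getOrCreate()

    println(spark.sparkContext.master)   // prints: local[4]
    spark.stop()
  }
}
```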
Description
The Spark-Greenplum connector does not work correctly in local Spark mode (local master). Read operations involve a significant wait, and write operations fail with a timeout error. The problem is reproduced on Spark 3.x when writing data to Greenplum via spark-greenplum-connector_2.12-3.1.jar.
Steps to reproduce
To reproduce this issue you need:
docker-compose.yml:
create database test;
CREATE TABLE public.employee (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    birth_date DATE,
    salary NUMERIC(10, 2),
    is_active BOOLEAN
) DISTRIBUTED BY (employee_id);

INSERT INTO public.employee (first_name, last_name, birth_date, salary, is_active) VALUES
    ('John', 'Doe', '1985-05-15', 55000.00, TRUE),
    ('Jane', 'Smith', '1990-10-25', 62000.00, TRUE),
    ('Mark', 'Johnson', '1978-03-12', 70000.00, FALSE),
    ('Lucy', 'Williams', '1983-07-19', 48000.00, TRUE);
SELECT * FROM employee;
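The issue description does not include the Spark code; a hypothetical sketch of the kind of read/write round trip that triggers the behaviour could look like the following. The format name "greenplum" and the option keys ("url", "dbtable") are placeholders I am assuming for illustration, not the connector's confirmed API; consult the connector's documentation for the real names.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch only: format name and option keys are assumptions.
object GreenplumRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("greenplum-issue-repro")
      .getOrCreate()

    val df = spark.read
      .format("greenplum")                                      // assumed format name
      .option("url", "jdbc:postgresql://localhost:5432/test")   // assumed option key
      .option("dbtable", "public.employee")                     // assumed option key
      .load()

    df.show()          // read path: works, but takes about a minute for 4 rows

    df.write
      .format("greenplum")
      .option("url", "jdbc:postgresql://localhost:5432/test")
      .option("dbtable", "public.employee")
      .mode("append")
      .save()          // write path: fails with a timeout

    spark.stop()
  }
}
```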
Logs
Logs while trying to read data from the table:
As can be seen from the logs, the interaction between RMISlave and RMIMaster takes more than 30 seconds.
Logs while trying to write data to the table:
Environment
Connector version: spark-greenplum-connector_2.12-3.1.jar
Java version, Scala version: Java 1.8.0, Scala 2.12
OS: macOS 13.4.1 (22F82)