pgspider / parquet_s3_fdw

ParquetS3 Foreign Data Wrapper for PostgreSQL

GCS #18

Open CMBCKSRL opened 1 year ago

CMBCKSRL commented 1 year ago

I am trying to use parquet_s3_fdw to connect to my GCS bucket and extract data from parquet files but it seems to be impossible (or I've made a mistake in my code).

here is what I do

First, I create the extension: CREATE EXTENSION parquet_s3_fdw;

Then I create the server: CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');

My GCS bucket region is us-west1 (Oregon) but I also tried us-west2.

Afterwards, I create the user mapping: CREATE USER MAPPING FOR CURRENT_USER SERVER parquet_s3_srv OPTIONS (user '<access_key>', password '<secret_key>');

I don't think there is a problem with these keys, because I was able to access my bucket from ClickHouse.

Finally, I create the foreign table:

CREATE FOREIGN TABLE natality_parquet (
  source_year TEXT,
  year TEXT,
  month TEXT,
  day TEXT,
  wday TEXT,
  state TEXT,
  is_male TEXT,
  child_race TEXT,
  weight_pounds TEXT,
  plurality TEXT,
  apgar_1min TEXT,
  apgar_5min TEXT,
  mother_residence_state TEXT,
  mother_race TEXT,
  mother_age TEXT,
  gestation_weeks TEXT,
  lmp TEXT,
  mother_married TEXT,
  mother_birth_state TEXT,
  cigarette_use TEXT,
  cigarettes_per_day TEXT,
  alcohol_use TEXT,
  drinks_per_week TEXT,
  weight_gain_pounds TEXT,
  born_alive_alive TEXT,
  born_alive_dead TEXT,
  born_dead TEXT,
  ever_born TEXT,
  father_race TEXT,
  father_age TEXT,
  record_weight TEXT
) SERVER parquet_s3_srv
OPTIONS (
  filename 's3://example_bucket_natality2/000000000000.parquet'
);

But when I query this foreign table (select * from natality_parquet limit 5;) I get this error:

SQL Error [XX000]: ERROR: parquet_s3_fdw: failed to exctract row groups from Parquet file: failed to open Parquet file HeadObject failed

Is it actually possible to access GCS via parquet_s3_fdw? If it is, could you please point out where I am mistaken in my code?

hrkuma commented 1 year ago

Hi,

Is it actually possible to access GCS via parquet_s3_fdw? If it is, could you please point out where I am mistaken in my code?

Actually, we have not tried GCS with this FDW, so we're sorry we can't provide useful information at this moment.

wuputah commented 1 year ago

Is it possible to use the endpoint option to connect to non-S3 services? Perhaps you need to enable use_minio, but if the API is the same, it may work.

hrkuma commented 1 year ago

Because we did not find definitive information that the AWS SDK for C++ can work with GCS, I cannot tell whether the current implementation (which uses the AWS SDK) will work or not. Regarding the use_minio option: because the scheme is hard-coded as "Aws::Http::Scheme::HTTP" in parquet_s3_fdw_connection.cpp, we should probably change it to HTTPS, at least for GCS.

mausch commented 1 year ago

FWIW I'm getting the same error with a parquet file on AWS S3 so this is not just about GCS. A different parquet file in the same S3 bucket works fine though.

ZhiXingHeYiApple commented 1 year ago

FWIW I'm getting the same error with a parquet file on Ali OSS. I roughly looked at the source code and suspect that this piece of code has an issue.

parquet_s3_fdw_connection.cpp

```cpp
if (use_minio)
{
    const Aws::String defaultEndpoint = "127.0.0.1:9000";
    clientConfig.scheme = Aws::Http::Scheme::HTTP;
    clientConfig.endpointOverride = endpoint ? (Aws::String) endpoint : defaultEndpoint;
    s3_client = new Aws::S3::S3Client(cred, clientConfig,
            Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, false);
}
else
{
    const Aws::String defaultRegion = "ap-northeast-1";
    clientConfig.scheme = Aws::Http::Scheme::HTTPS;
    clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
    s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```

I think the else block is missing clientConfig.endpointOverride = endpoint ? (Aws::String) endpoint : .... The corrected code might be as below:

```cpp
else
{
    const Aws::String defaultRegion = "ap-northeast-1";
    clientConfig.scheme = Aws::Http::Scheme::HTTPS;
    clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
    // There may be a default endpoint value for AWS S3, but if you use GCS or
    // Ali OSS, you should specify the endpoint from external configuration.
    clientConfig.endpointOverride = (Aws::String) endpoint;
    s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```
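For what it's worth, if the else branch is fixed as above, connecting to a non-AWS service might look something like this. This is only a sketch, assuming the endpoint server option is then honored on the HTTPS path; the endpoint value shown is GCS's S3-interoperability hostname and is illustrative:

```sql
-- Sketch (assumes the endpointOverride fix above): point the FDW at GCS's
-- S3-compatible endpoint. GCS serves this endpoint over HTTPS only, which is
-- why the hard-coded HTTP scheme on the use_minio path does not work for it.
CREATE SERVER gcs_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (endpoint 'storage.googleapis.com');
```
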
vitabaks commented 1 year ago

+1 to support GCS.

vitabaks commented 1 year ago

AWS S3 (installed according to the instructions)

postgres=# CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user '*******', password '********');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE test_table (
    c1 INT2 OPTIONS (key 'true'),
    c2 TEXT,
    c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://vitabaks/parquet-test/test_table.parquet');
CREATE FOREIGN TABLE
postgres=# INSERT INTO test_table VALUES (1, 'text1', true), (2, DEFAULT, false), ((select 3), (select i from (values('values are fun!')) as foo (i)), true);
ERROR:  parquet_s3_fdw: failed to open Parquet file HeadObject failed
postgres=# 

Access to this bucket (using the same key and secret that were specified in the USER MAPPING) is available, as confirmed by the aws-cli:

root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks
                           PRE parquet-test/
                           PRE walg/
root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks/parquet-test/
2023-11-24 19:17:36          0 

Does it look like there is a problem with the extension or its dependencies?

son-phamngoc commented 10 months ago

Hello, thanks for your reporting.

Based on the behavior, I think the problem is related to a proxy. I can reproduce this problem when using parquet_s3_fdw in a proxy environment. I would like to confirm: are you working in a proxy environment? To connect to AWS S3 through a proxy, the proxy settings of ClientConfiguration must be set. They include proxyScheme, proxyHost, proxyPort, proxyUserName and proxyPassword.

The current implementation of the use_minio, region and endpoint options is not good. It only allows choosing either region or endpoint to connect to AWS S3, based on the value of the use_minio option, while the AWS SDK can actually use both options together. They should be updated.

Based on your comments and the situation above, I have attached a patch file that temporarily changes the implementation; with it, you can use region, endpoint, or both to connect: parquet_s3_fdw_issue_18.patch. If you are working in a proxy environment, please update the proxy values in the patch file before applying it. If you are not, please remove the proxy settings from the patch file. However, because parquet_s3_fdw has never been tested with GCS or Ali OSS, I'm not sure whether it will work well with them. Could you try it and let me know the result?

If this patch solves your problems, we will apply it in the next release. If your problem is different, please provide us with more detailed information on how to reproduce it.

CHEN-Jing194 commented 6 months ago

I have the same question: has this issue been resolved? I did not use any proxies.

son-phamngoc commented 6 months ago

@CHEN-Jing194 Thanks for your report.

The root cause is not clear, so the issue has not been resolved.
Possibly the hard-coded scheme (HTTP) is the cause. Could you help me by trying this patch file? In it, I added support for a new scheme option for SERVER. Please specify the scheme (http or https) when creating the SERVER and try again. Thank you for your support.

parquet_s3_fdw_issue_18_2.patch
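
Assuming the patch is applied, creating the SERVER might look like this (the endpoint value is only illustrative):

```sql
-- Sketch (requires the patch above): select HTTPS explicitly via the new
-- scheme option instead of relying on the previously hard-coded value.
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (endpoint 'storage.googleapis.com', scheme 'https');
```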

CHEN-Jing194 commented 5 months ago

@CHEN-Jing194 Thanks for your report.

The root cause is not clear, so the issue has not been resolved. Possibly the hard-coded scheme (HTTP) is the cause. Could you help me by trying this patch file? In it, I added support for a new scheme option for SERVER. Please specify the scheme (http or https) when creating the SERVER and try again. Thank you for your support.

parquet_s3_fdw_issue_18_2.patch

I found that there is no corresponding Parquet file on my S3. When I create a foreign table, no new file is created on S3. So what should I do to use INSERT statements after creating the table?

son-phamngoc commented 5 months ago

I found that there is no corresponding Parquet file on my S3. When I create a foreign table, no new file is created on S3. So what should I do to use INSERT statements after creating the table?

@CHEN-Jing194 Please create a foreign table which specifies a key column. When you execute an INSERT query, a new parquet file will be created automatically on S3. Example:

```sql
CREATE FOREIGN TABLE example_insert (
    c1 INT2 OPTIONS (key 'true'),
    c2 TEXT,
    c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://data/example_insert.parquet');

INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
```

And please let me know if you face the original problem of this issue.

CHEN-Jing194 commented 5 months ago

I found that there is no corresponding Parquet file on my S3. When I create a foreign table, no new file is created on S3. So what should I do to use INSERT statements after creating the table?

@CHEN-Jing194 Please create a foreign table which specifies a key column. When you execute an INSERT query, a new parquet file will be created automatically on S3. Example:

```sql
CREATE FOREIGN TABLE example_insert (
    c1 INT2 OPTIONS (key 'true'),
    c2 TEXT,
    c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://data/example_insert.parquet');

INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
```

And please let me know if you face the original problem of this issue.

[screenshot] This issue still exists. 🥹

son-phamngoc commented 5 months ago

@CHEN-Jing194, sorry for the unclear explanation. My example above is not correct; it is for the case where the parquet file already exists. Because the filename option is specified, parquet_s3_fdw looks for that exact file on S3, and the error occurs.

If no parquet file exists on S3, you need to use the dirname option of the FOREIGN TABLE. There are two ways to use the dirname option to generate a parquet file automatically:

  • Specify the dirname option: a new file with the name format [dirname]/[table_name]-[current_time].parquet is generated automatically.

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw');
CREATE FOREIGN TABLE
contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 10:47:23        713 example_insert-2024-06-25 03:47:22.572209.parquet

  • Specify the insert_file_selector option: insert_file_selector lets the user define a rule for generating the file name; a new file following that rule will be generated. For example, to specify that the generated file must have the format [dirname]/new_file.parquet:

contrib_regression=# CREATE FUNCTION selector(dirname text)
RETURNS TEXT AS
$$
    SELECT dirname || '/new_file.parquet';
$$
LANGUAGE SQL;

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw', insert_file_selector 'selector(dirname)');
CREATE FOREIGN TABLE

contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 11:05:13        713 new_file.parquet
Please let me know if there is any problem.

CHEN-Jing194 commented 5 months ago

@CHEN-Jing194, sorry for the unclear explanation. My example above is not correct; it is for the case where the parquet file already exists. Because the filename option is specified, parquet_s3_fdw looks for that exact file on S3, and the error occurs.

If no parquet file exists on S3, you need to use the dirname option of the FOREIGN TABLE. There are two ways to use the dirname option to generate a parquet file automatically:

  • Specify the dirname option: a new file with the name format [dirname]/[table_name]-[current_time].parquet is generated automatically.

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw');
CREATE FOREIGN TABLE
contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 10:47:23        713 example_insert-2024-06-25 03:47:22.572209.parquet

  • Specify the insert_file_selector option: insert_file_selector lets the user define a rule for generating the file name; a new file following that rule will be generated. For example, to specify that the generated file must have the format [dirname]/new_file.parquet:
contrib_regression=# CREATE FUNCTION selector(dirname text)
RETURNS TEXT AS
$$
    SELECT dirname || '/new_file.parquet';
$$
LANGUAGE SQL;

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw', insert_file_selector 'selector(dirname)');
CREATE FOREIGN TABLE

contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 11:05:13        713 new_file.parquet

Please let me know if there is any problem.

But how do I set the Access Key ID? I only see options to set a username and password. [screenshot]

son-phamngoc commented 5 months ago

But how do I set the Access Key ID? I only see options to set a username and password.

You can set user to the Access Key and password to the Secret Access Key when creating the USER MAPPING.

CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 'Access Key value', password 'Secret access key value');

CHEN-Jing194 commented 5 months ago

But how do I set the Access Key ID? I only see options to set a username and password.

You can set user to the Access Key and password to the Secret Access Key when creating the USER MAPPING.

CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 'Access Key value', password 'Secret access key value');

[screenshot] Now the error is different. It seems the file still hasn't been created on S3.

son-phamngoc commented 5 months ago

Could you share the SQL statements that you used to create the SERVER and the FOREIGN TABLE example_insert?

CHEN-Jing194 commented 5 months ago

Could you share the SQL statements that you used to create the SERVER and the FOREIGN TABLE example_insert?

Of course, I went through the process again. [screenshot]

CHEN-Jing194 commented 5 months ago

Could you share the SQL statements that you used to create the SERVER and the FOREIGN TABLE example_insert?

Of course, I went through the process again. [screenshot]

Maybe something is wrong with the region the S3 bucket was created in?

CHEN-Jing194 commented 5 months ago

Could you share the SQL statements that you used to create the SERVER and the FOREIGN TABLE example_insert?

Of course, I went through the process again. [screenshot]

Maybe something is wrong with the region the S3 bucket was created in?

[screenshot] Failed. 🤧

son-phamngoc commented 5 months ago

The SERVER must be created with the region or endpoint option. If not, parquet_s3_fdw will use the default values (ap-northeast-1 and 127.0.0.1:9000). region and endpoint are options of the SERVER; you can't use them on a FOREIGN TABLE. Please try again:

CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-east-1');

CHEN-Jing194 commented 5 months ago

I specified the region but there is still a problem. [screenshot]

The SERVER must be created with the region or endpoint option. If not, parquet_s3_fdw will use the default values (ap-northeast-1 and 127.0.0.1:9000). region and endpoint are options of the SERVER; you can't use them on a FOREIGN TABLE. Please try again:

CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-east-1');

I specified the region but there is still a problem. [screenshot]

CHEN-Jing194 commented 5 months ago

I specified the region but there is still a problem. [screenshot]

The SERVER must be created with the region or endpoint option. If not, parquet_s3_fdw will use the default values (ap-northeast-1 and 127.0.0.1:9000). region and endpoint are options of the SERVER; you can't use them on a FOREIGN TABLE. Please try again:

CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-east-1');

I specified the region but there is still a problem. [screenshot]

However, I can still use pg_tier normally, which is a bit strange.

CHEN-Jing194 commented 5 months ago

@son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension. 😭

son-phamngoc commented 4 months ago

@son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension. 😭

@CHEN-Jing194 No problem. Do you have any other issue with this FDW?

CHEN-Jing194 commented 4 months ago

@son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension. 😭

@CHEN-Jing194 No problem. Do you have any other issue with this FDW?

Can foreign tables be partitioned? This should help reduce the amount of data scanned and lower costs.

son-phamngoc commented 4 months ago

@CHEN-Jing194 Sorry for the late response.

Could you confirm which of the following understandings matches your expectation?

  1. Use multiple foreign tables as partitions of a partitioned table. [diagram] parquet_s3_fdw supports this; you can prepare a similar model to use. The data must be split into multiple parquet files, and several foreign tables must be created to build that model. When querying, the partitioned table can scan only the corresponding foreign table instead of scanning all foreign tables. You can refer to the test file partition.sql for an example; a sketch is also shown after this list.

  2. Use only one foreign table; depending on the query condition, the foreign table scans the corresponding part of the parquet file without scanning the whole file. [diagram] This understanding matches the "chunk" structure of the parquet file format: https://parquet.apache.org/docs/file-format/ If this is your expectation, the current parquet_s3_fdw does not support it.
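
For usage 1, a minimal sketch might look like the following (the table, column, and bucket names are all hypothetical; the real example is in partition.sql):

```sql
-- Parent partitioned table; its partitions are parquet_s3_fdw foreign tables.
CREATE TABLE example (id INT, ts DATE) PARTITION BY RANGE (ts);

CREATE FOREIGN TABLE example_2023
    PARTITION OF example FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    SERVER parquet_s3_srv OPTIONS (dirname 's3://example-bucket/2023');

CREATE FOREIGN TABLE example_2024
    PARTITION OF example FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    SERVER parquet_s3_srv OPTIONS (dirname 's3://example-bucket/2024');

-- With a condition on the partition key, only the matching foreign table
-- (and therefore only its parquet files) is scanned.
SELECT * FROM example WHERE ts >= '2024-06-01';
```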

CHEN-Jing194 commented 4 months ago

@son-phamngoc Hello, I found the partition.sql file and did some testing following the examples, and I found some strange behavior. My query is correct only when the WHERE condition includes the partition key and involves only one partition. Otherwise, the returned records are duplicated, as shown in the screenshot below. [screenshot]

Additionally,

  1. I cannot delete data. [screenshot]

  2. When inserting data, I need to specify the specific foreign partition table; I cannot insert directly into the main table. [screenshot]

son-phamngoc commented 4 months ago

@CHEN-Jing194 Could you send me the SQL that you used to create the partitioned table and foreign tables? Are your example_part1 and example_part2 foreign tables referring to the same parquet file? If so, the data will be duplicated when both foreign tables are scanned.

I cannot delete data

UPDATE and DELETE work only when key columns are specified. Please alter the foreign table to specify a key column using the key option. [screenshot]
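
A sketch of what that might look like (the table and column names are taken from the earlier examples and are illustrative):

```sql
-- Mark c1 as the key column on an existing foreign table so that
-- parquet_s3_fdw can locate target rows for UPDATE/DELETE.
ALTER FOREIGN TABLE example_part1 ALTER COLUMN c1 OPTIONS (ADD key 'true');
```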

When inserting data, I need to specify the specific foreign partition table; I cannot insert directly into the main table.

Thank you for reporting this. I can reproduce the problem. Possibly this is a bug in parquet_s3_fdw. We will fix it in a future release.

CHEN-Jing194 commented 4 months ago

@son-phamngoc [screenshots]

son-phamngoc commented 4 months ago

@CHEN-Jing194 Your 2 foreign tables are referring to the same dirname. When you executed the 1st INSERT query on the example_part2 foreign table, a new parquet file was created with the format [dirname]/[table_name]-[current_time].parquet. Next, when you executed the 2nd INSERT query on example_part1, parquet_s3_fdw looked for a parquet file under that dirname whose schema matches the schema of the foreign table. Because example_part1 and example_part2 have the same schema and the same dirname, the data was inserted into the same parquet file. You can check the file on S3: only one parquet file was created. Therefore, when you select data, both example_part1 and example_part2 refer to the same parquet file, so the data is duplicated. If you set different dirname values for the 2 foreign tables, 2 separate parquet files will be generated, and you will not get duplicated data; see the sketch below.
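
For example (a sketch; the bucket and prefixes are hypothetical):

```sql
-- Give each foreign table its own dirname prefix so each INSERT creates
-- and targets its own parquet file.
CREATE FOREIGN TABLE example_part1 (c1 INT2 OPTIONS (key 'true'), c2 TEXT)
    SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part1');

CREATE FOREIGN TABLE example_part2 (c1 INT2 OPTIONS (key 'true'), c2 TEXT)
    SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part2');
```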

CHEN-Jing194 commented 4 months ago

@son-phamngoc Thank you for patiently answering my questions. Yes, that's correct: it worked when I pointed the tables at two different folders. Everything is running fine except for the insert operation mentioned above. Another question: why must key column values be unique? Does it mean that the values of each key column must be unique, or that the combination of values from multiple key columns must be unique? What problems will arise if I do not ensure this uniqueness?

son-phamngoc commented 4 months ago

@CHEN-Jing194

Thank you for patiently answering my questions.

No problem. You are welcome.

Another question: why must key column values be unique? Does it mean that the values of each key column must be unique, or that the combination of values from multiple key columns must be unique? What problems will arise if I do not ensure this uniqueness?

parquet_s3_fdw uses key column values to find the correct target record to be updated/deleted. If you use only one key column, the values in that column should be unique. If you use multiple key columns, the values in a single key column can be duplicated, but the combination of key column values should be unique. If you do not ensure uniqueness, parquet_s3_fdw can find the wrong target and then update/delete the wrong records. For example: I have a parquet file and a foreign table t1 which has 3 columns: id (int) as the key column, c1 (int), c2 (text). The data is as follows:

| id | c1  | c2  |
|----|-----|-----|
| 1  | 200 | aaa |
| 1  | 100 | bbb |

I want to update all records which have c1 = 100, so the SQL query is: UPDATE t1 SET c2 = 'ccc' WHERE c1 = 100. First, parquet_s3_fdw gets all records from the parquet file, caches them, and then iterates over each record. The 1st record |1|200|aaa| does not match the condition c1 = 100, so it is skipped. The 2nd record |1|100|bbb| matches the condition c1 = 100, so its key value is retrieved (id = 1). Next, parquet_s3_fdw uses the key value id = 1 to search the cache for the record to be updated. It finds the 1st record in the cache that matches id = 1, and updates that record. However, that record has c1 = 200, so it is not the correct target.
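
The scenario above, written out (a sketch; the file path is hypothetical):

```sql
-- id is the key column, but its values are not unique in the file.
CREATE FOREIGN TABLE t1 (
    id INT OPTIONS (key 'true'),
    c1 INT,
    c2 TEXT
) SERVER parquet_s3_srv OPTIONS (filename 's3://data/t1.parquet');

-- Intends to change only the row with c1 = 100, but because the target row
-- is relocated by its key value alone (id = 1), the row with c1 = 200 may
-- be rewritten instead.
UPDATE t1 SET c2 = 'ccc' WHERE c1 = 100;
```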

CHEN-Jing194 commented 4 months ago

@son-phamngoc I don't have any more questions. Now I feel like I can use this extension correctly. Thanks, guys!

son-phamngoc commented 4 months ago

@CHEN-Jing194 I'm glad to hear that. I have a question: did you use my patch in https://github.com/pgspider/parquet_s3_fdw/issues/18#issuecomment-2133065655? Did you encounter the connection issue when not using my patch?

CHEN-Jing194 commented 4 months ago

@son-phamngoc No, I did not use that patch. The initial issue I encountered was simply because there was no corresponding file on S3.

son-phamngoc commented 4 months ago

@CHEN-Jing194 Thank you for your answer.

@CMBCKSRL @mausch @ZhiXingHeYiApple @vitabaks Would you try my patch file at https://github.com/pgspider/parquet_s3_fdw/issues/18#issuecomment-2133065655 and let me know whether it fixes your issue? Your feedback is useful for improving the quality of this FDW. Thank you for your support.

CHEN-Jing194 commented 3 months ago

@son-phamngoc Hello, I have another question. Does the FDW download the corresponding parquet file locally and then perform the SQL operations on it? I tested it and found that the traffic is quite high. I originally thought it was using S3 Select functionality, but it seems it is not.