CMBCKSRL opened this issue
Hi,
Is it actually possible to access GCS via parquet_s3_fdw? If it is, then could you please point out where I am mistaken in my code?
Actually, we have not tried GCS with this FDW, so we're sorry for not being able to provide useful information at the moment.
Is it possible to use the endpoint option to connect to non-S3 services? Perhaps you need to enable use_minio, but if the API is the same, it may work?
Because we could not find definite information that the AWS SDK for C++ works with GCS, I cannot tell whether the current implementation (which uses the AWS SDK) will work or not. About the use_minio option: because the scheme is hard-coded to Aws::Http::Scheme::HTTP in parquet_s3_fdw_connection.cpp, maybe we should change it to HTTPS, at least for GCS.
FWIW I'm getting the same error with a parquet file on AWS S3, so this is not just about GCS. A different parquet file in the same S3 bucket works fine, though.
FWIW I'm getting the same error with a parquet file on Ali OSS. I roughly looked at the source code and suspect that this piece of code has an issue.
parquet_s3_fdw_connection.cpp:
```cpp
if (use_minio)
{
    const Aws::String defaultEndpoint = "127.0.0.1:9000";
    clientConfig.scheme = Aws::Http::Scheme::HTTP;
    clientConfig.endpointOverride = endpoint ? (Aws::String) endpoint : defaultEndpoint;
    s3_client = new Aws::S3::S3Client(cred, clientConfig,
            Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, false);
}
else
{
    const Aws::String defaultRegion = "ap-northeast-1";
    clientConfig.scheme = Aws::Http::Scheme::HTTPS;
    clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
    s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```
I think the else block is missing the clientConfig.endpointOverride assignment. The corrected code might look like this:
```cpp
else
{
    const Aws::String defaultRegion = "ap-northeast-1";
    clientConfig.scheme = Aws::Http::Scheme::HTTPS;
    clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
    // There may be a default endpoint for AWS S3, but if you use GCS or Ali OSS,
    // you should specify the endpoint from external configuration.
    if (endpoint)
        clientConfig.endpointOverride = (Aws::String) endpoint;
    s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```
+1 to support GCS.
AWS S3 (installed according to the instructions):
```sql
postgres=# CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user '*******', password '********');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE test_table (
  c1 INT2 OPTIONS (key 'true'),
  c2 TEXT,
  c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://vitabaks/parquet-test/test_table.parquet');
CREATE FOREIGN TABLE
postgres=# INSERT INTO test_table VALUES (1, 'text1', true), (2, DEFAULT, false), ((select 3), (select i from (values('values are fun!')) as foo (i)), true);
ERROR: parquet_s3_fdw: failed to open Parquet file HeadObject failed
postgres=#
```
Access to this bucket (using the same key and secret that were specified in the USER MAPPING) is available, which is confirmed by aws-cli:
```
root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks
PRE parquet-test/
PRE walg/
root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks/parquet-test/
2023-11-24 19:17:36 0
```
Does it look like there is a problem with the extension or its dependencies?
Hello, thanks for your report.
According to the behavior, I think the problem is related to a proxy. I can reproduce this problem when using parquet_s3_fdw in a proxy environment. I would like to confirm: are you working in a proxy environment? To connect to AWS S3 through a proxy, the proxy settings of ClientConfiguration must be set. They include proxyScheme, proxyHost, proxyPort, proxyUserName and proxyPassword.
The current implementation of the use_minio, region and endpoint options is not good. It only allows choosing either region or endpoint to connect to AWS S3 based on the value of the use_minio option, while the AWS SDK can actually use both options to connect. They should be updated.
According to your comments and the situation above, I attached a patch file that temporarily changes the implementation. You can use region, endpoint, or both to connect: parquet_s3_fdw_issue_18.patch. If you are working in a proxy environment, please update the proxy values in the patch file before applying it. If you are not, please remove the proxy settings from the patch file. However, because parquet_s3_fdw has never been tested with GCS or Ali OSS, I'm not sure whether it will work with them. Could you try it and let me know the result?
If this patch solves your problems, we will include it in the next release. If your problem is different, please provide more detailed information on how to reproduce it.
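For reference, with that patch applied the SERVER could be created with region, endpoint, or both. A minimal sketch for GCS, assuming the patch is applied; the server name and the use of storage.googleapis.com as GCS's S3-compatible endpoint are illustrative and untested with this FDW:
```sql
-- Sketch only: requires the patch; the endpoint value is GCS's S3-compatible
-- endpoint and has not been verified to work with parquet_s3_fdw.
CREATE SERVER parquet_gcs_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (region 'us-west1', endpoint 'storage.googleapis.com');
```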
I have the same question; has this issue been resolved? I did not use any proxies.
@CHEN-Jing194 Thanks for your report.
The root cause is not clear, so the issue has not been resolved.
Possibly, the hard-coded scheme (HTTP) is the root cause.
Could you help me by trying this patch file?
In this patch file, I added a new scheme option for SERVER. Please specify the scheme http or https when creating the SERVER and try again.
Thank you for your support.
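If the patch applies cleanly, the new option would be used roughly like this (a sketch only; the exact option name and accepted values come from the patch file):
```sql
-- Sketch only: the scheme option exists only after applying the patch.
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (region 'us-east-1', scheme 'https');
```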
I found that there is no corresponding Parquet file on my S3. When I create a foreign table, no new file is created on S3. So, what should I do to use the INSERT statement after creating the table?
@CHEN-Jing194
Please create a foreign table which specifies a key column. When you execute an INSERT query, a new parquet file will be created automatically on S3.
Example:
```sql
CREATE FOREIGN TABLE example_insert (
c1 INT2 OPTIONS (key 'true'),
c2 TEXT,
c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://data/example_insert.parquet');
INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
```
And please let me know if you face the original problem of this issue.
This issue still exists.🥹
@CHEN-Jing194, sorry for the unclear explanation.
My example above is not correct. It is for the case where the parquet file already exists. Because the filename option is specified, parquet_s3_fdw looks for that exact file on S3, and an error occurs.
If no parquet file exists on S3, you need to use the dirname option of the FOREIGN TABLE.
There are 2 ways to use the dirname option to generate a parquet file automatically:
- Specify the dirname option: a new file with the name format [dirname]/[table_name]-[current_time].parquet is generated automatically.
```sql
contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw');
CREATE FOREIGN TABLE
contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2
```
You can see the generated file on S3:
```
[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 10:47:23 713 example_insert-2024-06-25 03:47:22.572209.parquet
```
- Specify the insert_file_selector option: insert_file_selector allows the user to define a rule for generating the file name. A new file following this rule will be generated. For example, suppose I want the generated file to have the format [dirname]/new_file.parquet:
```sql
contrib_regression=# CREATE FUNCTION selector(dirname text)
RETURNS TEXT AS
$$
SELECT dirname || '/new_file.parquet';
$$
LANGUAGE SQL;
contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw', insert_file_selector 'selector(dirname)');
CREATE FOREIGN TABLE
contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2
```
You can see the generated file on S3:
```
[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 11:05:13 713 new_file.parquet
```
Please let me know if there is any problem.
But how do I set the Access Key Id? I only see options to set a username and password.
You can set user to the Access Key ID and password to the Secret Access Key when creating the USER MAPPING:
```sql
CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 'Access Key value', password 'Secret access key value');
```
But how do I set the Access Key Id? I only see options to set a username and password.
You can set
username
as Access Key, andpassword
as Secret access key when creating USER MAPPING.CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 'Access Key value', password 'Secret access key value');
Now the error is different. It seems like the file still hasn't been created on S3.
Could you share the SQL statements you used to create the SERVER and the FOREIGN TABLE example_insert?
Of course, I went through the process again.
Maybe something is wrong with the region the S3 bucket was created in?
failed 🤧
The SERVER must be created with the region or endpoint option. If not, parquet_s3_fdw will use the default values (ap-northeast-1 and 127.0.0.1:9000). region and endpoint are options of SERVER; you can't use them for a FOREIGN TABLE. Please try again:
```sql
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-east-1');
```
I specified the region but there is still a problem.
However, I can still use pg_tier normally, which is a bit strange.
@son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension.😭
@CHEN-Jing194 No problem. Do you have any other issue with this FDW?
Can foreign tables be partitioned? This should help reduce the amount of data scanned and lower costs.
@CHEN-Jing194 Sorry for the late response.
Could you confirm which of the following understandings matches your expectation?
1. Use multiple foreign tables as partitions of a partitioned table. parquet_s3_fdw supports this. You can prepare a similar model to use: the data must be split into multiple parquet files, and several foreign tables must be created to build that model. When querying, the partitioned table scans only the corresponding foreign table, without scanning all foreign tables. You can refer to the test file partition.sql for an example (a minimal sketch follows this list).
2. Use only 1 foreign table. Depending on the query condition, the foreign table scans only the corresponding part of the parquet file without scanning the whole file. This understanding matches the "chunk" structure of the parquet file format: https://parquet.apache.org/docs/file-format/ If this is your expectation, current parquet_s3_fdw does not support it.
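A minimal sketch of the first way, assuming a server named parquet_s3_srv and two parquet files split by a date column (table names and S3 paths are hypothetical):
```sql
-- Parent partitioned table (a plain table, partitioned by range).
CREATE TABLE example_part (
    id      INT,
    logdate DATE,
    payload TEXT
) PARTITION BY RANGE (logdate);

-- One foreign table per parquet file.
CREATE FOREIGN TABLE example_part_2023 (
    id      INT OPTIONS (key 'true'),
    logdate DATE,
    payload TEXT
) SERVER parquet_s3_srv OPTIONS (filename 's3://bucket/example_2023.parquet');

CREATE FOREIGN TABLE example_part_2024 (
    id      INT OPTIONS (key 'true'),
    logdate DATE,
    payload TEXT
) SERVER parquet_s3_srv OPTIONS (filename 's3://bucket/example_2024.parquet');

-- Attach the foreign tables as partitions of the parent.
ALTER TABLE example_part ATTACH PARTITION example_part_2023
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
ALTER TABLE example_part ATTACH PARTITION example_part_2024
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Only example_part_2024 (one parquet file) should be scanned here.
SELECT * FROM example_part WHERE logdate >= '2024-03-01';
```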
@son-phamngoc Hello, I found the partition.sql file and did some testing according to the examples. I found some strange behavior. My query is correct only when the WHERE condition includes the partition key and only involves one partition. Otherwise, the returned records are duplicated, as shown in the image below.
Additionally:
- I cannot delete data.
- When inserting data, I need to specify the specific foreign partition table; I cannot insert directly into the parent table.
@CHEN-Jing194
Could you send me the SQL that you used to create the partitioned table and foreign tables? Are your example_part1 and example_part2 foreign tables referring to the same parquet file? If so, the data will be duplicated when both foreign tables are scanned.
I cannot delete data
UPDATE and DELETE can work only when key columns are specified. Please alter the foreign table to specify a key column using the key option, as sketched below.
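A minimal sketch, assuming a foreign table named example_part1 whose id column should act as the key (names are hypothetical):
```sql
-- Mark the id column as a key column so UPDATE/DELETE can locate target rows.
ALTER FOREIGN TABLE example_part1
    ALTER COLUMN id OPTIONS (ADD key 'true');

-- After that, deletes should work:
DELETE FROM example_part1 WHERE id = 1;
```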
When inserting data, I need to specify the specific external partition table, and cannot directly insert into the main table.
Thank you for reporting. I can reproduce this problem. Possibly this is a bug in parquet_s3_fdw. We will fix it in a future release.
@son-phamngoc
@CHEN-Jing194
Your 2 foreign tables are referring to the same dirname.
When you executed the 1st INSERT query on the example_part2 foreign table, a new parquet file was created with the format [dirname]/[table_name]-[current_time].parquet.
Next, when you executed the 2nd INSERT query on example_part1, parquet_s3_fdw looked for a parquet file that has that dirname and whose schema matches the schema of the foreign table. Because example_part1 and example_part2 have the same schema and the same dirname, the data was inserted into the same parquet file.
You can check the files on S3: only 1 parquet file was created.
Therefore, when you select data, both example_part1 and example_part2 refer to the same parquet file, so the data is duplicated.
If you set different dirname values for the 2 foreign tables, 2 separate parquet files will be generated, and you will not get duplicated data, for example as sketched below.
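A minimal sketch with two different dirname values (bucket and prefixes are hypothetical):
```sql
-- Each foreign table gets its own dirname, so INSERTs create separate parquet files.
CREATE FOREIGN TABLE example_part1 (
    id INT2 OPTIONS (key 'true'),
    c2 TEXT
) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part1');

CREATE FOREIGN TABLE example_part2 (
    id INT2 OPTIONS (key 'true'),
    c2 TEXT
) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part2');

INSERT INTO example_part1 VALUES (1, 'aaa');  -- file created under .../part1/
INSERT INTO example_part2 VALUES (2, 'bbb');  -- file created under .../part2/
```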
@son-phamngoc Thank you for patiently answering my questions. Yes, that's correct. It worked when I pointed to two different folders separately. Everything is running fine except for the insert operation mentioned above. Another question, why must key columns values be unique? Does it mean that the values of each key column must be unique, or does it mean that the combination of values from multiple key columns must be unique? What problems will arise if I do not ensure this uniqueness?
@CHEN-Jing194
Thank you for patiently answering my questions.
No problem. You are welcome.
Another question, why must key columns values be unique? Does it mean that the values of each key column must be unique, or does it mean that the combination of values from multiple key columns must be unique? What problems will arise if I do not ensure this uniqueness?
parquet_s3_fdw uses the key column values to find the correct target record to be updated/deleted. If you use only 1 key column, the values in that column should be unique. If you use multiple key columns, the values in a single key column can be duplicated, but the combination of key column values should be unique. If you do not ensure this uniqueness, parquet_s3_fdw can find the wrong target and then update/delete the wrong records. For example: I have a parquet file and a foreign table t1 which has 3 columns: id (int) as key column, c1 (int), c2 (text). The data is as follows:
| id | c1 | c2 |
|----|-----|-----|
| 1 | 200 | aaa |
| 1 | 100 | bbb |
I want to update all records which have c1 = 100, so the SQL query is: UPDATE t1 SET c2 = 'ccc' WHERE c1 = 100.
Firstly, parquet_s3_fdw gets all records from the parquet file, caches them, and then iterates over each record.
The 1st record |1|200|aaa| does not match the condition c1 = 100, so it is skipped.
The 2nd record |1|100|bbb| matches the condition c1 = 100, so its key value is retrieved (id = 1).
Next, parquet_s3_fdw uses the key value id = 1 to search the cache for the record to be updated.
It then finds the 1st record in the cache which matches the condition id = 1, and updates that record.
However, that record has c1 = 200, so it is not the correct target.
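As a sketch, the scenario above could be reproduced like this (the file path is hypothetical):
```sql
-- id is declared as the key column, but its values are not unique.
CREATE FOREIGN TABLE t1 (
    id INT OPTIONS (key 'true'),
    c1 INT,
    c2 TEXT
) SERVER parquet_s3_srv OPTIONS (filename 's3://bucket/t1.parquet');

-- Intends to change only the row (1, 100, 'bbb') ...
UPDATE t1 SET c2 = 'ccc' WHERE c1 = 100;
-- ... but because both rows share id = 1, the row (1, 200, 'aaa')
-- may be updated instead.
```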
@son-phamngoc I don't have any more questions. Now I feel like I can use this extension correctly. Thanks, guys!
@CHEN-Jing194 I'm glad to hear that. I have a question: did you use my patch in https://github.com/pgspider/parquet_s3_fdw/issues/18#issuecomment-2133065655? Did you encounter the connection issue when not using my patch?
@son-phamngoc No, I did not use that patch. The initial issue I encountered was simply because there was no corresponding file on s3.
@CHEN-Jing194 Thank you for your answer.
@CMBCKSRL @mausch @ZhiXingHeYiApple @vitabaks Would you try my patch file at https://github.com/pgspider/parquet_s3_fdw/issues/18#issuecomment-2133065655 and let me know whether it fixes your issue? Your feedback is useful for us to improve the quality of this FDW. Thank you for your support.
@son-phamngoc Hello, I have another question. Does the FDW download the corresponding parquet file locally and then perform the SQL operations on it? I tested it and found that the traffic is quite high. I originally thought it was using S3 Select, but it seems it is not.
I am trying to use parquet_s3_fdw to connect to my GCS bucket and extract data from parquet files but it seems to be impossible (or I've made a mistake in my code).
Here is what I do.
Firstly, I create the extension:
```sql
CREATE EXTENSION parquet_s3_fdw;
```
Then I create the server:
```sql
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');
```
My GCS bucket region is us-west1 (Oregon) but I also tried us-west2.
Afterwards, I create the user mapping:
```sql
CREATE USER MAPPING FOR CURRENT_USER SERVER parquet_s3_srv OPTIONS (user '<access_key>', password '<secret_key>');
```
I don't think there is a problem with these keys because I was able to access my bucket from ClickHouse.
In the end I create the foreign table.
But when I query this foreign table I get this error:
```sql
select * from natality_parquet limit 5;
```
```
SQL Error [XX000]: ERROR: parquet_s3_fdw: failed to exctract row groups from Parquet file: failed to open Parquet file HeadObject failed
```
Is it actually possible to access GCS via parquet_s3_fdw? If it is, then could you please point out where I am mistaken in my code?