oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0

ArrowDataSource not supporting gs (Google Cloud Storage Buckets) for storage #443

Open HongW2019 opened 3 years ago

HongW2019 commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When we enabled Gazelle on Google Cloud Dataproc with gs (Google Cloud Storage buckets) as the storage layer instead of HDFS, we found that gs is not currently supported.

[Screenshot: bucket-issue]

Describe the solution you'd like

ArrowDataSource currently supports S3, and Spark on Google Dataproc supports gs for cloud storage. To support cloud storage on Dataproc clusters, ArrowDataSource also needs to support gs (Google Cloud Storage).
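To make the gap concrete, here is a minimal sketch of the kind of URI scheme check involved. It is not taken from the ArrowDataSource code: the function name `is_supported` and the scheme list are hypothetical and chosen only to illustrate that a `gs://` path falls outside the currently handled schemes until `gs` is added.

```python
from urllib.parse import urlparse

# Assumption: an illustrative scheme list, NOT copied from ArrowDataSource.
SUPPORTED_SCHEMES = frozenset({"file", "hdfs", "s3", "s3a"})


def is_supported(uri, supported=SUPPORTED_SCHEMES):
    """Return True if the URI's scheme is in the supported set.

    Paths without a scheme (e.g. /tmp/data) are treated as local files.
    """
    scheme = urlparse(uri).scheme or "file"
    return scheme in supported


if __name__ == "__main__":
    for path in ("hdfs://namenode:8020/tpcds", "s3a://bucket/tpcds", "gs://bucket/tpcds"):
        print(path, is_supported(path))
    # Hypothetical situation once gs support is added:
    print(is_supported("gs://bucket/tpcds", SUPPORTED_SCHEMES | {"gs"}))
```

In this sketch, requesting gs support amounts to extending the set of accepted schemes; the real work would be in the underlying filesystem implementation that actually reads from the bucket.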

HongW2019 commented 3 years ago

@zhztheplayer @weiting-chen @zhouyuan Please take a look, thanks a lot.

zhixingheyi-tian commented 3 years ago

@HongW2019 If you have verified that the Spark TPC-DS benchmark can run on GS, please paste the script and results in this issue as a baseline.

zhztheplayer commented 3 years ago

This issue depends on an upstream topic in the Arrow community about GCS support: https://issues.apache.org/jira/browse/ARROW-1231

HongW2019 commented 3 years ago

@zhixingheyi-tian

[Screenshots: issue-answer2, issue-answer1]