projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

[Bug]: Nessie Iceberg Minio access problem #9533

Closed MuslimBeibytuly closed 2 weeks ago

MuslimBeibytuly commented 2 weeks ago

What happened

    tx.append(df=df, snapshot_properties=snapshot_properties)
/usr/local/lib/python3.12/site-packages/pyiceberg/table/__init__.py:503: in append
    with append_method() as append_files:
/usr/local/lib/python3.12/site-packages/pyiceberg/table/__init__.py:2094: in __exit__
    self.commit()
/usr/local/lib/python3.12/site-packages/pyiceberg/table/__init__.py:2090: in commit
    self._transaction._apply(*self._commit())
/usr/local/lib/python3.12/site-packages/pyiceberg/table/__init__.py:3218: in _commit
    with write_manifest_list(
/usr/local/lib/python3.12/site-packages/pyiceberg/manifest.py:924: in __enter__
    self._writer.__enter__()
/usr/local/lib/python3.12/site-packages/pyiceberg/avro/file.py:258: in __enter__
    self.output_stream = self.output_file.create(overwrite=True)
/usr/local/lib/python3.12/site-packages/pyiceberg/io/pyarrow.py:307: in create
    output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)
pyarrow/_fs.pyx:887: in pyarrow._fs.FileSystem.open_output_stream
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: When initiating multiple part upload for key 'warehouse/default/integers_a1b6353a-e14c-4173-8835-212067035d29/metadata/snap-7205071987774274359-0-6f46cc66-eb51-4418-bb32-78e4df70cf9d.avro' in bucket 'seed': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: Access Denied.

How to reproduce it

docker-compose.yml:

services:
  postgres:
    container_name: postgres
    image: postgres:16.2-alpine3.19
    command: "
      -c fsync=off
      -c synchronous_commit=off
      -c full_page_writes=off
      -c max_wal_size=4096
      -c checkpoint_timeout=86400
    "
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "pass"
    healthcheck:
      test: [
        "CMD", "pg_isready",
        "--username=user",
        "--dbname=db",
        "--host=127.0.0.1",
        "--port=5432",
      ]
      interval: 4s
      timeout: 4s
      retries: 8
      start_period: 4s

  minio:
    container_name: minio
    image: bitnami/minio:2024.4.6
    environment:
      MINIO_ROOT_USER: "S3_ACCESS_KEY"
      MINIO_ROOT_PASSWORD: "S3_SECRET_KEY"
      MINIO_DEFAULT_BUCKETS: "seed"
    ports:
      - "9000:9000"
      - "9001:9001"
    healthcheck:
      test: [
        "CMD", "mc", "ready", "local",
      ]
      interval: 4s
      timeout: 4s
      retries: 8
      start_period: 4s

  nessie:
    container_name: nessie
    image: ghcr.io/projectnessie/nessie:0.95.0
    depends_on:
      - postgres
      - minio
    ports:
      - "19120:19120"
      - "10000:9000"
    environment:
      - nessie.version.store.type=JDBC
      - nessie.version.store.persist.jdbc.datasource.url=jdbc:postgresql://postgres:5432/db
      - quarkus.datasource.jdbc.url=jdbc:postgresql://postgres:5432/db
      - quarkus.datasource.username=user
      - quarkus.datasource.password=pass
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://seed/warehouse/
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=S3_ACCESS_KEY
      - nessie.catalog.secrets.access-key.secret=S3_SECRET_KEY
      - nessie.catalog.service.s3.default-options.access-key.name=S3_ACCESS_KEY
      - nessie.catalog.service.s3.default-options.access-key.secret=S3_SECRET_KEY
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.external-endpoint=http://minio:9000/

code:

from polars import DataFrame
from pyarrow import Table as ArrowTable
from pyiceberg.catalog import load_catalog
from pyiceberg.table import Table as IcebergTable

def test_iceberg() -> None:
    numbers = tuple(range(100))
    sample = DataFrame(
        data={
            'integer': numbers,
        },
    )
    arrow_table: ArrowTable = sample.to_arrow()
    namespace = 'default'
    identifier = 'default.integers'
    table_properties = {
        'write.format.default': 'parquet',
        'write.delete.format.default': 'parquet',
    }
    catalog = load_catalog()
    catalog.create_namespace_if_not_exists(namespace=namespace)
    table: IcebergTable
    table = catalog.create_table_if_not_exists(
        identifier=identifier,
        schema=arrow_table.schema,
        properties=table_properties,
    )
    table.append(df=arrow_table)

Nessie server type (docker/uber-jar/built from source) and version

ghcr.io/projectnessie/nessie:0.95.0

Client type (Ex: UI/Spark/pynessie ...) and version

pyiceberg = "0.7.1"

Additional information

PYICEBERG_CATALOG__DEFAULT__URI="http://nessie:19120/iceberg/"
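For completeness, the catalog above is configured entirely through environment variables. A fuller sketch of that configuration, including static S3 credentials for PyIceberg's FileIO, might look like the following (the endpoint and key values mirror the MinIO settings from the compose file; PyIceberg maps `PYICEBERG_CATALOG__<name>__<prop>` variables to catalog properties, with double underscores becoming dots and single underscores becoming hyphens):

```shell
# URI of the Nessie Iceberg REST endpoint (from the original report)
export PYICEBERG_CATALOG__DEFAULT__URI="http://nessie:19120/iceberg/"
# Static S3/MinIO credentials for PyIceberg's FileIO. These map to the
# catalog properties s3.endpoint, s3.access-key-id and s3.secret-access-key.
export PYICEBERG_CATALOG__DEFAULT__S3__ENDPOINT="http://minio:9000/"
export PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID="S3_ACCESS_KEY"
export PYICEBERG_CATALOG__DEFAULT__S3__SECRET_ACCESS_KEY="S3_SECRET_KEY"
```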
dimas-b commented 2 weeks ago

@MuslimBeibytuly: It looks like the error comes from PyIceberg code. Could you explain what you think the problem might be on the Nessie side?

Please note that Nessie Servers configure clients to do "remote signing" for S3 access by default. Does PyIceberg support that?
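For reference, if the client cannot do remote signing, one possible workaround is to disable request signing on the Nessie server so that clients fall back to the static credentials. This is only a sketch; the property name below is an assumption and should be verified against the Nessie S3 configuration docs for the server version in use:

```yaml
      # Hypothetical: turn off S3 remote request signing so clients that do
      # not support it (e.g. PyIceberg 0.7.x) use static credentials instead.
      # Verify the exact property name against the Nessie 0.95.0 docs.
      - nessie.catalog.service.s3.default-options.request-signing-enabled=false
```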

MuslimBeibytuly commented 2 weeks ago

My bad, it was a problem on the PyIceberg side.