kevinjqliu / iceberg-rest-catalog

Pythonic Iceberg REST Catalog
67 stars 12 forks

PyIceberg with Azure Storage Account (500 Internal Server Error) #4

Open C-Zuge opened 4 months ago

C-Zuge commented 4 months ago

While using PyIceberg I ran into some issues/questions that blocked me, mainly an internal server error 500 after executing a simple "create_table" call. Since I'm pretty new to Iceberg, I'm probably missing something. Could anyone help me? I created a namespace and listed it, but as soon as I try to create a table on my Azure storage account I get the same error 500. My credentials are right; I'm using the connection string and pointing the "warehouse" parameter at my storage account, such as: "abfs://@.dfs.core.windows.net/".

[screenshot]

I was looking at the Dockerfile and didn't see anything I should change; the only files using the AWS connection were inside "models" (which I believe is used to generate new code from these models) and inside the "tests" folder. Also, the "vendors" folder is not clear to me: why do you clone pyiceberg into the container, and what is it used for?
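For context, the warehouse value used here follows the abfs filesystem convention, `abfs://<container>@<account>.dfs.core.windows.net/<path>`. A small helper (hypothetical; the concrete values come from the server log later in this thread) shows the expected shape:

```python
from urllib.parse import urlparse

def parse_abfs_warehouse(uri):
    """Split an abfs:// warehouse uri into (container, account_host, path).

    Expected shape (values are placeholders, not a recommendation):
    abfs://<container>@<account>.dfs.core.windows.net/<path>
    """
    parsed = urlparse(uri)
    container, _, host = parsed.netloc.partition("@")
    return container, host, parsed.path.lstrip("/")

# Warehouse value as it appears in the request log below
parts = parse_abfs_warehouse("abfs://landing@sandboxnonprodstorage.dfs.core.windows.net/")
```

If either the container or the account host comes back empty, the warehouse string is malformed before any storage call is even attempted.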

kevinjqliu commented 4 months ago

I posted a partial explanation regarding the client side. https://github.com/apache/iceberg-python/issues/939#issuecomment-2234269294

For running the REST catalog server from this repo, you'd need to configure the server to be able to talk to your storage. For example, if you're trying to use Azure, here are some of the required configs: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#azure-data-lake

Another example, running the REST catalog server with MinIO (S3-compatible API): https://github.com/kevinjqliu/iceberg-rest-catalog/blob/main/examples/sqlite-minio/docker-compose.yml#L11-L17
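As a concrete sketch (not this repo's actual wiring): the Azure settings from the PyIceberg configuration page are just entries in the catalog's property dict, alongside the warehouse location. The key names below follow the `adlfs.*` convention from that page; all values are placeholders, not working credentials:

```python
def azure_catalog_properties(warehouse, account_name, connection_string):
    """Assemble a PyIceberg-style property dict for an Azure Data Lake warehouse.

    Keys follow the adlfs.* names from PyIceberg's configuration docs;
    the values passed in below are placeholders, not real credentials.
    """
    return {
        "warehouse": warehouse,
        "adlfs.account-name": account_name,
        "adlfs.connection-string": connection_string,
    }

props = azure_catalog_properties(
    "abfs://landing@sandboxnonprodstorage.dfs.core.windows.net/",
    "sandboxnonprodstorage",
    "<your-connection-string>",
)
```

A dict like this would have to reach whatever catalog the server constructs internally; setting it only on the client side does not help, because it is the server that writes the metadata files.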

Also, the "vendors" folder is not clear to me: why do you clone pyiceberg into the container, and what is it used for?

While working on this repo, I discovered some bugs in PyIceberg. It was easier to iterate with PyIceberg as a submodule so that I could commit fixes right away. Some of these fixes are already upstreamed (see https://github.com/apache/iceberg-python/issues/864)

kevinjqliu commented 4 months ago

To debug your issue above, look at the server log! An HTTP 500 error usually indicates that the server ran into an error.

C-Zuge commented 4 months ago

Regarding the case below, I filled in (almost) all the fields from the link here, except adlfs.sas_token. For some reason (unknown to me, at least) the error mentions an "AWS Error NETWORK_CONNECTION", even though it should be using the Azure connection. I didn't find this type of configuration inside the Dockerfile or anywhere else except the "tests" and "models" folders. Also, I put some comments in the logs to make clear which operation I did at each step.

Code: [screenshot]

Error from Docker container:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     172.17.0.1:54386 - "GET /v1/config?warehouse=abfs%3A%2F%2Flanding%40sandboxnonprodstorage.dfs.core.windows.net%2F HTTP/1.1" 200 OK    <------- FIRST REQUEST (Just to create the namespace and list the tables)
INFO:     172.17.0.1:54396 - "POST /v1/namespaces HTTP/1.1" 200 OK    <------- NAMESPACE CREATION
INFO:     172.17.0.1:54396 - "GET /v1/namespaces HTTP/1.1" 200 OK      <------- NAMESPACE LIST
INFO:     172.17.0.1:54396 - "GET /v1/namespaces/iceberg_rest/tables HTTP/1.1" 200 OK  <----- TABLE'S LIST (Null as expected)
INFO:     172.17.0.1:32978 - "GET /v1/config?warehouse=abfs%3A%2F%2Flanding%40sandboxnonprodstorage.dfs.core.windows.net%2F HTTP/1.1" 200 OK <------ SECOND REQUEST (List namespaces, tables and create_table itself)
INFO:     172.17.0.1:32988 - "GET /v1/namespaces HTTP/1.1" 200 OK
INFO:     172.17.0.1:32988 - "GET /v1/namespaces/iceberg_rest/tables HTTP/1.1" 200 OK
INFO:     172.17.0.1:32988 - "POST /v1/namespaces/iceberg_rest/tables HTTP/1.1" 500 Internal Server Error  <----CREATE_TABLE FUNCTION
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/iceberg_rest/api/catalog_api.py", line 297, in create_table
    return _create_table(catalog, identifier, create_table_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/iceberg_rest/api/catalog_api.py", line 343, in _create_table
    tbl = catalog.create_table(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/catalog/sql.py", line 208, in create_table
    self._write_metadata(metadata, io, metadata_location)
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/catalog/__init__.py", line 843, in _write_metadata
    ToOutputFile.table_metadata(metadata, io.new_output(metadata_path))
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 130, in table_metadata
    with output_file.create(overwrite=overwrite) as output_stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 304, in create
    if not overwrite and self.exists() is True:
                         ^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 248, in exists
    self._file_info()  # raises FileNotFoundError if it does not exist
    ^^^^^^^^^^^^^^^^^
  File "/home/iceberg/iceberg_rest/.venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 230, in _file_info
    file_info = self._filesystem.get_file_info(self._path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 584, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When getting information for key 'rest/iceberg_rest.db/stations2000/metadata/00000-89d73996-40a2-458f-bdb9-1d1eff86a65b.metadata.json' in bucket 'warehouse': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 7, Couldn't connect to server

kevinjqliu commented 4 months ago

the error mentions an "AWS Error NETWORK_CONNECTION", even though it should be using the Azure connection

The REST server is a wrapper around the underlying catalog. It looks like the catalog config is currently hardcoded to use AWS configs: https://github.com/kevinjqliu/iceberg-rest-catalog/blob/7c5548133ae266d4fac215b063911c35f08461d9/src/iceberg_rest/catalog.py#L14-L24

You'd need to change this to take in Azure configs as well.

You can quickly verify this by passing your Azure configs directly into that dict.
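One way to make that dict configurable instead of hardcoded is to read properties from environment variables. The `CATALOG_` prefix and the `__` / `_` naming convention below are made up for illustration; they are not this repo's actual scheme:

```python
import os

def catalog_properties_from_env(env=None, prefix="CATALOG_"):
    """Turn prefixed environment variables into catalog properties.

    Illustrative mapping (not the repo's real convention):
    CATALOG_ADLFS__CONNECTION_STRING -> adlfs.connection-string
    """
    env = os.environ if env is None else env
    props = {}
    for key, value in env.items():
        if key.startswith(prefix):
            # "__" becomes "." (property namespace), "_" becomes "-"
            name = key[len(prefix):].lower().replace("__", ".").replace("_", "-")
            props[name] = value
    return props

demo = catalog_properties_from_env(
    {"CATALOG_ADLFS__ACCOUNT_NAME": "sandboxnonprodstorage"}
)
```

The resulting dict could then be merged into whatever properties the server passes to its underlying catalog, so Azure (or any other) storage settings never need to live in the code.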

C-Zuge commented 4 months ago

It worked fine after inserting the connection string parameter into this function. But I also saw that SqlCatalog is used rather than RestCatalog, and I was wondering why this choice? I also tried changing it to RestCatalog but got some issues on the server side, shown below; how could I fix this to properly use RestCatalog rather than SqlCatalog? Also, I built the Postgres version but it's trying to use SQLite, why?

Changes: [screenshot]

Error on server side:

raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 'sqlite:////tmp/warehouse/pyiceberg_catalog.db/v1/config'
kevinjqliu commented 4 months ago

But I also saw that SqlCatalog is used rather than RestCatalog, and I was wondering why this choice?

This repo implements the REST catalog server: it accepts HTTP requests and proxies them to the underlying catalog. The server needs to get/set table metadata, and in this case the metadata is ultimately saved by the SqlCatalog. You could change the code to replace SqlCatalog with RestCatalog, which would mean the metadata is ultimately saved in another RestCatalog service.
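To make the layering concrete, here is a sketch of the kind of property dict a SqlCatalog-style backend receives: "uri" points at the SQL store holding catalog metadata, while "warehouse" points at the object store holding the table's data and metadata files. The uri and warehouse values are taken from this thread; the credential entry is a placeholder:

```python
def backing_catalog_config(db_uri, warehouse, extra=None):
    """Build a property dict for the server's backing catalog.

    "uri" selects the SQL store for catalog state; "warehouse" selects
    the object store for table files. "extra" carries storage credentials
    (placeholder below, not a real connection string).
    """
    config = {"uri": db_uri, "warehouse": warehouse}
    config.update(extra or {})
    return config

cfg = backing_catalog_config(
    "sqlite:////tmp/warehouse/pyiceberg_catalog.db",
    "abfs://landing@sandboxnonprodstorage.dfs.core.windows.net/",
    {"adlfs.connection-string": "<connection-string>"},
)
```

Note that the two locations are independent: the error from the RestCatalog attempt above happened because a `sqlite:` uri was handed to a catalog that expected an HTTP endpoint.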

I also tried changing it to RestCatalog but got some issues on the server side, shown below; how could I fix this to properly use RestCatalog rather than SqlCatalog?

Don't change it to RestCatalog unless there's another REST catalog server you can point to.

Also, I built the Postgres version but it's trying to use SQLite, why?

Are you using Docker? The uri controls which data store is ultimately used.
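A quick way to check which backend a SQLAlchemy-style uri actually selects is to look at its scheme. This helper (names are illustrative) strips any driver suffix such as `+psycopg2`:

```python
def db_backend(uri):
    """Return the database backend implied by a SQLAlchemy-style uri.

    e.g. 'sqlite:////tmp/catalog.db'            -> 'sqlite'
         'postgresql+psycopg2://u:p@db:5432/db' -> 'postgresql'
    """
    scheme = uri.split("://", 1)[0]
    # A scheme like "postgresql+psycopg2" names backend + driver;
    # only the part before "+" identifies the backend.
    return scheme.split("+", 1)[0]
```

If this reports "sqlite" for the uri your server is actually using, the Postgres uri from your build never reached the running container, which is the usual culprit when Docker is involved (a stale image or an env var not passed through).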