strange error with splitfile

mingfang commented 2 years ago

I completed this demo https://www.splitgraph.com/docs/getting-started/decentralized-demo and am trying out different splitfiles.

My first attempt is something simple. In an empty splitfile, I added this line and it works:

FROM demo/weather IMPORT rdu AS source_data

But when I replace that line with something I thought was equivalent, it doesn't work:

# does not work
FROM demo/weather IMPORT {SELECT * FROM rdu} AS source_data

I'm getting this error:

>sgr -v DEBUG build rdu-weather-summary.splitfile 
Executing Splitfile rdu-weather-summary.splitfile with arguments {}

Step 1/1 : FROM demo/weather IMPORT {SELECT * FROM rdu} AS source_data
Resolving repository demo/weather
Gathering remote metadata...
Fetched metadata for 2 images, 1 table, 0 objects and 1 tag.
Importing 1 table from demo/weather:b2019b4321c1 into rdu-weather-summary
debug: Mounting demo/weather_tmp_clone:b2019b4321c116c277ba966435ff1c2ed8f5c037ae0650b6abf3efec7a39984c/rdu into o5e43720645ede717012cd3a3b67bfa4f
error: Traceback (most recent call last):
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 47, in _checkout_or_calculate_layer
error:     output.images.by_hash(image_hash).checkout()
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/core/image_manager.py", line 119, in by_hash
error:     raise ImageNotFoundError("No images starting with %s found!" % image_hash)
error: splitgraph.exceptions.ImageNotFoundError: No images starting with 91a5102a961e4a3c64f0cbdec5099d054aaa07b74ee81829fd0517f688a9be9c found!
error: 
error: During handling of the above exception, another exception occurred:
error: 
error: Traceback (most recent call last):
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/commandline/__init__.py", line 114, in invoke
error:     result = super(click.Group, self).invoke(ctx)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
error:     return _process_result(sub_ctx.command.invoke(sub_ctx))
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
error:     return ctx.invoke(self.callback, **ctx.params)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/click/core.py", line 610, in invoke
error:     return callback(*args, **kwargs)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/commandline/splitfile.py", line 57, in build_c
error:     execute_commands(splitfile.read(), args, output=output_repository)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 208, in execute_commands
error:     provenance_line = _execute_import(node, output)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 336, in _execute_import
error:     return _execute_repo_import(
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 434, in _execute_repo_import
error:     _checkout_or_calculate_layer(target_repository, target_hash, _calc)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 51, in _checkout_or_calculate_layer
error:     calc_func()
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/splitfile/execution.py", line 424, in _calc
error:     target_repository.import_tables(
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/core/common.py", line 139, in wrapped
error:     return func(self, *args, **kwargs)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/core/repository.py", line 782, in import_tables
error:     return self._import_tables(
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/core/repository.py", line 831, in _import_tables
error:     self._import_new_table(
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/core/repository.py", line 898, in _import_new_table
error:     self.object_engine.run_sql_in(
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/engine/__init__.py", line 157, in run_sql_in
error:     result = self.run_sql(sql, arguments, return_shape=return_shape)
error:   File "/home/projector-user/.pyenv/versions/3.9.5/lib/python3.9/site-packages/splitgraph/engine/postgres/engine.py", line 516, in run_sql
error:     cur.execute(statement, arguments)
error: psycopg2.errors.InternalError_: Error in python: KeyError
error: DETAIL:  'splitgraph'
error:

mildbyte commented 2 years ago

Looks like that error is coming from the engine -- do you have the logs from it too (sgr engine log or the Docker container logs if you're running without the sgr engine wrapper)?

mingfang commented 2 years ago

I'm running Splitgraph inside Kubernetes using this image: splitgraph/engine:0.2.15-postgis

Here is the log

 splitgraph-0:PostgreSQL Database directory appears to contain a database; Skipping initialization                                                                                                          
 splitgraph-0:2021-08-11 21:05:56.164 GMT [1] LOG:  starting PostgreSQL 12.7 (Debian 12.7-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit                               
 splitgraph-0:2021-08-11 21:05:56.165 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432                                                                                                          
 splitgraph-0:2021-08-11 21:05:56.165 GMT [1] LOG:  listening on IPv6 address "::", port 5432                                                                                                               
 splitgraph-0:2021-08-11 21:05:56.171 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"                                                                                            
 splitgraph-0:2021-08-11 21:05:56.200 GMT [26] LOG:  database system was shut down at 2021-08-11 21:05:43 GMT                                                                                               
 splitgraph-0:2021-08-11 21:05:56.216 GMT [1] LOG:  database system is ready to accept connections                                                                                                          
 splitgraph-0:2021-08-11 21:07:08.814 GMT [56] ERROR:  Error in python: KeyError                                                                                                                            
 splitgraph-0:2021-08-11 21:07:08.814 GMT [56] DETAIL:  Traceback (most recent call last):                                                                                                                  
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:      File "/splitgraph/splitgraph/core/fdw_checkout.py", line 58, in __init__                                                                                                                
 splitgraph-0:        self._initialize_engines()                                                                                                                                                            
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:      File "/splitgraph/splitgraph/core/fdw_checkout.py", line 182, in _initialize_engines                                                                                                    
 splitgraph-0:        use_fdw_params=True,                                                                                                                                                                  
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:      File "/splitgraph/splitgraph/engine/__init__.py", line 665, in get_engine                                                                                                               
 splitgraph-0:        conn_params = cast(Dict[str, Optional[str]], _prepare_engine_config(CONFIG, name))                                                                                                    
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:      File "/splitgraph/splitgraph/engine/__init__.py", line 57, in _prepare_engine_config                                                                                                    
 splitgraph-0:        config_dict if name == "LOCAL" else get_all_in_section(config_dict, "remotes")[name],                                                                                                 
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:    KeyError: 'splitgraph'                                                                                                                                                                    
 splitgraph-0:                                                                                                                                                                                              
 splitgraph-0:2021-08-11 21:07:08.814 GMT [56] STATEMENT:  SET enable_sort=off; SET enable_hashagg=on;CREATE TABLE "splitgraph_meta"."sg_tmp_48a5834071346fb90838d6f08bbb9531" AS SELECT * FROM rdu
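
The last frame of the engine traceback looks the engine name up among the configured remotes. A minimal Python sketch of how that lookup can fail, assuming a simplified config dictionary (`prepare_engine_config`, `lookup_error` and both dictionaries are illustrative stand-ins, not Splitgraph's actual code or config):

```python
# Simplified stand-in for the lookup shown in the traceback; not Splitgraph's real code.
def prepare_engine_config(config: dict, name: str) -> dict:
    """Return connection parameters for the engine called `name`."""
    if name == "LOCAL":
        return config
    # Raises KeyError when `name` is not among the configured remotes.
    return config.get("remotes", {})[name]


def lookup_error(config: dict, name: str):
    """Return the repr of the KeyError raised by the lookup, or None on success."""
    try:
        prepare_engine_config(config, name)
        return None
    except KeyError as exc:
        return repr(exc)


# Hypothetical config as seen by the sgr client: it defines a "splitgraph" remote.
client_config = {
    "SG_ENGINE_PORT": "6432",
    "remotes": {"splitgraph": {"SG_ENGINE_HOST": "splitgraph.splitgraph"}},
}
# Hypothetical config as seen inside the engine container: no remotes at all.
engine_config = {"SG_ENGINE_PORT": "5432"}
```

Under these assumptions the same engine name resolves on the client side but raises `KeyError: 'splitgraph'` wherever the config has no entry for that remote.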

mildbyte commented 2 years ago

If you're running it inside of Kubernetes, did you make sure to bind mount / copy the .sgconfig file into the engine container as well (https://www.splitgraph.com/docs/configuration/introduction#in-engine-configuration)? The logging level error makes me think it's using the default empty value and hasn't found a config file.

mingfang commented 2 years ago

I agree about the logging level. I fixed that. But I would recommend using a default log level instead.

mingfang commented 2 years ago

hmm, bind mounting .sgconfig doesn't make any sense. Keep in mind everything in this demo https://www.splitgraph.com/docs/getting-started/decentralized-demo works with my setup.

It's only a problem when I change the splitfile like this: FROM demo/weather IMPORT {SELECT * FROM rdu} AS source_data

mingfang commented 2 years ago

My Splitgraph client, sgr, has this .sgconfig

[defaults]
SG_ENGINE_PORT=6432
SG_ENGINE_PWD=splitgraph
SG_ENGINE_ADMIN_USER=sgr
SG_ENGINE_ADMIN_PWD=splitgraph
SG_UPDATE_LAST=1628646162

[remote: splitgraph]
SG_ENGINE_ADMIN_USER=splitgraph
SG_ENGINE_ADMIN_PWD=splitgraph
SG_ENGINE_POSTGRES_DB_NAME=splitgraph
SG_ENGINE_HOST=splitgraph.splitgraph
SG_ENGINE_PORT=5432
SG_ENGINE_USER=splitgraph
SG_ENGINE_PWD=splitgraph
SG_ENGINE_DB_NAME=splitgraph

I set my env with this

export SG_ENGINE=splitgraph

Looking at the engine error, it looks like it's trying to read its config and looking for the splitgraph remote. Why would the engine need to do that? And why would it only do that when I modify the demo splitfile?

mingfang commented 2 years ago

I was able to get it to work using the LOCAL engine; basically, I cleared the SG_ENGINE environment variable. Does this imply that (some) Splitfiles can only work with the LOCAL engine and not with remote engines?

I'm guessing the problem is that SG_ENGINE on the client side must match SG_ENGINE on the engine side. This is certainly going to be untrue for large deployments. My use case is to have a central Splitgraph instance running inside Kubernetes, and each client (sgr and python) will connect to it as a remote.

mildbyte commented 2 years ago

You should definitely be able to run Splitfiles against one engine (client) using the data on a different remote (in your case, splitgraph) engine.

The issue is in the configuration: both the sgr client and the "client" engine need to know where to download the objects from, so both need to know how to connect to the remote splitgraph engine. That's why we put the same .sgconfig into both sgr and the engine itself.
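
One way to check whether a given .sgconfig defines the remote is to parse it with Python's standard `configparser`. This is only a sketch, assuming the file uses the `[remote: <name>]` section layout shown earlier in this thread; `configured_remotes` and `SAMPLE` are hypothetical names, not part of sgr:

```python
import configparser


def configured_remotes(sgconfig_text: str) -> list:
    """List the remote names defined by `[remote: <name>]` sections."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep SG_* keys upper-case instead of lower-casing them
    parser.read_string(sgconfig_text)
    prefix = "remote:"
    return [
        section[len(prefix):].strip()
        for section in parser.sections()
        if section.startswith(prefix)
    ]


# Abbreviated sample modelled on the client config posted above.
SAMPLE = """\
[defaults]
SG_ENGINE_PORT=6432

[remote: splitgraph]
SG_ENGINE_HOST=splitgraph.splitgraph
SG_ENGINE_PORT=5432
"""
```

Running the same check against the config text available to sgr and the one available inside the engine container would show whether both sides list the `splitgraph` remote.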

I think in the first case, the Splitfile executor (running in the Python sgr process) just gets enough metadata from the remote engine to move the pointers to make an image with a table that has the same contents in the new image. In the second case, it uses the local engine to create a staging table with the data, so the local engine tries to download the table fragments from the remote splitgraph engine and fails since it doesn't know how to connect to it.
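
The difference between the two cases can be illustrated with a toy model in Python. Everything here is hypothetical (class, function and object names are made up for illustration and are not Splitgraph internals); it only mirrors the reasoning above: a plain table import copies pointers, while a query import makes the engine fetch the objects first.

```python
# Toy model only: names and structures are hypothetical, not Splitgraph internals.
class ToyEngine:
    def __init__(self, known_remotes):
        self.known_remotes = set(known_remotes)
        self.downloaded = []

    def fetch_objects(self, remote, object_ids):
        # The engine needs connection details for `remote` to download objects.
        if remote not in self.known_remotes:
            raise KeyError(remote)
        self.downloaded.extend(object_ids)


def import_table(image, remote_meta, table):
    """Plain table import: copy the table -> object pointers, no data is read."""
    image[table] = list(remote_meta[table])


def import_query(image, engine, remote, remote_meta, table, new_name):
    """Query import: the engine must read the data to materialize the result."""
    engine.fetch_objects(remote, remote_meta[table])
    image[new_name] = ["o_new_" + new_name]  # placeholder id for the new object


remote_meta = {"rdu": ["o1", "o2"]}
engine = ToyEngine(known_remotes=[])  # engine without the remote configured

image = {}
import_table(image, remote_meta, "rdu")  # works: metadata only
try:
    import_query(image, engine, "splitgraph", remote_meta, "rdu", "source_data")
    query_import_failed = False
except KeyError:
    query_import_failed = True  # the engine does not know the remote
```

In this model the first import succeeds without the engine knowing the remote, and the second fails with a `KeyError` until the engine is given the remote's connection details.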

mingfang commented 2 years ago

I created a self-contained repo to demonstrate this problem here: https://github.com/mingfang/splitfile-remote.git

mildbyte commented 2 years ago

Looks like you only have one engine in that repository? The configuration in your use case is two engines:

The "local" / "client" engine (similar to a local cache or a Snowflake warehouse worker) -- this is managed by sgr and runs queries against local dataset fragments (objects) that it pulls on demand from the remote. That is to say, sgr can't work standalone; it needs a sidecar engine to go with it.
The "remote" / "registry" engine -- this stores metadata (linking fragments to tables, datasets, etc.) and the actual objects. The first engine (and any other engines) pull objects from it on demand to satisfy queries and build Splitfiles. Another option is actually storing these objects in object storage (https://github.com/splitgraph/splitgraph/blob/master/examples/push-to-object-storage/docker-compose.yml).

mingfang commented 2 years ago

@mildbyte Thanks for the explanation. I was trying to get away with using just one engine (the remote engine), because I didn't want members of my data team to have to learn Docker. But if a local engine is a requirement, then that's the way I will go. Thanks again for your help.

strange error with splitfile #518