starburstdata / dbt-trino

The Trino (https://trino.io/) adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

Using seeds produces empty files in the seed dir #226

Closed Nintorac closed 1 year ago

Nintorac commented 1 year ago

Expected behavior

I created a project with some seeds and materialized tables; I can then use the pyarrow library to load the resulting tables.

Actual behavior

The materialized tables load without issue. However, tables produced by seeds fail with an error, because pyarrow tries to load the directory as a file. Listing the directory shows two files where there should only be one.

A possibly related behavior I have noted: rerunning the seed command results in an error because a file cannot be overwritten, though I have not confirmed this hunch.

Steps To Reproduce

Given this docker compose

version: '3.7'
services:

  trino-coordinator:
    image: 'trinodb/trino:392'
    hostname: trino-coordinator
    user: trino:trino
    ports:
      - '8080:8080'
    volumes:
      - ./infra/trino/etc:/etc/trino
      - ./trino-cli-392-executable.jar:/trino
    tmpfs:
      - /spill:rw
    networks:
      - midi-etl
    restart: on-failure
    deploy:
      replicas: 1
      resources:
        limits:
          memory: 15G

  mariadb:
    image: 'mariadb:latest'
    hostname: mariadb
    ports:
      - '3306:3306'
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: metastore_db
    volumes:
      - mariadb-data:/var/lib/mysql/
    networks:
      - midi-etl

  hive-metastore:
    image: 'bitsondatadev/hive-metastore:latest'
    hostname: hive-metastore
    ports:
      - '9083:9083' # Metastore Thrift
    volumes:
      - ./infra/hive/conf/metastore-site.xml:/opt/apache-hive-metastore-3.0.0-bin/conf/metastore-site.xml:ro
    environment:
      METASTORE_DB_HOSTNAME: mariadb
    depends_on:
      - mariadb
    networks:
      - midi-etl

  minio:
    image: 'minio/minio:latest'
    hostname: minio
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - minio-reddit-data:/data
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
    command: server --console-address ":9001" /data
    networks:
      - midi-etl

volumes:
  minio-reddit-data:
    driver: local
  mariadb-data:
    driver: local
  spill:
    driver: local

networks:
  midi-etl:
    driver: bridge

create a dbt project with a seed file, deploy the seeds, and then try to run the following

from pyarrow import parquet as pq
import pyarrow.dataset as ds
import s3fs

endpoint_url = 'http://localhost:9000'
aws_access_key_id = 'minio'
aws_secret_access_key = 'minio123'
fs = s3fs.S3FileSystem(
    anon=False,
    use_ssl=False,
    client_kwargs={
        "region_name": "us-east-1",
        "endpoint_url": endpoint_url,
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
        "verify": False,
    },
)
pq.read_table('path/to/seeds', filesystem=fs)

then observe that there are two entries when listing the seed directory

>>> fs.listdir("path/to/seeds")
[{'Key': 'path/to/seeds/',
  'LastModified': datetime.datetime(2023, 2, 5, 7, 43, 49, 822000, tzinfo=tzutc()),
  'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
  'Size': 0,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'minio',
   'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'},
  'type': 'file',
  'size': 0,
  'name': 'path/to/seeds/'},
 {'Key': 'path/to/seeds/20230205_074350_00008_9aq5u_01fdb7ec-57c5-4b22-88e5-c84ee87b83a9',
  'LastModified': datetime.datetime(2023, 2, 5, 7, 43, 51, 756000, tzinfo=tzutc()),
  'ETag': '"855ed46e79733e1293349ec2ca583a5b"',
  'Size': 1773,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'minio',
   'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'},
  'type': 'file',
  'size': 1773,
  'name': 'path/to/seeds/20230205_074350_00008_9aq5u_01fdb7ec-57c5-4b22-88e5-c84ee87b83a9'}]

doing the same thing on a materialized table, there is only a single entry

>>> fs.listdir("path/to/some_model")
[{'Key': 'path/to/some_model/20230205_074350_00008_9aq5u_01fdb7ec-57c5-4b22-88e5-c84ee87b83a9',
  'LastModified': datetime.datetime(2023, 2, 5, 7, 43, 51, 756000, tzinfo=tzutc()),
  'ETag': '"855ed46e79733e1293349ec2ca583a5b"',
  'Size': 1773,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'minio',
   'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'},
  'type': 'file',
  'size': 1773,
  'name': 'path/to/some_model/20230205_074350_00008_9aq5u_01fdb7ec-57c5-4b22-88e5-c84ee87b83a9'}]

Log output/Screenshots

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/home/athena/nintorac/lakh_dataset/midi_etl_notebooks/basic_analysis.py in line 2
     51 #%%
---> 52 program_information = ds.dataset(
     53     'midietl/midi_standard/program_information',
     54     filesystem=fs
     55 ).to_table()
     56 conn.query("select * from program_information limit 10").to_df()

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/dataset.py:762, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    751 kwargs = dict(
    752     schema=schema,
    753     filesystem=filesystem,
   (...)
    758     selector_ignore_prefixes=ignore_prefixes
    759 )
    761 if _is_path_like(source):
--> 762     return _filesystem_dataset(source, **kwargs)
    763 elif isinstance(source, (tuple, list)):
    764     if all(_is_path_like(elem) for elem in source):

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/dataset.py:455, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    447 options = FileSystemFactoryOptions(
    448     partitioning=partitioning,
    449     partition_base_dir=partition_base_dir,
    450     exclude_invalid_files=exclude_invalid_files,
    451     selector_ignore_prefixes=selector_ignore_prefixes
    452 )
    453 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 455 return factory.finish(schema)

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2062, in pyarrow._dataset.DatasetFactory.finish()

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File ~/miniconda3/envs/lakh_midi/lib/python3.10/site-packages/pyarrow/fs.py:422, in FSSpecHandler.open_input_file(self, path)
    419 from pyarrow import PythonFile
    421 if not self.fs.isfile(path):
--> 422     raise FileNotFoundError(path)
    424 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: midietl/midi_standard/program_information/

Operating System

Fedora 36 Workstation

dbt version

1.3.1

Trino Server version

392

Python version

Python 3.10.8

Are you willing to submit PR?

hovaesco commented 1 year ago

I then can use the pyarrow library to load the tables.

This isn't tested in dbt-trino; why would you like to use pyarrow for it? You can just load files (seeds) via dbt-trino.

Nintorac commented 1 year ago

Sorry, that wasn't clear: the seeds are loaded via dbt-trino, then read with pq.read_table('path/to/seed', filesystem=fs)

mdesmet commented 1 year ago

This issue has nothing to do with dbt-trino. dbt-trino just publishes seeds to the specified connector (e.g. hive, delta, ...) using Trino. The produced files may not necessarily be parquet files. Maybe that's the issue?

Anyway it would be either a pyarrow issue (not being able to read a parquet file) or a Trino issue (not producing a correct parquet file).

I will close this issue but feel free to continue the conversation if you have any more questions or remarks.

Nintorac commented 1 year ago

OK, it's probably still worth tracking, since it also results in the inability to redeploy seeds. Maybe there is some version issue on my end? This is quite hampering for any project with seeds. Do you see this behavior?

mdesmet commented 1 year ago

There is no difference between a seed and a table from Trino's perspective.

A seed is only a dbt concept: first an empty table is created, and then a prepared INSERT statement is executed.

I would just inspect the files and check if they are parquet files (try to open them up individually).
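That two-step load can be sketched as follows. This is purely illustrative: the function name, column typing, and exact SQL shapes are mine, not dbt-trino's (the real statements are generated by dbt's materialization macros, with types inferred from the CSV):

```python
import csv
import io

def seed_statements(table, csv_text, column_types):
    """Illustrative sketch of a two-step seed load: a CREATE TABLE for an
    empty table, then one prepared INSERT shape for the CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f"{name} {typ}" for name, typ in zip(header, column_types))
    create = f"CREATE TABLE {table} ({columns})"
    placeholders = ", ".join(["?"] * len(header))
    insert = f"INSERT INTO {table} VALUES ({placeholders})"
    return create, insert, data

# e.g. for a two-column seed CSV:
create, insert, data = seed_statements(
    "midi_standard.program_information",
    "program,name\n0,Acoustic Grand Piano",
    ["INTEGER", "VARCHAR"],
)
```

The point is that the CREATE TABLE runs on its own before any data arrives, which is exactly when the connector has an empty table location on S3.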

Nintorac commented 1 year ago

Hmm, yeah, I have: the folder created for the seed contains two files, one named / and one named <random_string>.parquet. Opening the parquet file by itself succeeds; opening the folder does not.

As you can see in the OP around >>> fs.listdir("path/to/seeds"), an extra file is created when using seeds that is not present for models (shown just after it). So something is different.
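In the meantime, a reader-side workaround is to drop the zero-byte directory-marker key before handing paths to pyarrow. This is a sketch assuming an s3fs-style listing like the one above; data_file_keys is a helper name made up for illustration:

```python
def data_file_keys(entries):
    """Keep only real data files from an s3fs.listdir()-style listing,
    dropping zero-byte directory-marker keys such as 'path/to/seeds/'."""
    return [e["name"] for e in entries
            if e["size"] > 0 and not e["name"].endswith("/")]

# Usage with the repro's filesystem (fs); the surviving paths can be fed
# to pyarrow directly, e.g.:
#   import pyarrow.dataset as ds
#   paths = data_file_keys(fs.listdir("path/to/seeds"))
#   table = ds.dataset(paths, format="parquet", filesystem=fs).to_table()
```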

Related to https://github.com/trinodb/trino/issues/1053#issuecomment-508392276 -- the 0-byte file is only created when issuing the CREATE TABLE command.

Setting hive.metastore.thrift.delete-files-on-drop=true in the minio.properties catalog file seems to fix the failure-to-delete error.
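For reference, a sketch of what that catalog file might look like (the connector name and metastore URI are assumptions based on the compose file above; only the last line is the actual fix):

```properties
# infra/trino/etc/catalog/minio.properties (illustrative)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore.thrift.delete-files-on-drop=true
```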

For example, before setting hive.metastore.thrift.delete-files-on-drop:

(lakh_midi) λ  midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed

13:55:00  Running with dbt=1.3.1
13:55:00  Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:55:00  
13:55:03  Concurrency: 1 threads (target='dev')
13:55:03  
13:55:03  1 of 1 START seed file midi_standard.program_information ....................... [RUN]
13:55:04  1 of 1 ERROR loading seed file midi_standard.program_information ............... [ERROR in 0.92s]
13:55:04  
13:55:04  Finished running 1 seed in 0 hours 0 minutes and 3.52 seconds (3.52s).
13:55:04  
13:55:04  Completed with 1 error and 0 warnings:
13:55:04  
13:55:04  Database Error in seed program_information (seeds/midi_standard/program_information.csv)
13:55:04    TrinoExternalError(type=EXTERNAL, name=HIVE_PATH_ALREADY_EXISTS, message="Target directory for table 'midi_standard.program_information' already exists: s3a://midietl/midi_standard/program_information", query_id=20230207_135503_00007_q84c6)
13:55:04  
13:55:04  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

and after:

(lakh_midi) λ  midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed

13:54:24  Running with dbt=1.3.1
13:54:24  Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:54:24  
13:54:24  Concurrency: 1 threads (target='dev')
13:54:24  
13:54:24  1 of 1 START seed file midi_standard.program_information ....................... [RUN]
13:54:26  1 of 1 OK loaded seed file midi_standard.program_information ................... [INSERT 128 in 1.26s]
13:54:26  
13:54:26  Finished running 1 seed in 0 hours 0 minutes and 1.88 seconds (1.88s).
13:54:26  
13:54:26  Completed successfully
13:54:26  
13:54:26  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
(lakh_midi) λ  midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed

13:54:53  Running with dbt=1.3.1
13:54:54  Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:54:54  
13:54:55  Encountered an error:
Database Error
  TrinoQueryError(type=INTERNAL_ERROR, name=SERVER_STARTING_UP, message="Trino server is still initializing", query_id=20230207_135454_00000_q84c6)
martindurant commented 4 months ago

I came to this issue from the linked arrow one. Perhaps someone could determine at which point the zero-length directory placeholder file is being created? While I don't know why pyarrow is confused by it (it's not a problem for fastparquet, for instance), it is unnecessary.