Closed Nintorac closed 1 year ago
It's not tested in dbt-trino, why would you like to use pyarrow for it? You can just load files (seeds) via dbt-trino.
Sorry, I wasn't clear: the seeds are loaded via dbt-trino and then read with pq.read_table('path/to/seed', filesystem=fs)
This issue has nothing to do with dbt-trino. dbt-trino just publishes seeds to the specified connector (e.g. hive, delta, ...) using Trino. The produced files may not necessarily be parquet files. Maybe that's the issue?
Anyway it would be either a pyarrow issue (not being able to read a parquet file) or a Trino issue (not producing a correct parquet file).
I will close this issue but feel free to continue the conversation if you have any more questions or remarks.
OK. It is probably still worth tracking, since it also makes it impossible to redeploy seeds. Maybe there is a version issue on my end? This seriously hampers using any seeds in the project. Do you see that behavior?
There is no difference between a seed and a table from Trino's perspective.
Seed is only a dbt concept. First an empty table is created and then a prepared INSERT statement is performed.
I would just inspect the files and check if they are parquet files (try to open them up individually).
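One way to carry out the per-file check suggested above is to look for the Parquet magic bytes: a Parquet file starts and ends with the four bytes PAR1. The following is only a minimal sketch; `report_directory` and its `fs` argument are hypothetical, and `fs` is assumed to be an fsspec-style filesystem (for example s3fs) exposing `ls` and `open`.

```python
import io

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(fileobj):
    """Return True if a seekable binary file starts and ends with the Parquet magic bytes."""
    fileobj.seek(0, io.SEEK_END)
    size = fileobj.tell()
    # Smallest possible file: 4-byte header magic + 4-byte footer length + 4-byte footer magic.
    if size < 12:
        return False
    fileobj.seek(0)
    head = fileobj.read(4)
    fileobj.seek(-4, io.SEEK_END)
    tail = fileobj.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def report_directory(fs, path):
    """Print name, size and check result for each file under `path`.

    `fs` is an assumption: any fsspec-style filesystem with `ls` and `open`.
    """
    for entry in fs.ls(path, detail=True):
        if entry.get("type") != "file":
            continue
        with fs.open(entry["name"], "rb") as f:
            print(entry["name"], entry.get("size"), looks_like_parquet(f))
```

A zero-length entry fails the size check and is reported as not being Parquet, which separates a placeholder object from a real data file. Depending on the filesystem, opening a placeholder entry may itself raise an error.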
Hmm, yeah, I have. The folder created for the seed contains two files, one named / and one named <random_string>.parquet. Opening the parquet file by itself succeeds; opening the folder does not.
As you can see from the OP, around >>> fs.listdir("path/to/seeds"), an extra file is created when using seeds that is not present for models (the model listing is shown just after that). So something is different.
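As a possible stopgap on the reading side, the directory listing described above can be filtered down to non-empty .parquet files before handing the paths to pyarrow, so the extra entry is never opened. This is a sketch under assumptions, not a fix for the underlying cause: `select_parquet_files` and `read_seed_table` are hypothetical helpers, `fs` is assumed to be an fsspec-style filesystem whose `ls(path, detail=True)` returns dicts with `name`, `size` and `type`, and the data files are assumed to carry a .parquet suffix as in the seed folder described here.

```python
def select_parquet_files(entries):
    """Keep non-empty *.parquet files from fsspec-style `ls(detail=True)` entries."""
    return sorted(
        entry["name"]
        for entry in entries
        if entry.get("type") == "file"
        and (entry.get("size") or 0) > 0
        and entry["name"].endswith(".parquet")
    )


def read_seed_table(path, fs):
    """Read only the real data files under `path` into a single pyarrow Table."""
    # Imported lazily so the filtering helper above has no pyarrow dependency.
    import pyarrow.parquet as pq

    paths = select_parquet_files(fs.ls(path, detail=True))
    return pq.ParquetDataset(paths, filesystem=fs).read()
```

Calling `read_seed_table('path/to/seed', fs)` would then stand in for `pq.read_table('path/to/seed', filesystem=fs)`.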
Related to https://github.com/trinodb/trino/issues/1053#issuecomment-508392276 -- to 0
file is only created when issuing the create table command.
Setting hive.metastore.thrift.delete-files-on-drop=true in the minio.properties file seems to fix the failure-to-delete error.
E.g., before setting hive.metastore.thrift.delete-files-on-drop:
(lakh_midi) λ midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed
13:55:00 Running with dbt=1.3.1
13:55:00 Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:55:00
13:55:03 Concurrency: 1 threads (target='dev')
13:55:03
13:55:03 1 of 1 START seed file midi_standard.program_information ....................... [RUN]
13:55:04 1 of 1 ERROR loading seed file midi_standard.program_information ............... [ERROR in 0.92s]
13:55:04
13:55:04 Finished running 1 seed in 0 hours 0 minutes and 3.52 seconds (3.52s).
13:55:04
13:55:04 Completed with 1 error and 0 warnings:
13:55:04
13:55:04 Database Error in seed program_information (seeds/midi_standard/program_information.csv)
13:55:04 TrinoExternalError(type=EXTERNAL, name=HIVE_PATH_ALREADY_EXISTS, message="Target directory for table 'midi_standard.program_information' already exists: s3a://midietl/midi_standard/program_information", query_id=20230207_135503_00007_q84c6)
13:55:04
13:55:04 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
and after:
(lakh_midi) λ midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed
13:54:24 Running with dbt=1.3.1
13:54:24 Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:54:24
13:54:24 Concurrency: 1 threads (target='dev')
13:54:24
13:54:24 1 of 1 START seed file midi_standard.program_information ....................... [RUN]
13:54:26 1 of 1 OK loaded seed file midi_standard.program_information ................... [INSERT 128 in 1.26s]
13:54:26
13:54:26 Finished running 1 seed in 0 hours 0 minutes and 1.88 seconds (1.88s).
13:54:26
13:54:26 Completed successfully
13:54:26
13:54:26 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
(lakh_midi) λ midi_etl_dbt git:(main) ✗ dbt --profiles-dir .. seed
13:54:53 Running with dbt=1.3.1
13:54:54 Found 8 models, 37 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 12 sources, 0 exposures, 0 metrics
13:54:54
13:54:55 Encountered an error:
Database Error
TrinoQueryError(type=INTERNAL_ERROR, name=SERVER_STARTING_UP, message="Trino server is still initializing", query_id=20230207_135454_00000_q84c6)
I followed the link to this issue from the related Arrow one. Perhaps someone could determine at which point the zero-length directory placeholder file is being created? While I don't know why pyarrow is confused by it (it's not a problem for fastparquet, for instance), the file is unnecessary.
Expected behavior
I created a project with some seeds and materialized tables. I should then be able to use the pyarrow library to load the tables.
Actual behavior
For the materialised tables there is no issue, and they are loaded as expected. However, for tables that are the result of seeds there is an error, because pyarrow tries to load the directory as a file. When listing the directory you can see two files in it where there should be only one.
Another behavior I have noticed that may be related: rerunning the seed step results in an error because a file cannot be overwritten, though I have not confirmed this hunch.
Steps To Reproduce
Given this docker compose
create a dbt project with a seed file, deploy the seeds and then try to run the following
then observe that there are two files when listing the seed dir
doing the same thing on a materialised table there is only a single entry
Log output/Screenshots
Operating System
Fedora 36 Workstation
dbt version
1.3.1
Trino Server version
392
Python version
Python 3.10.8
Are you willing to submit PR?