Open davseitsev opened 1 week ago
cc @raunaqmorarka
@davseitsev could you try benchmarking your case with https://github.com/trinodb/trino/pull/22451 ? There could be further improvements made, but I wanted to see if those changes are sufficient for your scenario.
Sure, I will check it.
Now it's a little better.
Compared to the previous results, I would say it's about 1.35x faster.
The flame graph looks similar:
That still looks strange, because I removed the usage of `ColumnPath#get` from `getColumnChunkMetaData`, but it's somehow still showing up in your flame graph.
I will check it
Ok, it's my bad, I didn't deploy the change properly.
Okay, so we have improved significantly with the new changes, but this still needs further work. Would you be able to share a sample file, table definition, and test query with me so that I can assess potential optimizations more easily?
@davseitsev I've updated my PR with some more changes, which I think should help some more. Can you try out the updated PR as well?
Sure, I will test it and post the results today. I'm also trying to write a unit test with our wide schema which can be used to evaluate the changes.
Here is the result with the latest changes:
Thanks @davseitsev
We've essentially removed the bottleneck from the `getDescriptors` call, but `getColumnChunkMetaData` still needs improvement. I will look for ways to address that in a different PR.
Here is a test you can use to understand the use case and profile it locally: TestIcebergWideSchemaRead.java.txt. And here is the table schema used in the test to create the test table and generate test data: create_table_no_formatting.txt
The test flame graph looks similar to the production one:
If you have new changes, I can test them on real data to compare total CPU time to the previous results.
BTW, I just thought: what about building a HashMap index here: https://github.com/trinodb/trino/blob/ad97087b3939818e33511712b48b05419f4463cb/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L180-L184 for all the columns in the block? Something like:

```java
Map<ArrayWrapper, ColumnChunkMetaData> blocks = metadata.getColumns().stream()
        .collect(Collectors.toMap(
                column -> new ArrayWrapper(column.getPath().toArray()),
                column -> column));
```

Then we could obtain the necessary `ColumnChunkMetaData` from the map to avoid iterating over all the columns again and again. Could it cause any side effects?
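For reference, a minimal sketch of what such a map key could look like. The `ArrayWrapper` name comes from the snippet above; this particular implementation is my assumption, not existing Trino code. The point is that a plain `String[]` uses identity-based `equals`/`hashCode` and therefore cannot serve as a `HashMap` key, so the wrapper supplies value-based semantics via `Arrays.equals`/`Arrays.hashCode`:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ArrayWrapperSketch
{
    // Hypothetical ArrayWrapper: wraps a String[] column path and gives it
    // value-based equals/hashCode so it can be used as a HashMap key.
    static final class ArrayWrapper
    {
        private final String[] path;
        private final int hash;

        ArrayWrapper(String[] path)
        {
            this.path = path;
            this.hash = Arrays.hashCode(path); // precompute, keys are immutable here
        }

        @Override
        public boolean equals(Object o)
        {
            return o instanceof ArrayWrapper && Arrays.equals(path, ((ArrayWrapper) o).path);
        }

        @Override
        public int hashCode()
        {
            return hash;
        }
    }

    public static void main(String[] args)
    {
        // Build the index once per row group; each subsequent lookup costs
        // O(path length) instead of a scan over all ~45K columns.
        Map<ArrayWrapper, String> index = new HashMap<>();
        index.put(new ArrayWrapper(new String[] {"outer", "inner", "leaf"}), "chunk-metadata");

        String found = index.get(new ArrayWrapper(new String[] {"outer", "inner", "leaf"}));
        System.out.println(found); // prints "chunk-metadata"
    }
}
```

One design note: precomputing the hash in the constructor trades a few bytes per key for avoiding repeated `Arrays.hashCode` calls when the same key is probed multiple times.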
@davseitsev PTAL at https://github.com/trinodb/trino/pull/22538 . I think this should resolve the remaining bottlenecks as well. Please try it out and let me know.
We have a use case where we store big structures in Iceberg tables. Recently we found a case where a structure has 45K+ nested fields in total. Simple select queries fail with `Exceeded CPU limit of 48.00h`. For example, `SELECT count(*)` with the same filters returns 0 records, which means data filtering consumes all the CPU.

We ran a CPU profiler on one of the workers and it showed that all the CPU is consumed by the `IcebergPageSourceProvider.createParquetPageSource` method. There are 2 loops which take all the CPU time. In the first one we try to find the `ColumnChunkMetaData` by iterating through all the columns: https://github.com/trinodb/trino/blob/ad97087b3939818e33511712b48b05419f4463cb/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L496-L500
In the second one we iterate over all the columns to find the column index: https://github.com/trinodb/trino/blob/ad97087b3939818e33511712b48b05419f4463cb/lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java#L134-L151
As we do this for every primitive field, in total we call `equals`/`equalsIgnoreCase` more than 2B times. And it happens at least once for each file (there are 1400 files in the table).
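A rough back-of-the-envelope sketch of where the 2B figure comes from, assuming each of the ~45K primitive fields triggers a worst-case linear scan over all ~45K columns (the 45K and 1400 numbers are from the report above; the worst-case assumption is mine):

```java
public class LookupCostEstimate
{
    public static void main(String[] args)
    {
        long fields = 45_000L;  // primitive fields in the reported schema
        long columns = 45_000L; // columns scanned linearly per lookup, worst case

        long perFile = fields * columns;
        System.out.println(perFile); // prints 2025000000, i.e. over 2B comparisons per file

        long files = 1_400L;    // files in the table
        System.out.println(perFile * files); // prints 2835000000000 across the table
    }
}
```

This is exactly the quadratic blow-up that a one-time `HashMap` index would remove: building the map is O(columns), and each lookup drops from O(columns) to roughly O(1).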
We are discussing changing the schema of this table to use JSON. But it may still be worth improving this even for smaller schemas.
Trino version: 444