voltrondata-labs / benchmarks

Language-independent Continuous Benchmarking (CB) for Apache Arrow
MIT License

Fix file size records and schema for all-null cols #116

Closed alistaire47 closed 2 years ago

alistaire47 commented 2 years ago

A PR to [hopefully!] fix the buildkite builds. To assemble this, I installed the nightly version of pyarrow, cleared the data cache temp dir (important!) and then reran everything to see what sizes things came out as.

Beyond file sizes, this makes a couple other small tweaks:

It's hard to say why exactly file sizes changed (or didn't) here—there are a lot of factors, some of which (like types) are changed here, some (like default row group sizing and parquet version) are part of Arrow. For now, I'm just trying to make things run, but longer-term we should probably get rid of the file size mess and check the resulting data dimensions and schema instead.
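The longer-term idea of checking dimensions and schema instead of file size could look roughly like the sketch below. It is an illustration, not code from this PR: `ExpectedShape` and `check_shape` are hypothetical names, and the table is only assumed to expose `num_rows` plus a `schema` with `names` and `types`, as a `pyarrow.Table` does.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ExpectedShape:
    """Hypothetical record of what a generated dataset should look like."""

    num_rows: int
    columns: Dict[str, str]  # column name -> type rendered as a string


def check_shape(table: Any, expected: ExpectedShape) -> List[str]:
    """Return a list of mismatches; an empty list means the table matches."""
    problems = []
    if table.num_rows != expected.num_rows:
        problems.append(f"rows: expected {expected.num_rows}, got {table.num_rows}")

    # Render each type as a string so the comparison does not depend on
    # the identity of type objects.
    actual = {
        name: str(typ)
        for name, typ in zip(table.schema.names, table.schema.types)
    }
    for name, typ in expected.columns.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != typ:
            problems.append(f"{name}: expected {typ}, got {actual[name]}")
    for name in actual:
        if name not in expected.columns:
            problems.append(f"unexpected column: {name}")
    return problems
```

A check like this would flag an all-null column whose type was inferred as `null` rather than the intended type, which a file-size comparison can only hint at indirectly.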

@joosthooz @austin3dickey Some of this may be relevant for datalogistik?

@boshek When this merges, hopefully the Arrow report should populate properly

@ElenaHenderson Is there a way to run this branch on the buildkite machines before merging? I've tested thoroughly on my machine, but those machines may differ in other ways. We'll also need to rm the data temp dir again.

ElenaHenderson commented 2 years ago

@alistaire47

There is nothing special about the ursa-i9-9960x machine where the benchmarks run. We run the Python and R benchmarks on other machines too, and they work without any machine-specific modification.

I would just run all Python and R benchmarks listed in benchmarks.json locally with iterations=1, using a command like conbench csv-read...
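A minimal sketch of scripting that local run follows. It only builds and prints the command lines rather than executing them. The `--iterations=1` spelling is an assumption based on the "iterations=1" wording above, `build_commands` is a hypothetical helper, and some benchmarks may need extra arguments (for example a data source), which can be passed through `extra_args`.

```python
import shlex
from typing import Iterable, List, Sequence


def build_commands(
    names: Iterable[str], extra_args: Sequence[str] = ()
) -> List[List[str]]:
    """Build one conbench command per benchmark name, one iteration each."""
    # Assumption: each benchmark is a conbench subcommand that accepts
    # an --iterations option.
    return [["conbench", name, "--iterations=1", *extra_args] for name in names]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in build_commands(["csv-read", "file-read", "file-write"]):
        print(shlex.join(cmd))
```

Printing first makes it easy to confirm the commands against `conbench --help` before running anything.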

You can't test benchmarks on the actual ursa-i9-9960x benchmark machines, but you can test locally using an Ubuntu 20.2 Docker image if you must, though I really doubt you need to do this: https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/README.md#how-can-i-test-benchmark-builds-that-run-on-ursa-i9-9960x-and-ursa-thinkcentre-m75q-locally

Before you run benchmarks locally using the above instructions, you will need to update the benchmarks repo and branch that arrow-bci uses locally:

https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L29
https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L31

You should also update the filters to run only the Python and R benchmarks: https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L212


{
    "langs": {
        "Python": {
            "names": [
                "csv-read",
                "dataframe-to-table",
                "dataset-filter",
                "dataset-read",
                "dataset-select",
                "dataset-selectivity",
                "file-read",
                "file-write",
                "wide-dataframe",
            ]
        },
        "R": {
            "names": [
                "dataframe-to-table",
                "file-read",
                "file-write",
                "partitioned-dataset-filter",
                "wide-dataframe",
                "tpch",
            ]
        }
    }
}
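To illustrate what a filter of this shape is meant to express, here is a small sketch that applies it to a list of (language, name) pairs. How run.py actually consumes the filters is not shown in this thread, so `is_selected` and `select` are hypothetical helpers, and the shortened `FILTERS` value is only sample data.

```python
from typing import Any, Dict, List, Tuple

# Shortened sample with the same shape as the filter above.
FILTERS: Dict[str, Any] = {
    "langs": {
        "Python": {"names": ["csv-read", "file-read"]},
        "R": {"names": ["file-read", "tpch"]},
    }
}


def is_selected(lang: str, name: str, filters: Dict[str, Any]) -> bool:
    """True if the benchmark name is listed under its language."""
    langs = filters.get("langs", {})
    return name in langs.get(lang, {}).get("names", [])


def select(
    benchmarks: List[Tuple[str, str]], filters: Dict[str, Any]
) -> List[Tuple[str, str]]:
    """Keep only the (language, name) pairs allowed by the filters."""
    return [(lang, name) for lang, name in benchmarks if is_selected(lang, name, filters)]
```

Under this reading, a language that is absent from "langs" contributes no benchmarks at all, which is how the filter restricts a run to Python and R.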

Note that I have not tried this in a while, but it should work.

alistaire47 commented 2 years ago

I tried to get this running in Docker, but docker build bailed with

 => [ 7/13] RUN apt-get update -y -q &&     apt-get install -y -q --no-install-recommends openjdk-8-jdk maven &&     apt-get clean &&     rm -rf /var/lib/apt/lists*                    31.5s
 => ERROR [ 8/13] RUN update-java-alternatives -s java-1.8.0-openjdk-amd64                                                                                                               0.3s
------
 > [ 8/13] RUN update-java-alternatives -s java-1.8.0-openjdk-amd64:
#10 0.249 update-java-alternatives: directory does not exist: /usr/lib/jvm/java-1.8.0-openjdk-amd64
------
executor failed running [/bin/sh -c update-java-alternatives -s java-1.8.0-openjdk-amd64]: exit code: 1

But I ran everything locally with the conbench CLI and it passed, so this code is at least well-tested on one machine. Going to merge for now, and I'll keep an eye on the buildkite builds.

@ElenaHenderson could you clear the cache when you get a second?