Closed: alistaire47 closed this 2 years ago
@alistaire47
There is nothing special about the ursa-i9-9960x machine where the benchmarks run. We run the Python and R benchmarks on other machines, and they work without any special modification for any particular machine.
I would just run all the Python and R benchmarks listed in benchmarks.json locally with `iterations=1`, using a command like `conbench csv-read ...`
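As a hypothetical sketch of that, the loop below echoes one `conbench` invocation per benchmark name (the names come from the filter list below; drop the `echo` to actually run them, and check the benchmarks repo README for the exact CLI flags):

```shell
# Dry-run sketch: print one conbench command per Python benchmark.
# Remove `echo` to execute for real; --iterations=1 keeps each run short.
for bm in csv-read dataset-read file-read file-write wide-dataframe; do
  echo conbench "$bm" --iterations=1
done
```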
You can't test benchmarks on the actual ursa-i9-9960x benchmark machines, but if you must, you can test them locally using an Ubuntu 20.04 Docker image (though I really doubt you need to): https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/README.md#how-can-i-test-benchmark-builds-that-run-on-ursa-i9-9960x-and-ursa-thinkcentre-m75q-locally
Before you run benchmarks locally using the above instructions, you will need to update the benchmarks repo and branch used by arrow-bci locally:
https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L29
https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L31
You should also update the filters to run only the Python and R benchmarks: https://github.com/ursacomputing/arrow-benchmarks-ci/blob/main/buildkite/benchmark/run.py#L212
```python
{
    "langs": {
        "Python": {
            "names": [
                "csv-read",
                "dataframe-to-table",
                "dataset-filter",
                "dataset-read",
                "dataset-select",
                "dataset-selectivity",
                "file-read",
                "file-write",
                "wide-dataframe",
            ]
        },
        "R": {
            "names": [
                "dataframe-to-table",
                "file-read",
                "file-write",
                "partitioned-dataset-filter",
                "wide-dataframe",
                "tpch",
            ]
        }
    }
}
```
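For what it's worth, a filters dict shaped like this can be applied in a few lines of Python. This is just an illustrative sketch: the `benchmarks` entries below are made up, and the real filtering logic lives in run.py.

```python
# Illustrative only: apply a run.py-style filters dict to a list of
# benchmark entries, keeping those whose language and name are listed.
filters = {
    "langs": {
        "Python": {"names": ["csv-read", "file-read"]},
        "R": {"names": ["tpch"]},
    }
}

benchmarks = [
    {"name": "csv-read", "language": "Python"},
    {"name": "dataset-read", "language": "Python"},
    {"name": "tpch", "language": "R"},
    {"name": "tpch", "language": "JavaScript"},
]

def keep(benchmark):
    # A benchmark survives only if its language has a filter entry
    # and its name appears in that entry's "names" list.
    lang_filter = filters["langs"].get(benchmark["language"])
    return lang_filter is not None and benchmark["name"] in lang_filter["names"]

selected = [b["name"] for b in benchmarks if keep(b)]
print(selected)  # ['csv-read', 'tpch']
```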
Note that I have not tried this in a while, but it should work.
I tried to get this running in Docker, but `docker build` bailed with:

```
 => [ 7/13] RUN apt-get update -y -q && apt-get install -y -q --no-install-recommends openjdk-8-jdk maven && apt-get clean && rm -rf /var/lib/apt/lists*  31.5s
 => ERROR [ 8/13] RUN update-java-alternatives -s java-1.8.0-openjdk-amd64  0.3s
------
 > [ 8/13] RUN update-java-alternatives -s java-1.8.0-openjdk-amd64:
#10 0.249 update-java-alternatives: directory does not exist: /usr/lib/jvm/java-1.8.0-openjdk-amd64
------
executor failed running [/bin/sh -c update-java-alternatives -s java-1.8.0-openjdk-amd64]: exit code: 1
```
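Not a confirmed fix, but a diagnostic sketch: the error says `/usr/lib/jvm/java-1.8.0-openjdk-amd64` doesn't exist, so a first step would be checking which JVM directories the apt step actually installed and which names the alternatives system knows about (assumes a Debian/Ubuntu base image):

```shell
# Diagnostic only: the name passed to `update-java-alternatives -s`
# must match an installed JVM, so list what is actually present.
# `|| true` keeps the commands from failing the build while inspecting.
ls /usr/lib/jvm || true
update-java-alternatives --list || true
```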
But I ran everything locally with the conbench CLI and it passed, so this code is at least well-tested on one machine. I'm going to merge for now, and I'll keep an eye on the buildkite builds.
@ElenaHenderson could you clear the cache when you get a second?
A PR to [hopefully!] fix the buildkite builds. To assemble this, I installed the nightly version of pyarrow, cleared the data cache temp dir (important!), and then reran everything to see what sizes things came out as.
Beyond file sizes, this makes a couple of other small tweaks:

- Changed `store_and_fwd_flag` in nyctaxi back to a string, because while it's almost all null, 0, 1, or 2, there's one row with a `*` value that was breaking things
- Set `strings_can_be_null=True` in `pyarrow.csv.ConvertOptions`, because this is what pandas does and what pretty much all users will want, albeit not the default. Also I think it makes some file sizes smaller?

It's hard to say exactly why file sizes changed (or didn't) here; there are a lot of factors, some of which (like types) are changed here, and some (like default row group sizing and Parquet version) are part of Arrow. For now, I'm just trying to make things run, but longer-term we should probably get rid of the file size mess and check the resulting data dimensions and schema instead.
@joosthooz @austin3dickey Some of this may be relevant for datalogistik?
@boshek When this merges, hopefully the Arrow report should populate properly
@ElenaHenderson Is there a way to run this branch on the buildkite machines before merging? I've tested thoroughly on my machine, but there may be more differences on those ones. We'll also need to `rm` the data `temp` dir again.