Close gzipped files properly

PyArrow==8.0.0

### Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005388 / 0.011353 (-0.005965) | 0.003822 / 0.011008 (-0.007187) | 0.063285 / 0.038508 (0.024777) | 0.033780 / 0.023109 (0.010671) | 0.239580 / 0.275898 (-0.036318) | 0.264203 / 0.323480 (-0.059277) | 0.004207 / 0.007986 (-0.003778) | 0.002716 / 0.004328 (-0.001612) | 0.049569 / 0.004250 (0.045319) | 0.048591 / 0.037052 (0.011538) | 0.252606 / 0.258489 (-0.005884) | 0.285998 / 0.293841 (-0.007843) | 0.028650 / 0.128546 (-0.099896) | 0.010652 / 0.075646 (-0.064994) | 0.203962 / 0.419271 (-0.215310) | 0.036207 / 0.043533 (-0.007326) | 0.240374 / 0.255139 (-0.014765) | 0.263564 / 0.283200 (-0.019636) | 0.017722 / 0.141683 (-0.123961) | 1.143741 / 1.452155 (-0.308414) | 1.192452 / 1.492716 (-0.300264) |

### Benchmark: benchmark_getitem\_100B.json

| metric | get_batch_of\_1024\_random_rows | get_batch_of\_1024\_rows | get_first_row | get_last_row |
|--------|---|---|---|---|
| new / old (diff) | 0.141329 / 0.018006 (0.123323) | 0.320169 / 0.000490 (0.319679) | 0.000240 / 0.000200 (0.000041) | 0.000045 / 0.000054 (-0.000009) |

### Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|--------|---|---|---|---|---|
| new / old (diff) | 0.019885 / 0.037411 (-0.017526) | 0.063322 / 0.014526 (0.048796) | 0.075446 / 0.176557 (-0.101110) | 0.122619 / 0.737135 (-0.614517) | 0.077175 / 0.296338 (-0.219163) |

### Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.281292 / 0.215209 (0.066083) | 2.796220 / 2.077655 (0.718565) | 1.456035 / 1.504120 (-0.048085) | 1.334445 / 1.541195 (-0.206750) | 1.380223 / 1.468490 (-0.088267) | 0.575895 / 4.584777 (-4.008882) | 2.375791 / 3.745712 (-1.369921) | 2.926273 / 5.269862 (-2.343589) | 1.832586 / 4.565676 (-2.733090) | 0.064323 / 0.424275 (-0.359952) | 0.005403 / 0.007607 (-0.002204) | 0.334088 / 0.226044 (0.108043) | 3.321174 / 2.268929 (1.052246) | 1.821432 / 55.444624 (-53.623193) | 1.520181 / 6.876477 (-5.356296) | 1.582487 / 2.142072 (-0.559585) | 0.645641 / 4.805227 (-4.159586) | 0.119596 / 6.500664 (-6.381068) | 0.043144 / 0.075469 (-0.032325) |

### Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|--------|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.985104 / 1.841788 (-0.856684) | 12.518240 / 8.074308 (4.443932) | 10.017118 / 10.191392 (-0.174274) | 0.133900 / 0.680424 (-0.546524) | 0.014591 / 0.534201 (-0.519610) | 0.288326 / 0.579283 (-0.290957) | 0.262292 / 0.434364 (-0.172072) | 0.327601 / 0.540337 (-0.212736) | 0.421525 / 1.386936 (-0.965411) |

PyArrow==latest

### Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005546 / 0.011353 (-0.005807) | 0.003961 / 0.011008 (-0.007047) | 0.051745 / 0.038508 (0.013237) | 0.032587 / 0.023109 (0.009478) | 0.266886 / 0.275898 (-0.009012) | 0.301327 / 0.323480 (-0.022153) | 0.004273 / 0.007986 (-0.003713) | 0.002851 / 0.004328 (-0.001477) | 0.049333 / 0.004250 (0.045082) | 0.044530 / 0.037052 (0.007478) | 0.286829 / 0.258489 (0.028340) | 0.310732 / 0.293841 (0.016892) | 0.029925 / 0.128546 (-0.098621) | 0.011270 / 0.075646 (-0.064377) | 0.059071 / 0.419271 (-0.360200) | 0.033899 / 0.043533 (-0.009633) | 0.270448 / 0.255139 (0.015309) | 0.286935 / 0.283200 (0.003735) | 0.019516 / 0.141683 (-0.122167) | 1.125815 / 1.452155 (-0.326339) | 1.179893 / 1.492716 (-0.312823) |

### Benchmark: benchmark_getitem\_100B.json

| metric | get_batch_of\_1024\_random_rows | get_batch_of\_1024\_rows | get_first_row | get_last_row |
|--------|---|---|---|---|
| new / old (diff) | 0.096476 / 0.018006 (0.078470) | 0.305149 / 0.000490 (0.304660) | 0.000207 / 0.000200 (0.000008) | 0.000046 / 0.000054 (-0.000009) |

### Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|--------|---|---|---|---|---|
| new / old (diff) | 0.023648 / 0.037411 (-0.013763) | 0.082847 / 0.014526 (0.068322) | 0.089210 / 0.176557 (-0.087347) | 0.130194 / 0.737135 (-0.606941) | 0.091700 / 0.296338 (-0.204639) |

### Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.290995 / 0.215209 (0.075786) | 2.870335 / 2.077655 (0.792680) | 1.595661 / 1.504120 (0.091541) | 1.452319 / 1.541195 (-0.088876) | 1.505647 / 1.468490 (0.037157) | 0.575856 / 4.584777 (-4.008921) | 1.005527 / 3.745712 (-2.740185) | 2.927824 / 5.269862 (-2.342038) | 1.791702 / 4.565676 (-2.773975) | 0.064804 / 0.424275 (-0.359471) | 0.005203 / 0.007607 (-0.002404) | 0.348615 / 0.226044 (0.122570) | 3.463989 / 2.268929 (1.195060) | 1.947758 / 55.444624 (-53.496866) | 1.669974 / 6.876477 (-5.206502) | 1.721663 / 2.142072 (-0.420410) | 0.650999 / 4.805227 (-4.154228) | 0.117769 / 6.500664 (-6.382895) | 0.041738 / 0.075469 (-0.033731) |

### Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|--------|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.004140 / 1.841788 (-0.837648) | 13.035487 / 8.074308 (4.961179) | 10.318152 / 10.191392 (0.126760) | 0.143776 / 0.680424 (-0.536648) | 0.016272 / 0.534201 (-0.517929) | 0.286564 / 0.579283 (-0.292719) | 0.126579 / 0.434364 (-0.307785) | 0.397253 / 0.540337 (-0.143085) | 0.424968 / 1.386936 (-0.961968) |

huggingface / datasets

Close gzipped files properly #6893