remove more script docs

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

### Benchmark: benchmark_array_xd.json | metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence | |--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | new / old (diff) | 0.005343 / 0.011353 (-0.006010) | 0.003562 / 0.011008 (-0.007447) | 0.062785 / 0.038508 (0.024277) | 0.031459 / 0.023109 (0.008349) | 0.246497 / 0.275898 (-0.029401) | 0.268258 / 0.323480 (-0.055222) | 0.003201 / 0.007986 (-0.004785) | 0.004153 / 0.004328 (-0.000175) | 0.049003 / 0.004250 (0.044753) | 0.042780 / 0.037052 (0.005728) | 0.263857 / 0.258489 (0.005368) | 0.278578 / 0.293841 (-0.015263) | 0.030357 / 0.128546 (-0.098190) | 0.012341 / 0.075646 (-0.063305) | 0.206010 / 0.419271 (-0.213262) | 0.036244 / 0.043533 (-0.007289) | 0.245799 / 0.255139 (-0.009340) | 0.265467 / 0.283200 (-0.017733) | 0.019473 / 0.141683 (-0.122210) | 1.147913 / 1.452155 (-0.304242) | 1.209968 / 1.492716 (-0.282749) | ### Benchmark: benchmark_getitem\_100B.json | metric | get_batch_of\_1024\_random_rows | get_batch_of\_1024\_rows | get_first_row | get_last_row | |--------|---|---|---|---| | new / old (diff) | 0.099393 / 0.018006 (0.081387) | 0.300898 / 0.000490 (0.300408) | 0.000258 / 0.000200 (0.000058) | 0.000044 / 0.000054 (-0.000010) | ### Benchmark: benchmark_indices_mapping.json | metric | select | shard | shuffle | sort | train_test_split | |--------|---|---|---|---|---| | new / old (diff) | 0.018888 / 0.037411 (-0.018523) | 0.062452 / 0.014526 (0.047926) | 0.073799 / 0.176557 (-0.102757) | 0.121297 / 0.737135 (-0.615839) | 0.074855 / 0.296338 (-0.221484) | ### Benchmark: benchmark_iterating.json | metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 | |--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | new / old (diff) | 0.283969 / 0.215209 (0.068760) | 2.808820 / 2.077655 (0.731165) | 1.446106 / 1.504120 (-0.058014) | 1.321622 / 1.541195 (-0.219573) | 1.348317 / 1.468490 (-0.120173) | 0.738369 / 4.584777 (-3.846408) | 2.349825 / 3.745712 (-1.395887) | 2.913964 / 5.269862 (-2.355897) | 1.870585 / 4.565676 (-2.695092) | 0.080141 / 0.424275 (-0.344134) | 0.005174 / 0.007607 (-0.002433) | 0.335977 / 0.226044 (0.109933) | 3.356267 / 2.268929 (1.087338) | 1.811149 / 55.444624 (-53.633475) | 1.510685 / 6.876477 (-5.365792) | 1.524960 / 2.142072 (-0.617112) | 0.803900 / 4.805227 (-4.001328) | 0.138294 / 6.500664 (-6.362370) | 0.042241 / 0.075469 (-0.033229) | ### Benchmark: benchmark_map_filter.json | metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow | |--------|---|---|---|---|---|---|---|---|---| | new / old (diff) | 0.975597 / 1.841788 (-0.866191) | 11.395109 / 8.074308 (3.320801) | 9.837724 / 10.191392 (-0.353668) | 0.141474 / 0.680424 (-0.538950) | 0.015075 / 0.534201 (-0.519126) | 0.304285 / 0.579283 (-0.274998) | 0.267845 / 0.434364 (-0.166519) | 0.342808 / 0.540337 (-0.197529) | 0.434299 / 1.386936 (-0.952637) |

PyArrow==latest

Show updated benchmarks!

### Benchmark: benchmark_array_xd.json | metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence | |--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | new / old (diff) | 0.005612 / 0.011353 (-0.005741) | 0.003808 / 0.011008 (-0.007201) | 0.050533 / 0.038508 (0.012024) | 0.032635 / 0.023109 (0.009526) | 0.265522 / 0.275898 (-0.010376) | 0.289763 / 0.323480 (-0.033716) | 0.004395 / 0.007986 (-0.003590) | 0.002868 / 0.004328 (-0.001460) | 0.048443 / 0.004250 (0.044193) | 0.040047 / 0.037052 (0.002995) | 0.279013 / 0.258489 (0.020524) | 0.314499 / 0.293841 (0.020658) | 0.032321 / 0.128546 (-0.096225) | 0.011902 / 0.075646 (-0.063744) | 0.059827 / 0.419271 (-0.359445) | 0.034388 / 0.043533 (-0.009145) | 0.270660 / 0.255139 (0.015521) | 0.290776 / 0.283200 (0.007576) | 0.017875 / 0.141683 (-0.123808) | 1.188085 / 1.452155 (-0.264070) | 1.221384 / 1.492716 (-0.271332) | ### Benchmark: benchmark_getitem\_100B.json | metric | get_batch_of\_1024\_random_rows | get_batch_of\_1024\_rows | get_first_row | get_last_row | |--------|---|---|---|---| | new / old (diff) | 0.095619 / 0.018006 (0.077613) | 0.305331 / 0.000490 (0.304841) | 0.000217 / 0.000200 (0.000018) | 0.000049 / 0.000054 (-0.000006) | ### Benchmark: benchmark_indices_mapping.json | metric | select | shard | shuffle | sort | train_test_split | |--------|---|---|---|---|---| | new / old (diff) | 0.022481 / 0.037411 (-0.014930) | 0.076957 / 0.014526 (0.062431) | 0.087830 / 0.176557 (-0.088726) | 0.128290 / 0.737135 (-0.608845) | 0.090565 / 0.296338 (-0.205774) | ### Benchmark: benchmark_iterating.json | metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 | |--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | new / old (diff) | 0.291861 / 0.215209 (0.076652) | 2.869776 / 2.077655 (0.792121) | 1.575114 / 1.504120 (0.070994) | 1.449873 / 1.541195 (-0.091322) | 1.450333 / 1.468490 (-0.018158) | 0.723319 / 4.584777 (-3.861458) | 0.972603 / 3.745712 (-2.773109) | 2.940909 / 5.269862 (-2.328953) | 1.889664 / 4.565676 (-2.676012) | 0.078654 / 0.424275 (-0.345621) | 0.005197 / 0.007607 (-0.002410) | 0.344380 / 0.226044 (0.118336) | 3.387509 / 2.268929 (1.118580) | 1.981590 / 55.444624 (-53.463034) | 1.643214 / 6.876477 (-5.233263) | 1.640435 / 2.142072 (-0.501638) | 0.802037 / 4.805227 (-4.003191) | 0.133016 / 6.500664 (-6.367648) | 0.040861 / 0.075469 (-0.034608) | ### Benchmark: benchmark_map_filter.json | metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow | |--------|---|---|---|---|---|---|---|---|---| | new / old (diff) | 1.026372 / 1.841788 (-0.815416) | 11.959931 / 8.074308 (3.885623) | 10.122523 / 10.191392 (-0.068869) | 0.144443 / 0.680424 (-0.535981) | 0.015629 / 0.534201 (-0.518572) | 0.304802 / 0.579283 (-0.274481) | 0.120538 / 0.434364 (-0.313826) | 0.343394 / 0.540337 (-0.196943) | 0.437544 / 1.386936 (-0.949392) |

huggingface / datasets

remove more script docs #7104