huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.29k stars 2.7k forks

Improved the tutorial by adding a link for loading datasets #7042

Status: Closed (closed by AmboThom 3 months ago)

AmboThom commented 4 months ago

Improved the tutorial by telling readers that datasets can be loaded from common file types and by adding a link on how to do so. I left the local files section unchanged because its methods are already listed with code snippets.

github-actions[bot] commented 3 months ago
PyArrow==8.0.0

### Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005135 / 0.011353 (-0.006218) | 0.003389 / 0.011008 (-0.007619) | 0.063053 / 0.038508 (0.024545) | 0.031597 / 0.023109 (0.008487) | 0.237519 / 0.275898 (-0.038379) | 0.263101 / 0.323480 (-0.060379) | 0.003109 / 0.007986 (-0.004877) | 0.002699 / 0.004328 (-0.001630) | 0.048611 / 0.004250 (0.044361) | 0.042937 / 0.037052 (0.005884) | 0.253760 / 0.258489 (-0.004729) | 0.275444 / 0.293841 (-0.018397) | 0.028952 / 0.128546 (-0.099594) | 0.011837 / 0.075646 (-0.063809) | 0.207620 / 0.419271 (-0.211651) | 0.035727 / 0.043533 (-0.007806) | 0.241770 / 0.255139 (-0.013369) | 0.270509 / 0.283200 (-0.012691) | 0.020709 / 0.141683 (-0.120974) | 1.135722 / 1.452155 (-0.316432) | 1.200355 / 1.492716 (-0.292361) |

### Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|--------|---|---|---|---|
| new / old (diff) | 0.092555 / 0.018006 (0.074549) | 0.284719 / 0.000490 (0.284229) | 0.000210 / 0.000200 (0.000010) | 0.000049 / 0.000054 (-0.000005) |

### Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|--------|---|---|---|---|---|
| new / old (diff) | 0.018431 / 0.037411 (-0.018980) | 0.063618 / 0.014526 (0.049092) | 0.075371 / 0.176557 (-0.101185) | 0.120982 / 0.737135 (-0.616153) | 0.075718 / 0.296338 (-0.220620) |

### Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.279439 / 0.215209 (0.064230) | 2.722274 / 2.077655 (0.644619) | 1.442314 / 1.504120 (-0.061806) | 1.323166 / 1.541195 (-0.218029) | 1.339642 / 1.468490 (-0.128848) | 0.723451 / 4.584777 (-3.861326) | 2.334879 / 3.745712 (-1.410833) | 2.938745 / 5.269862 (-2.331116) | 1.867278 / 4.565676 (-2.698398) | 0.078704 / 0.424275 (-0.345571) | 0.005128 / 0.007607 (-0.002479) | 0.338634 / 0.226044 (0.112589) | 3.266239 / 2.268929 (0.997311) | 1.815276 / 55.444624 (-53.629349) | 1.487158 / 6.876477 (-5.389319) | 1.547550 / 2.142072 (-0.594522) | 0.804458 / 4.805227 (-4.000769) | 0.139186 / 6.500664 (-6.361479) | 0.042935 / 0.075469 (-0.032534) |

### Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|--------|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.978223 / 1.841788 (-0.863564) | 11.350997 / 8.074308 (3.276689) | 10.082980 / 10.191392 (-0.108412) | 0.145067 / 0.680424 (-0.535357) | 0.014132 / 0.534201 (-0.520069) | 0.302162 / 0.579283 (-0.277121) | 0.264603 / 0.434364 (-0.169761) | 0.338466 / 0.540337 (-0.201871) | 0.427891 / 1.386936 (-0.959045) |
PyArrow==latest

### Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.006078 / 0.011353 (-0.005275) | 0.004030 / 0.011008 (-0.006978) | 0.051646 / 0.038508 (0.013138) | 0.031263 / 0.023109 (0.008154) | 0.279437 / 0.275898 (0.003539) | 0.304489 / 0.323480 (-0.018991) | 0.004553 / 0.007986 (-0.003433) | 0.002869 / 0.004328 (-0.001459) | 0.050638 / 0.004250 (0.046387) | 0.041091 / 0.037052 (0.004038) | 0.290681 / 0.258489 (0.032192) | 0.332059 / 0.293841 (0.038218) | 0.033353 / 0.128546 (-0.095193) | 0.012506 / 0.075646 (-0.063141) | 0.061788 / 0.419271 (-0.357484) | 0.034150 / 0.043533 (-0.009382) | 0.278258 / 0.255139 (0.023119) | 0.298084 / 0.283200 (0.014885) | 0.019106 / 0.141683 (-0.122577) | 1.164475 / 1.452155 (-0.287679) | 1.204804 / 1.492716 (-0.287912) |

### Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|--------|---|---|---|---|
| new / old (diff) | 0.100053 / 0.018006 (0.082047) | 0.301255 / 0.000490 (0.300765) | 0.000220 / 0.000200 (0.000020) | 0.000057 / 0.000054 (0.000003) |

### Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|--------|---|---|---|---|---|
| new / old (diff) | 0.023536 / 0.037411 (-0.013876) | 0.078513 / 0.014526 (0.063987) | 0.090281 / 0.176557 (-0.086276) | 0.129607 / 0.737135 (-0.607528) | 0.090742 / 0.296338 (-0.205596) |

### Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|--------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.304082 / 0.215209 (0.088873) | 2.909401 / 2.077655 (0.831747) | 1.587210 / 1.504120 (0.083090) | 1.458713 / 1.541195 (-0.082482) | 1.472579 / 1.468490 (0.004089) | 0.716542 / 4.584777 (-3.868235) | 0.947557 / 3.745712 (-2.798155) | 2.908044 / 5.269862 (-2.361817) | 1.886382 / 4.565676 (-2.679294) | 0.078105 / 0.424275 (-0.346170) | 0.005802 / 0.007607 (-0.001805) | 0.357883 / 0.226044 (0.131839) | 3.490958 / 2.268929 (1.222029) | 1.946574 / 55.444624 (-53.498050) | 1.645167 / 6.876477 (-5.231310) | 1.649242 / 2.142072 (-0.492830) | 0.796864 / 4.805227 (-4.008363) | 0.134206 / 6.500664 (-6.366458) | 0.041439 / 0.075469 (-0.034030) |

### Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|--------|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.012311 / 1.841788 (-0.829477) | 12.396967 / 8.074308 (4.322659) | 10.382494 / 10.191392 (0.191102) | 0.157395 / 0.680424 (-0.523029) | 0.015154 / 0.534201 (-0.519047) | 0.302209 / 0.579283 (-0.277074) | 0.127430 / 0.434364 (-0.306934) | 0.348933 / 0.540337 (-0.191404) | 0.442930 / 1.386936 (-0.944006) |
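
For anyone reading the benchmark cells above: each one has the form `new / old (diff)`, and the reported numbers are consistent with `diff = new - old`, so a negative diff means the new value is lower than the old one. The snippet below is a minimal sketch of how such a cell could be parsed and checked. The names `parse_cell` and `is_consistent` are hypothetical helpers written for this illustration, not part of the `datasets` benchmark tooling, and the tolerance is an assumption that allows for six-decimal rounding.

```python
import re
from typing import Tuple

# Matches one benchmark cell such as "0.005135 / 0.011353 (-0.006218)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Return (new, old, diff) parsed from a 'new / old (diff)' cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not a benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check that the reported diff equals new - old, up to rounding.

    The tolerance is an assumed value: the cells are printed with six
    decimals, so the reported diff can be off by about 1e-6.
    """
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    # Cell copied from the write_array2d column of benchmark_array_xd.json.
    print(parse_cell("0.020709 / 0.141683 (-0.120974)"))
    print(is_consistent("0.020709 / 0.141683 (-0.120974)"))
```

Splitting a table row on `|` and passing each cell to `parse_cell` would give the three numbers per metric, which makes it easier to sort metrics by diff than reading the wide tables directly.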