parser.add_argument(
    "--mmc4_shards",
    type=str,
    default=mmc4_data_path,
    help="path to c4 shards, this should be a glob pattern such as /path/to/shards/shard-{0000..0999}.tar",
)
In the get_dataset_size function we want to get the directory that contains the data files.
But the shards value in args is a single string, as defined by the --mmc4_shards argument above.
Calling os.path.dirname(shards[0]) therefore cannot return that directory (e.g. /path/to/shards), because shards[0] is only the first character of the string. Should it be changed to shard_list[0]?
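
A minimal sketch of the difference, for illustration only. It assumes that shard_list is the list of shard paths produced by expanding the pattern; expand_numeric_range is a hypothetical stand-in for whatever brace expansion the repository actually uses, and it handles only a single numeric {start..end} range. The path is the example from the help text, shortened to three shards.

```python
import os
import re


def expand_numeric_range(pattern):
    """Hypothetical helper: expand one {start..end} numeric range into a list of paths."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding of the original pattern
    return [
        pattern[: match.start()] + str(i).zfill(width) + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]


# The argument value is one string, not a list of paths.
shards = "/path/to/shards/shard-{0000..0002}.tar"
shard_list = expand_numeric_range(shards)

# Indexing the string yields its first character, so dirname sees only "/".
buggy_dir = os.path.dirname(shards[0])

# Indexing the expanded list yields a full shard path.
fixed_dir = os.path.dirname(shard_list[0])

print(repr(shards[0]), repr(buggy_dir))
print(repr(shard_list[0]), repr(fixed_dir))
```

With the expanded list, fixed_dir is the shard directory, so a file stored next to the shards can be located with os.path.join(fixed_dir, ...).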