mlfoundations / open_flamingo

An open-source framework for training large multimodal models.
MIT License

[BUG] an error in get_dataset_size about args.shards #287

Open ChrisZhangyu opened 10 months ago

ChrisZhangyu commented 10 months ago

In the get_dataset_size function, we want to get the directory path of the data files:

shards_list = list(braceexpand.braceexpand(shards))
dir_path = os.path.dirname(shards[0])

But shards comes from args, where it is defined as a string:

parser.add_argument(
    "--mmc4_shards",
    type=str,
    default=mmc4_data_path,
    help="path to c4 shards, this should be a glob pattern such as /path/to/shards/shard-{0000..0999}.tar",
)

Because shards is a string, shards[0] is only its first character, so calling os.path.dirname(shards[0]) cannot return the directory (e.g. /path/to/shards). Should it be changed to shards_list[0], like this?

shards_list = list(braceexpand.braceexpand(shards))
dir_path = os.path.dirname(shards_list[0])
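
To make the difference concrete, here is a minimal sketch that compares the two calls using only the standard library. The pattern is shortened to three shards for illustration, and shards_list is written out by hand as a stand-in for what list(braceexpand.braceexpand(shards)) is expected to return; both are assumptions for the demo, not code from the repository.

```python
import os

# Hypothetical shard pattern, shortened from the example in the help text
shards = "/path/to/shards/shard-{0000..0002}.tar"

# Hand-written stand-in (assumption) for list(braceexpand.braceexpand(shards))
shards_list = [
    "/path/to/shards/shard-0000.tar",
    "/path/to/shards/shard-0001.tar",
    "/path/to/shards/shard-0002.tar",
]

# Current code: indexing a string yields a single character, here "/"
first_char = shards[0]
buggy_dir = os.path.dirname(first_char)

# Proposed fix: index the expanded list, which yields a full shard path
fixed_dir = os.path.dirname(shards_list[0])

print(repr(first_char), repr(buggy_dir), repr(fixed_dir))
```

With an absolute pattern the current code ends up with the filesystem root as dir_path, and with a relative pattern it would get an empty string, so in either case the shard directory is lost.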