Open Oufattole opened 2 months ago
The pull request modifies the command-line usage example for the `aces-cli` tool. Specifically, it updates the argument for the `data.shard` parameter from `qux/#` to `baz/`, changing the directory from which dataset shards are loaded. This correction ensures that users are directed to the appropriate location for accessing their MEDS dataset shards.
| File | Change Summary |
|---|---|
| `docs/source/usage.md` | Updated command-line usage example for `aces-cli` to change the `data.shard` parameter from `qux/#` to `baz/`, reflecting a new source directory for dataset shards. |
In the meadow where data flows,
A rabbit hops where the shard path goes.
From `qux` to `baz`, a leap so bright,
Guiding users to the correct site.
With every command, our joy will grow,
In the world of datasets, we’ll surely glow! 🐇✨
docs/source/usage.md (1)
`215-215`: **Correction to the `expand_shards` function usage.** This change updates the argument for the `data.shard` parameter in the command-line usage example for loading a MEDS dataset with multiple shards. The argument has been corrected from `qux/#` to `baz/`, indicating that the shards will now be sourced from the `baz/` directory. This correction aligns the usage example with the expected input for the `expand_shards` function and is crucial for users to ensure they are referencing the correct directory for their dataset shards.
Just checked, I think both are correct and should probably be included? The config sets the path to `path: ${data.root}/${data.shard}.parquet`.
Case 1: MEDS files like `dir/train/0.parquet`, `dir/train/1.parquet`, and `dir/held_out/0.parquet`. Then `data.root` would be `dir/` and `expand_shards` would be called using `expand_shards("train/2", "held_out/0")`.
Case 2: We also added support for just a simple directory of parquet files like `dir/0.parquet`, `dir/1.parquet`, and `dir/2.parquet`. So `data.root` would be `dir/` and `expand_shards` would be called using `expand_shards("dir/")`.
If this seems right I'll add both.
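To make the two layouts concrete, here is a minimal sketch of how the path template quoted above would resolve once each shard name is known. `resolve_shard_path` is a hypothetical helper written for illustration, not part of ACES, and the shard names are taken directly from the file listings in the two cases rather than from the output of `expand_shards`.

```python
from pathlib import PurePosixPath


def resolve_shard_path(root: str, shard: str) -> str:
    """Hypothetical helper mirroring the template ${data.root}/${data.shard}.parquet."""
    # PurePosixPath normalizes the trailing slash on the root ("dir/" -> "dir").
    return str(PurePosixPath(root) / f"{shard}.parquet")


# Case 1: shards nested under split directories.
case_1 = [resolve_shard_path("dir/", s) for s in ["train/0", "train/1", "held_out/0"]]

# Case 2: a flat directory of parquet files.
case_2 = [resolve_shard_path("dir/", s) for s in ["0", "1", "2"]]

print(case_1)
print(case_2)
```

In both cases `data.root` stays `dir/`; only the shard names differ, which is why both usage examples could sit side by side in the docs.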
@Oufattole can we merge or close this? Not sure which would be more appropriate, but I'd like to get it closed out rather than remaining indefinitely open. Thanks!
`expand_shards` should take in the same input as `data.root` in the documentation, right?
Summary by CodeRabbit
- Updated the command-line usage example for the `aces-cli` tool to reflect the correct directory for loading MEDS dataset shards.
- Updated the `data.shard` parameter to ensure users reference the correct location for their dataset shards.