arxyzan closed this 2 weeks ago
any update on this issue?
@mohamad-tohidi I'm still trying to figure out an approach that is backward compatible and easy to use.
The challenge is that datasets come in many different schemas and structures, so an approach like subclassing is crucial: we can neither force users to restructure their datasets nor implement that much flexibility in the `Dataset` class for every kind of schema.
My current idea is to require users to subclass the `Dataset` class and override the `load` method, or something like that.
I've been away from the development of this library for a while to take a break, but I'm back at it now. I'll follow up as soon as it's done.
The dataset's load functionality cannot be made flexible enough internally to satisfy everybody, so we made `_load` an abstract method so that everyone can implement their own.
See more details here: https://hezarai.github.io/hezar/tutorial/datasets.html#custom-datasets
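For readers skimming this thread, here is a rough sketch of the abstract-`_load` pattern described above. It does not use Hezar's actual classes: `BaseDataset`, `LocalCSVDataset`, the `text`/`label` column names, and the `demo` helper are all hypothetical names made up for illustration, and the real base class and signature may differ (see the linked tutorial for the actual API).

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseDataset(ABC):
    """Hypothetical stand-in for a library base class (not Hezar's actual class)."""

    def __init__(self, path):
        self.path = path
        # All schema-specific loading is delegated to the subclass.
        self.data = self._load(path)

    @abstractmethod
    def _load(self, path):
        """Return the samples found at `path` as a list."""

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class LocalCSVDataset(BaseDataset):
    """Hypothetical user dataset: a local CSV with `text` and `label` columns."""

    def _load(self, path):
        with open(path, newline="", encoding="utf-8") as f:
            return [
                {"text": row["text"], "label": int(row["label"])}
                for row in csv.DictReader(f)
            ]


def demo():
    # Write a tiny CSV to a temporary directory and load it through the subclass.
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "train.csv"
        csv_path.write_text("text,label\nhello,1\nbye,0\n", encoding="utf-8")
        dataset = LocalCSVDataset(csv_path)
        return len(dataset), dataset[0]
```

With this shape the base class never needs to know the user's schema. Because `_load` is abstract, instantiating `BaseDataset` directly raises a `TypeError`, so a missing override fails early instead of silently returning nothing.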
Dataset loading is currently only possible for Hub datasets. If users need to load their own data from a local path or somewhere else, they have to subclass a similar dataset class and override the `_load()` method. This might cause conflicts and confuse users. We need to reconsider the dataset loading pipeline and look at other ways to make it flexible.