hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0
817 stars, 44 forks

Improve `Dataset.load()` flexibility #142

Closed arxyzan closed 2 weeks ago

arxyzan commented 6 months ago

Dataset loading currently works only for Hub datasets. A user who needs to load their own data from a local path or anywhere else has to subclass a similar dataset class and override the `_load()` method. This can cause conflicts and confuse users. We need to reconsider the dataset loading pipeline and look for more flexible ways to implement it.

mohamad-tohidi commented 5 months ago

Any update on this issue?

arxyzan commented 5 months ago

@mohamad-tohidi I'm still trying to figure out a convenient approach that is both backward compatible and easy to use. The challenge is that datasets come in many different schemas and structures, so an approach like subclassing is crucial: we can neither force users to restructure their datasets nor build enough flexibility into the `Dataset` class to cover every schema. My current idea is to require users to subclass the `Dataset` class and override the load method, or something along those lines.

I was away from the development of this library for a while to take a break, but I'm back on it now. I'll follow up as soon as this is done.

arxyzan commented 2 weeks ago

The dataset's built-in load functionality can never be flexible enough to satisfy everybody, so we made `_load` an abstract method that everyone can implement for their own data. See more details here: https://hezarai.github.io/hezar/tutorial/datasets.html#custom-datasets
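
For readers who want to see the general pattern, here is a minimal, self-contained sketch of an abstract `_load` that each subclass implements. It deliberately does not import Hezar: `BaseDataset`, `LocalCSVDataset`, and the `text_column` / `label_column` parameters are hypothetical names made up for illustration, and the real base class and its constructor signature may differ, so check the linked tutorial for the actual API.

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseDataset(ABC):
    """Hypothetical stand-in for a library base class (NOT Hezar's actual Dataset)."""

    def __init__(self, path, **kwargs):
        self.path = path
        # Loading is delegated entirely to the subclass.
        self.data = self._load(**kwargs)

    @abstractmethod
    def _load(self, **kwargs):
        """Return a sequence of samples. Every subclass must implement this."""

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class LocalCSVDataset(BaseDataset):
    """Hypothetical custom dataset that reads samples from a local CSV file."""

    def _load(self, text_column="text", label_column="label"):
        # The column names are assumptions; adapt them to your own schema.
        with open(self.path, newline="", encoding="utf-8") as f:
            return [
                {"text": row[text_column], "label": row[label_column]}
                for row in csv.DictReader(f)
            ]


# Usage: write a tiny CSV to a temporary directory and load it.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "reviews.csv"
    csv_path.write_text(
        "text,label\ngreat movie,positive\nboring plot,negative\n",
        encoding="utf-8",
    )
    dataset = LocalCSVDataset(csv_path)

print(len(dataset), dataset[0])
```

Because `_load` is abstract, instantiating the base class directly raises a `TypeError`, which makes the requirement to implement it explicit instead of leaving users to guess which method to override.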