hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0

Integrate preprocessor loading/functioning in Model #50

Closed by arxyzan 12 months ago

arxyzan commented 1 year ago

Right now, in order to use a model to predict on raw data, you have to do it like this:

from hezar import Model, Tokenizer

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"

model = Model.load(model_path)
tokenizer = Tokenizer.load(model_path)

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
inputs = tokenizer(example, return_tensors="pt")

outputs = model.predict(inputs)

But for regular users, this might be confusing. What they probably want is something like this:

from hezar import Model

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"
example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load(model_path)
outputs = model.predict(example)
arxyzan commented 1 year ago

The solutions I have in mind right now:

1. Handle composable preprocessors: First, we need functionality that supports composable preprocessors. For example, if a pipeline needs to normalize the raw text and then tokenize it, we would need a preprocessor like this:

from hezar import preprocessor
preprocessors = preprocessor.Sequential(["nfkc", "wordpiece"])

Consequently, a sequential preprocessor should be able to load a container like the one above from the Hub, so that inputs can be fed to the preprocessor in one go. To tackle this, we first have to add a new parameter to the preprocessor config that specifies which preprocessors a given model needs, or we can implement functionality that automatically detects and loads every preprocessor found under the preprocessor folder of a Hub repo.
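The composability idea above could be sketched roughly like this. Note that every name here (the stub classes, the registry, SequentialPreprocessor) is hypothetical and only illustrates chaining preprocessors by name; it is not hezar's actual API:

```python
# Hypothetical sketch of a composable (sequential) preprocessor.
# The steps are stubs: real "nfkc" would be unicode normalization and
# "wordpiece" would be a trained WordPiece tokenizer.

class Normalizer:
    def __call__(self, texts):
        # stand-in for NFKC normalization: just lowercase
        return [t.lower() for t in texts]

class Tokenizer:
    def __call__(self, texts):
        # stand-in for WordPiece: whitespace tokenization
        return [t.split() for t in texts]

# name -> preprocessor class, so steps can be specified by string
REGISTRY = {"nfkc": Normalizer, "wordpiece": Tokenizer}

class SequentialPreprocessor:
    """Runs a list of registered preprocessors in order."""
    def __init__(self, names):
        self.steps = [REGISTRY[name]() for name in names]

    def __call__(self, inputs):
        for step in self.steps:
            inputs = step(inputs)
        return inputs

pipeline = SequentialPreprocessor(["nfkc", "wordpiece"])
print(pipeline(["Hello World"]))  # [['hello', 'world']]
```

With such a container, a Hub repo only needs to store the ordered list of step names in its config for the whole chain to be reconstructed at load time.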

The issue with this approach is that although it handles all the preprocessing for any of our models, we might still have to implement the same post-processing for every model, because post-processing is not configurable.

2. Model pipelines: For every task, we provide a Pipeline class, just like other libraries do. In a pipeline, the whole flow is handled end-to-end.

Note: Both of the above solutions require composable preprocessing to be handled first.
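The pipeline idea in solution 2 could be sketched like this. All names and the toy "model" are hypothetical stand-ins, not hezar's real Pipeline API; the point is only that preprocess, forward, and post-process live behind a single call:

```python
# Hypothetical end-to-end pipeline sketch: one call runs
# preprocess -> forward -> post_process.

class SentimentPipeline:
    LABELS = {0: "negative", 1: "positive"}

    def preprocess(self, texts):
        # stand-in for tokenization/feature extraction:
        # count occurrences of the word "good"
        return [t.count("good") for t in texts]

    def forward(self, features):
        # stand-in for the model: positive if "good" appears at all
        return [1 if f > 0 else 0 for f in features]

    def post_process(self, label_ids):
        # map raw label ids to human-readable labels
        return [self.LABELS[i] for i in label_ids]

    def __call__(self, texts):
        return self.post_process(self.forward(self.preprocess(texts)))

pipe = SentimentPipeline()
print(pipe(["this library is good"]))  # ['positive']
```

The per-task post-processing that solution 1 leaves unconfigurable naturally lives in `post_process` here, which is the main appeal of the pipeline design.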

arxyzan commented 12 months ago

Following the latest release (0.16.0), you can now load a preprocessor (or several) like this:

from hezar import Preprocessor

preprocessors = Preprocessor.load("hezarai/roberta-fa-sentiment-digikala-snappfood")
print(preprocessors)

Note that if a repo has only one preprocessor, this code returns just that preprocessor object, but if there are multiple preprocessors, the returned value is a dict of preprocessors. Besides, you no longer need to know which specific preprocessor class to use for loading.

# Old
normalizer = Normalizer.load(...)
tokenizer = Tokenizer.load(...)

# New
normalizer = Preprocessor.load(...)
tokenizer = Preprocessor.load(...)

This schema will make it easy for us to load the whole preprocessing pipeline a model needs in the future.

arxyzan commented 12 months ago

Following the above changes and v0.17.0, you no longer need to load and call the preprocessor for the model yourself. From now on, loading the preprocessors is handled in Model.load(...) and preprocessing is handled in Model.preprocess(...).

OLD

from hezar import Model, Tokenizer

hub_path = "hezarai/roberta-fa-sentiment-digikala-snappfood"
model = Model.load(hub_path, device="cpu")
tokenizer = Tokenizer.load(hub_path)
inputs = ["کتابخانه هزار، بهترین کتابخانه هوش مصنوعیه"]
model_inputs = tokenizer(inputs, return_tensors="pt", device="cpu")
model_outputs = model.predict(model_inputs)
print(model_outputs)

NEW

from hezar import Model

hub_path = "hezarai/distilbert-fa-sentiment-digikala-snappfood"
model = Model.load(hub_path, device="cpu")
inputs = ["کتابخانه هزار، بهترین کتابخانه هوش مصنوعیه"]
model_outputs = model.predict(inputs)
print(model_outputs)
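Internally, the new behavior amounts to the model owning its preprocessor. A minimal sketch of that wiring, where every class, the toy "forward", and the load behavior are hypothetical and only mirror the described flow, not hezar's internals:

```python
# Hypothetical sketch: Model.load attaches a preprocessor, and
# predict chains preprocess -> forward on raw inputs.

class TinyTokenizer:
    def __call__(self, texts):
        # stand-in tokenizer: whitespace split
        return [t.split() for t in texts]

class TinyModel:
    def __init__(self, preprocessor):
        self.preprocessor = preprocessor

    @classmethod
    def load(cls, hub_path):
        # in the real library this would also fetch weights and
        # every preprocessor stored in the Hub repo
        return cls(preprocessor=TinyTokenizer())

    def preprocess(self, raw_inputs):
        return self.preprocessor(raw_inputs)

    def forward(self, model_inputs):
        # stand-in "model": count tokens per example
        return [len(tokens) for tokens in model_inputs]

    def predict(self, raw_inputs):
        return self.forward(self.preprocess(raw_inputs))

model = TinyModel.load("some/repo")
print(model.predict(["a b c"]))  # [3]
```

This is why the NEW snippet above can pass raw strings straight to `predict`: the tokenization step has moved inside the model object.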