arxyzan closed this issue 12 months ago
The solutions I have in mind right now:
1. Composable preprocessors: First, we need functionality that supports composable preprocessors. For example, if a pipeline needs to normalize the raw text and then tokenize it, we would need a preprocessor like this:
from hezar import preprocessor
preprocessors = preprocessor.Sequential(["nfkc", "wordpiece"])
Consequently, a sequential preprocessor should be able to load a container like the one above from the Hub, so that the inputs can be fed to the preprocessor in one go. To tackle this, we can either add a new parameter to the preprocessor config that specifies which preprocessors a given model needs, or implement functionality that automatically detects and loads every preprocessor in a Hub repo's preprocessor folder.
The issue with this is that although it covers all of the preprocessing for any of our models, we might still have to implement the same post-processing for every model, because post-processing is not configurable.
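To make the composable-preprocessor idea concrete, here is a minimal sketch of what a Sequential container could look like. This is an assumption about the eventual design, not Hezar's actual implementation; the toy normalize and whitespace_tokenize functions stand in for real preprocessors like a unicode normalizer and a WordPiece tokenizer.

```python
class Sequential:
    """Hypothetical container that applies a list of preprocessors in order."""

    def __init__(self, preprocessors):
        self.preprocessors = list(preprocessors)

    def __call__(self, inputs):
        # Each preprocessor's output feeds the next one
        for preprocessor in self.preprocessors:
            inputs = preprocessor(inputs)
        return inputs


# Toy stand-ins for real preprocessors
def normalize(texts):
    return [t.strip().lower() for t in texts]

def whitespace_tokenize(texts):
    return [t.split() for t in texts]

pipeline = Sequential([normalize, whitespace_tokenize])
print(pipeline(["  Hello World  "]))  # [['hello', 'world']]
```

The point of this shape is that callers only ever deal with one callable, regardless of how many preprocessing steps the model needs.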
2. Model pipelines: For every task we provide a Pipeline class, just like other libraries. In a pipeline, all the flows are handled end-to-end.
Note: Both of the above solutions require the composable preprocessing to be handled.
Following the latest release (0.16.0), now one can load a preprocessor or preprocessors like this:
from hezar import Preprocessor
preprocessors = Preprocessor.load("hezarai/roberta-fa-sentiment-digikala-snappfood")
print(preprocessors)
Note that if a repo has only one preprocessor, this code returns that single preprocessor object, but if there are multiple preprocessors, the returned value is a dict of preprocessors. Besides, you no longer have to know which specific preprocessor class to use for loading.
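Since the return type of Preprocessor.load varies (a single object or a dict), calling code may want to normalize it. The helper below is a hypothetical sketch of that normalization, using a toy class in place of a real Hezar preprocessor:

```python
def as_preprocessor_dict(loaded):
    """Normalize the result of a load call to a name -> preprocessor dict.

    Hypothetical helper: if a dict was returned, pass it through; otherwise
    key the single object by its lowercased class name.
    """
    if isinstance(loaded, dict):
        return loaded
    return {type(loaded).__name__.lower(): loaded}


class WordPieceTokenizer:  # toy stand-in for a real preprocessor class
    pass

single = WordPieceTokenizer()
multi = {"normalizer": object(), "tokenizer": WordPieceTokenizer()}

print(sorted(as_preprocessor_dict(single)))  # ['wordpiecetokenizer']
print(sorted(as_preprocessor_dict(multi)))   # ['normalizer', 'tokenizer']
```

With a uniform dict in hand, downstream code can iterate over all preprocessors without caring how many the repo had.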
# Old
normalizer = Normalizer.load(...)
tokenizer = Tokenizer.load(...)
# New
normalizer = Preprocessor.load(...)
tokenizer = Preprocessor.load(...)
This schema will make it easy, in the future, to load a model's whole preprocessing pipeline in one call.
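One way a single entry point like Preprocessor.load can dispatch to the right class is a config-driven registry. The sketch below is an assumption about the mechanism, not Hezar's internals: each saved config names its preprocessor type, and a registry maps that name to the class to instantiate.

```python
PREPROCESSOR_REGISTRY = {}

def register(name):
    """Decorator that maps a config name to a preprocessor class."""
    def wrapper(cls):
        PREPROCESSOR_REGISTRY[name] = cls
        return cls
    return wrapper

@register("normalizer")
class Normalizer:
    def __init__(self, config):
        self.config = config

@register("wordpiece_tokenizer")
class WordPieceTokenizer:
    def __init__(self, config):
        self.config = config

def load_preprocessors(configs):
    """Build every preprocessor declared in a repo's configs (toy loader)."""
    return {c["name"]: PREPROCESSOR_REGISTRY[c["name"]](c) for c in configs}

repo_configs = [{"name": "normalizer"}, {"name": "wordpiece_tokenizer"}]
loaded = load_preprocessors(repo_configs)
print(sorted(loaded))  # ['normalizer', 'wordpiece_tokenizer']
```

The registry is what lets one generic load call work for any preprocessor type without callers importing the concrete classes.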
Following the above changes and v0.17.0, you no longer need to load and call the preprocessor for the model yourself. From now on, loading the preprocessors is handled in Model.load(...) and preprocessing is handled in Model.preprocess(...).
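The description above implies that predict chains the stages end-to-end. A minimal sketch of that flow (the preprocess/predict names follow the text; the internals and a post_process step are assumptions for illustration):

```python
class Model:
    """Toy model illustrating an end-to-end predict flow."""

    def preprocess(self, raw_inputs):
        # Stand-in for running the loaded normalizer + tokenizer
        return [len(text.split()) for text in raw_inputs]

    def forward(self, model_inputs):
        # Stand-in for the actual model computation
        return [x * 2 for x in model_inputs]

    def post_process(self, model_outputs):
        # Stand-in for converting raw outputs to user-facing results
        return [{"score": y} for y in model_outputs]

    def predict(self, raw_inputs):
        model_inputs = self.preprocess(raw_inputs)
        model_outputs = self.forward(model_inputs)
        return self.post_process(model_outputs)

print(Model().predict(["hello world"]))  # [{'score': 4}]
```

This is exactly why the NEW example below can pass raw strings straight to model.predict: the preprocessing step happens inside the call.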
OLD
from hezar import Model, Tokenizer
hub_path = "hezarai/roberta-fa-sentiment-digikala-snappfood"
model = Model.load(hub_path, device="cpu")
tokenizer = Tokenizer.load(hub_path)
inputs = ["کتابخانه هزار، بهترین کتابخانه هوش مصنوعیه"]
model_inputs = tokenizer(inputs, return_tensors="pt", device="cpu")
model_outputs = model.predict(model_inputs)
print(model_outputs)
NEW
from hezar import Model
hub_path = "hezarai/distilbert-fa-sentiment-digikala-snappfood"
model = Model.load(hub_path, device="cpu")
inputs = ["کتابخانه هزار، بهترین کتابخانه هوش مصنوعیه"]
model_outputs = model.predict(inputs)
print(model_outputs)
Right now, in order to use a model to predict on raw data, you have to do it as shown in the OLD example above. But for regular users, this might be unclear; what they likely want is something closer to the NEW example.