Hi! Let me just try to explain better.
There are two kinds of models: visual and non-visual. We have one visual model, trained by Alibaba Research Group. Even though we are not the owners of this model, we make it easier to use. It takes care of both the segmentation of the page and the detection of the types. It's the default model in the service; it needs more resources than the non-visual models and is slower, but more accurate.
As for the non-visual models, there are two of them, and we trained them ourselves using LightGBM. The first model detects the type of each token. "Token" generally means a line of text, but it can be any part of the text; it depends on the XML extracted in the background (by Poppler). After detecting the types, we pass the features (including the types predicted by the first model) to the second model, which detects the segmentation of the page. Even though two models run in the "fast" method, since this is a non-visual approach, it requires fewer resources and is faster, but slightly less accurate. A rough sketch of this two-stage idea is shown below.
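To make the two-stage "fast" pipeline concrete, here is a minimal illustrative sketch of how two LightGBM classifiers can be chained in this way. The feature layout, label sets, and random data are placeholders I made up for the example, not the actual features or training setup used by the service:

```python
# Illustrative sketch of a two-stage non-visual pipeline (not the real training code).
import lightgbm as lgb
import numpy as np

# Stage 1: predict a type for every token (e.g. title, text, table) from
# layout features extracted from the Poppler XML (positions, font sizes, ...).
token_features = np.random.rand(1000, 8)            # hypothetical per-token features
token_type_labels = np.random.randint(0, 4, 1000)   # hypothetical type labels
token_type_model = lgb.LGBMClassifier()
token_type_model.fit(token_features, token_type_labels)

predicted_types = token_type_model.predict(token_features)

# Stage 2: feed the original features plus the predicted types to a second
# classifier that decides the segmentation (here simplified to a binary
# "token starts a new segment" label).
segmentation_features = np.hstack([token_features, predicted_types.reshape(-1, 1)])
segment_labels = np.random.randint(0, 2, 1000)      # hypothetical segmentation labels
segmentation_model = lgb.LGBMClassifier()
segmentation_model.fit(segmentation_features, segment_labels)

segment_starts = segmentation_model.predict(segmentation_features)
```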
So, there is no connection between the visual and non-visual models. These are just two different "modes" you can use according to your needs.
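If it helps, here is a hypothetical sketch of how the two modes might be selected when calling the service over HTTP. The URL and the `fast` form field are assumptions on my part, so please check the README for the actual interface:

```python
# Hypothetical usage sketch; endpoint and field names are assumptions.
import requests

with open("document.pdf", "rb") as pdf:
    # Default: the visual model (slower, more accurate).
    visual_result = requests.post("http://localhost:5060", files={"file": pdf})

with open("document.pdf", "rb") as pdf:
    # Non-visual mode: the two LightGBM models (faster, lighter).
    fast_result = requests.post(
        "http://localhost:5060",
        files={"file": pdf},
        data={"fast": "true"},
    )
```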
I hope this answers your question! :)
If you need anything else, please do not hesitate to reach out. Thanks!
Absolutely clear. Thanks for the prompt response! Looking forward to playing with this over the weekend.
Hi there, interesting project here. I was reading the model description, but I was unable to understand how the second model is integrated with the first.
For example, is the second model a second-layer analysis that refines the output of the first model, or is it one part of a blended model together with the first?
Appreciate your input :)