huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

IX Metadata extractor Multiselect backend: Data transmission issue #6742

Closed gabriel-piles closed 4 months ago

gabriel-piles commented 4 months ago

The metadata extractor for multi-select is malfunctioning in two ways:

Incomplete labeled data: The data extractor is not sending all labeled data for training purposes. While prediction functionality seems unaffected, some files are being skipped during the training process.

Missing context text: Multiselect fields are missing context on the UI. We need to display the 'segment_text' information from the service response.

konzz commented 4 months ago

@gabriel-piles I have a PR for the "Missing context text" problem. As for "Incomplete labeled data", it took me a while to understand why, but apparently the backend only sends data for entities whose main file is in the language you are working in. I guess this is intended, but I'm not sure whether we want to change this behaviour.

gabriel-piles commented 4 months ago

@konzz It feels a bit inconsistent to train only on the files in the language you're currently working in when the IX UI shows you files from other languages and you can make predictions for all the files regardless of language.
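
To make the mismatch concrete, here is a minimal TypeScript sketch of the two selection strategies. It is an illustration only: `LabeledFile`, `selectForTraining`, and `selectAllLabeled` are hypothetical names and shapes, not Uwazi's actual models or functions, and the real backend logic may differ.

```typescript
// Hypothetical shape, for illustration only (not Uwazi's real data model).
interface LabeledFile {
  entityId: string;
  filename: string;
  language: string; // language of the entity's main file, e.g. "en" or "es"
  labels: string[]; // multiselect values the user has labeled
}

// Behaviour described in this thread: only labeled files whose language
// matches the language the user is working in are sent for training.
function selectForTraining(files: LabeledFile[], workingLanguage: string): LabeledFile[] {
  return files.filter(f => f.labels.length > 0 && f.language === workingLanguage);
}

// Possible alternative (an assumption, not a decided change): send every
// labeled file, so training covers the same files that prediction does.
function selectAllLabeled(files: LabeledFile[]): LabeledFile[] {
  return files.filter(f => f.labels.length > 0);
}

// Made-up sample data.
const sampleFiles: LabeledFile[] = [
  { entityId: "e1", filename: "report_en.pdf", language: "en", labels: ["A"] },
  { entityId: "e2", filename: "informe_es.pdf", language: "es", labels: ["B"] },
  { entityId: "e3", filename: "draft_en.pdf", language: "en", labels: [] },
];

const currentBehaviour = selectForTraining(sampleFiles, "en");
const allLabeled = selectAllLabeled(sampleFiles);
```

With this sample data and "en" as the working language, the first function skips the labeled Spanish file while the second keeps it, which is the kind of skipped training file reported in this issue.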