Open bertsky opened 4 months ago
If training was done on DIBCO datasets, some of which according to @VChristlein represent non-text differently, how can the models here be expected to perform on non-textual parts (separators, ornaments, line drawings, edge drawings, etc.)?

@bertsky First, let me clarify that our binarization models are not trained exclusively on the DIBCO dataset. In the early stages, DIBCO was the only ground truth (GT) available to us, so we initially trained some models on it. We then used those models to generate pseudo-labeled GT from the SBB datasets: I applied thresholding to binarize almost everything (every element in the document images), and then used scaling and cropping to improve the binarization and keep only the desired regions of each image. As a result, we ended up with a mix of the DIBCO dataset, which contains mostly text content, and pseudo-labeled SBB datasets, which also include non-text content.
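Roughly, the pseudo-labeling step looked like the following minimal sketch. This is only an illustration of the threshold-then-scale-and-crop idea described above, not the actual SBB pipeline: the Otsu threshold stands in for whatever thresholding was really applied (e.g. to the DIBCO-trained model's output), and the scale factor, crop margin, and file names are placeholders.

```python
import cv2

def pseudo_label(image_path, out_path, scale=2.0, crop_margin=32):
    """Generate a rough binarization pseudo-label for one document image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)

    # Crop away the page border, keeping only the desired part of the image.
    img = img[crop_margin:-crop_margin, crop_margin:-crop_margin]

    # Upscale before thresholding so thin strokes survive binarization.
    h, w = img.shape
    img = cv2.resize(img, (int(w * scale), int(h * scale)),
                     interpolation=cv2.INTER_CUBIC)

    # Threshold "almost everything": a single global Otsu threshold per page.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Scale back to the cropped input size so label and image stay aligned.
    binary = cv2.resize(binary, (w, h), interpolation=cv2.INTER_NEAREST)
    cv2.imwrite(out_path, binary)

# Hypothetical usage on one SBB page image:
pseudo_label("sbb_page.png", "sbb_page_pseudo_gt.png")
```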