lucaro / MeGraS

MIT License

Unable to correctly segment the text by characters. #8

Open duanhuiran opened 2 hours ago

duanhuiran commented 2 hours ago

Steps to reproduce this bug:

  1. Upload a .txt file that contains non-ASCII characters (e.g., "Zürich").
  2. Use the API to segment the text file by characters.
  3. The response retrieved via the object URL shows misaligned characters in the segmented results.
  4. The response retrieved via the object URL is not UTF-8 encoded.
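One way the symptom in step 4 could arise (an assumption, not confirmed from the MeGraS code) is that the stored UTF-8 bytes are interpreted with a different charset by the client or server. A minimal Python sketch of that effect, using Latin-1 as a hypothetical wrong charset:

```python
# Bytes of "Zürich" as they would be stored when the file is UTF-8.
raw = "Z\u00fcrich".encode("utf-8")

# Correct interpretation: decode with the same charset used to encode.
as_utf8 = raw.decode("utf-8")

# Hypothetical wrong interpretation: each byte becomes one character,
# so the two bytes of "ü" (0xC3 0xBC) turn into two unrelated characters.
as_latin1 = raw.decode("latin-1")

utf8_length = len(as_utf8)      # 6 characters
latin1_length = len(as_latin1)  # 7 characters, one per byte
```

If this is what happens, declaring the charset explicitly in the response (for example `text/plain; charset=utf-8`) would be one thing to check.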

Hypothesis: MeGraS currently saves the byte length of the text buffer as the segment bounds. However, the number of characters does not always equal the number of bytes: in UTF-8, a non-ASCII character such as "ü" occupies more than one byte.
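The mismatch can be reproduced outside MeGraS with a short Python sketch. The two helper functions below are hypothetical and only illustrate the difference between slicing by byte offsets and slicing by character offsets; they are not taken from the MeGraS code base.

```python
text = "Z\u00fcrich"            # "Zürich" with a precomposed "ü"
encoded = text.encode("utf-8")

char_count = len(text)          # 6 characters
byte_count = len(encoded)       # 7 bytes, because "ü" takes two bytes


def segment_by_bytes(data: bytes, start: int, end: int) -> bytes:
    """Hypothetical byte-offset slicing: may cut a character in half."""
    return data[start:end]


def segment_by_chars(data: bytes, start: int, end: int) -> bytes:
    """Hypothetical character-offset slicing: decode, slice, re-encode."""
    return data.decode("utf-8")[start:end].encode("utf-8")


# The first two *characters* are "Zü"; the first two *bytes* are "Z" plus
# only the first half of "ü", which is not valid UTF-8 on its own.
by_chars = segment_by_chars(encoded, 0, 2)
by_bytes = segment_by_bytes(encoded, 0, 2)

try:
    by_bytes.decode("utf-8")
    bytes_slice_is_valid = True
except UnicodeDecodeError:
    bytes_slice_is_valid = False
```

Under these assumptions, bounds for a character segmentation would need to be computed on the decoded string rather than on the raw buffer.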

duanhuiran commented 2 hours ago

@lucaro