kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Batch mode seems to have trouble encoding characters from CJK languages #1156

Closed lfoppiano closed 3 months ago

lfoppiano commented 3 months ago

I ran some Japanese documents in batch mode and the characters in the output were not properly encoded. I forgot, is batch mode deprecated? If so, we could keep this task for removing it; otherwise I can send some examples.

kermitt2 commented 3 months ago

Batch mode is not deprecated and works the same as the service. Do you observe a different output from the service?

lfoppiano commented 3 months ago

OK, it's very likely not a problem with Grobid.

I think there is something wrong with the Google Cloud Linux machine 🤔 If I process the document through the API, the output is correctly encoded on both GC Linux and Mac. If I process it in batch mode, only the GC Linux machine returns garbage. So it's likely something wrong with the locale configured there, or something similar.

For reference, here is the PDF: https://jaas-org.jp/uploads/files/3549/30_1_31.pdf I'm also attaching the two versions of the output files: Archive.zip
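
As a rough illustration of the locale hypothesis above, here is a minimal Python sketch of how the same text can come out intact or as garbage depending on the charset a process uses by default. It is not Grobid code (Grobid runs on the JVM), and the helper names `write_like_process` and `looks_garbled` and the sample string are hypothetical, chosen only for this example. The assumption being illustrated is that the batch process on the affected machine picks up a non-UTF-8 default charset from its locale.

```python
import locale


def write_like_process(text: str, encoding: str) -> bytes:
    """Encode text as a process would if `encoding` were its default charset.

    Characters the charset cannot represent are replaced with '?'.
    """
    return text.encode(encoding, errors="replace")


def looks_garbled(original: str, raw: bytes) -> bool:
    """Return True if `raw`, read back as UTF-8, no longer matches `original`."""
    return raw.decode("utf-8", errors="replace") != original


# Hypothetical sample text standing in for content extracted from a Japanese PDF.
sample = "日本語の文書"

utf8_bytes = write_like_process(sample, "utf-8")
ascii_bytes = write_like_process(sample, "ascii")

# What this machine's locale would select as the default text encoding.
print("preferred encoding on this machine:", locale.getpreferredencoding(False))
print("utf-8 output garbled:", looks_garbled(sample, utf8_bytes))
print("ascii output garbled:", looks_garbled(sample, ascii_bytes))
```

If a check like this shows a non-UTF-8 preferred encoding on the affected host, comparing the output of the `locale` command on both machines would be a reasonable next step. On JVMs older than JDK 18 the default charset follows the platform locale, so exporting a UTF-8 locale or passing `-Dfile.encoding=UTF-8` to the batch command may be worth trying. Whether that resolves this particular case is not confirmed in the thread.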