tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Arabic training text is only 80 lines #6

Open Shreeshrii opened 5 years ago

Shreeshrii commented 5 years ago

The training text in langdata_lstm/ara is only 80 lines or so.

Shreeshrii commented 5 years ago

Training text for other languages is thousands of lines.

It seems the Arabic training text in this repo is the same as, or very similar to, the one in langdata (for 3.04).

https://github.com/tesseract-ocr/langdata_lstm/blob/master/ara/ara.training_text

https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text
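
A rough way to check how close the two files are is to compare them line by line. The sketch below uses Python's standard `difflib`; the helper name `line_similarity` is made up for illustration and is not part of any Tesseract tooling.

```python
import difflib


def line_similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] for two texts, compared line by line."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # autojunk=False turns off the "popular element" heuristic, so lines that
    # repeat often in a long file still take part in the match.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()
```

Assuming both repos are cloned locally, read `langdata_lstm/ara/ara.training_text` and `langdata/ara/ara.training_text` as UTF-8 and pass the two strings in. A ratio close to 1.0 would support the "same or similar" impression.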

Shreeshrii commented 5 years ago

Other languages with small training_text files (size in bytes, modification time, path):

     5826 Jun 18 09:48 tgl/tgl.training_text
     6022 Jun 18 09:48 afr/afr.training_text
     7386 Jun 18 09:48 ara/ara.training_text
     7544 Jun 18 09:48 kur/kur.training_text
    38579 Jun 18 09:48 amh/amh.training_text
   143591 Jun 18 09:48 asm/asm.training_text
   412473 Jun 18 09:48 bih/bih.training_text
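
A listing like the one above, with line counts added, can be produced with a short script. This is only a sketch: it assumes a local checkout laid out as `<lang>/<lang>.training_text`, and the function name `training_text_stats` is hypothetical.

```python
from pathlib import Path


def training_text_stats(root):
    """Return (size_bytes, line_count, relative_path) tuples, smallest file first."""
    root = Path(root)
    stats = []
    for path in root.glob("*/*.training_text"):
        data = path.read_bytes()
        # Count newline bytes, which matches what `wc -l` reports.
        line_count = data.count(b"\n")
        stats.append((len(data), line_count, path.relative_to(root).as_posix()))
    # Tuples sort by their first element, so the smallest files come first.
    return sorted(stats)
```

Calling `training_text_stats("langdata_lstm")` from the directory that contains the checkout and printing the first few tuples shows the smallest training texts along with their line counts.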