tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0
Feature Request: list.train and list.eval from different folders #211

Open Shreeshrii opened 3 years ago

Shreeshrii commented 3 years ago

The current implementation creates all-lstmf from the foo-ground-truth directory and splits it in two at the specified ratio using the head and tail commands.

The disadvantage of this approach is that when the training data contains only a few samples of some characters, there is no way to ensure that they are divided evenly between the training and eval groups. So it is quite possible that some characters are not used for training at all.
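To make the problem concrete, here is a minimal sketch in Python (not the Makefile's actual code) that mimics a head/tail split at a given ratio and reports which characters end up only in the eval part. The function names `split_by_ratio`, `chars_missing_from_train`, and `read_ground_truth` are hypothetical helpers invented for this illustration.

```python
from pathlib import Path


def split_by_ratio(items, ratio_train):
    """Mimic a head/tail split: the first part trains, the remainder evaluates."""
    n_train = int(len(items) * ratio_train)
    return items[:n_train], items[n_train:]


def chars_missing_from_train(train_texts, eval_texts):
    """Return the characters that occur in eval lines but in no training line."""
    train_chars = set("".join(train_texts))
    eval_chars = set("".join(eval_texts))
    return eval_chars - train_chars


def read_ground_truth(gt_dir):
    """Read every *.gt.txt transcription in gt_dir (sorted for reproducibility)."""
    return [p.read_text(encoding="utf-8").strip()
            for p in sorted(Path(gt_dir).glob("*.gt.txt"))]


if __name__ == "__main__":
    # Toy data: 'q' occurs only in the last line, which lands in the eval part.
    lines = ["a b", "b c", "c a", "a b c", "q a"]
    train, evaluation = split_by_ratio(lines, 0.8)
    print(sorted(chars_missing_from_train(train, evaluation)))
```

Running the same check on the output of `read_ground_truth` for a real ground-truth directory would show whether a given split leaves any character untrained.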

I suggest letting the user specify two directories, one with training data and one with testing data.
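Until something like this exists in the Makefile, a rough workaround could look like the sketch below. It assumes the `.lstmf` files have already been generated in two separate directories and that each list file is simply one `.lstmf` path per line. The directory names and the helper `write_list` are hypothetical and need adjusting to your setup.

```python
from pathlib import Path


def write_list(lstmf_dir, list_path):
    """Write the paths of all *.lstmf files under lstmf_dir, one per line."""
    paths = sorted(str(p) for p in Path(lstmf_dir).rglob("*.lstmf"))
    list_file = Path(list_path)
    # Create the output directory if it does not exist yet.
    list_file.parent.mkdir(parents=True, exist_ok=True)
    list_file.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return len(paths)


if __name__ == "__main__":
    # Hypothetical layout: one directory per split, lists written next to the model data.
    n_train = write_list("data/foo-train-ground-truth", "data/foo/list.train")
    n_eval = write_list("data/foo-eval-ground-truth", "data/foo/list.eval")
    print(f"{n_train} training samples, {n_eval} eval samples")
```

Keeping the two sets in separate directories means rare characters can be placed deliberately in the training set instead of depending on where the ratio cut falls.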

Additionally, it would be great to split the testing data further into two groups, one for eval and one for validation. One of the changes in PR #207 does this split with the existing head and tail approach. EDIT: see https://github.com/tesseract-ocr/tesstrain/pull/217

becZzZhao commented 3 years ago

Hi, could you specify which command does this: "Current implementation creates all-lstmf from the foo-ground-truth directory"? Thanks!

Shreeshrii commented 3 years ago

could you specify which command does this?

make lists --trace should show you all the commands executed for making the lists.

becZzZhao commented 3 years ago

Thanks!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Shreeshrii commented 3 years ago

https://groups.google.com/g/tesseract-ocr/c/HFpYH5i7VRw/m/72tnGgCmDAAJ

Question regarding use of custom list.train and list.eval

stweil commented 3 years ago

It is always possible to create custom list.train and list.eval and use those instead of the ones created by the Makefile.

bertsky commented 3 years ago

It could be documented, though.

However, there's a big catch: timestamps matter. If your manual list.train and list.eval are older than any of the *.gt.txt files (or the derived *.lstmf files), they will be overwritten by the next make. So perhaps we should offer an explicit manual override?
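As a stopgap, the timestamps of the custom lists can be refreshed before calling make. The sketch below assumes that the timestamp comparison described above is the only thing that triggers regeneration, which I have not checked against every rule in the Makefile. The helper `refresh_if_stale` and the paths in the usage part are hypothetical.

```python
import os
from pathlib import Path


def refresh_if_stale(list_files, input_files):
    """Touch each list file that is not newer than the newest input file.

    Returns the list files whose modification time was updated.
    """
    newest_input = max((os.path.getmtime(p) for p in input_files), default=None)
    refreshed = []
    for list_file in list_files:
        if newest_input is not None and os.path.getmtime(list_file) <= newest_input:
            os.utime(list_file)  # sets access and modification time to now
            refreshed.append(list_file)
    return refreshed


if __name__ == "__main__":
    # Hypothetical layout: adjust the directories to your own data.
    inputs = list(Path("data/foo-ground-truth").glob("*.gt.txt"))
    inputs += list(Path("data/foo-ground-truth").glob("*.lstmf"))
    lists = [p for p in (Path("data/foo/list.train"), Path("data/foo/list.eval"))
             if p.exists()]
    print(refresh_if_stale(lists, inputs))
```

This only works around the problem; an explicit override in the Makefile would be more robust, since any later change to the ground truth makes the lists stale again.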