tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Split evaluation data into two lists, list.eval and list.validate #217

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 3 years ago

If the train/eval ratio is 90:10, then the evaluation portion (list.test) is split into two equal parts, list.eval and list.validate.
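The proposed split can be sketched as a few lines of shell. This is a minimal illustration, not the PR's actual Makefile change; the sample list created here stands in for the list that tesstrain normally generates.

```shell
set -e
# Stand-in for the evaluation list normally generated by tesstrain.
printf '%s\n' line1 line2 line3 line4 line5 line6 line7 line8 line9 line10 > list.test

# Split list.test into two equal halves: list.eval and list.validate.
total=$(wc -l < list.test)
half=$(( total / 2 ))
head -n "$half" list.test > list.eval
tail -n +"$(( half + 1 ))" list.test > list.validate
wc -l list.eval list.validate
```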

stweil commented 3 years ago

A 2nd list for validation should not be needed. Tesseract already uses list.eval for the validation:

At iteration 27453/48040/48082, Mean rms=0.589%, delta=2.793%, char train=9.8%, word train=17.46%, skip ratio=0.1%,  New worst char error = 9.8
At iteration 26336, stage 1, Eval Char error rate=13.930723, Word error rate=22.55628 wrote checkpoint.

I think the first CER value of 9.8 % is the accuracy on the training data, while the second CER value of 13.9 % is the accuracy on the evaluation data. That evaluation runs automatically in parallel with the training process (doubling the number of threads from time to time). Sometimes it does not run because a previous evaluation is still in progress; that happens quite often with large evaluation data. If additional results are desired, it is possible to run Tesseract manually with the current checkpoint or an already generated traineddata file on the evaluation data.
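Such a manual run uses the lstmeval tool that ships with Tesseract's training tools. The paths below are hypothetical placeholders for a model named "foo"; substitute your own model directory and list file.

```shell
# Evaluate a training checkpoint on a held-out list (paths are placeholders).
lstmeval \
  --model data/foo/checkpoints/foo_checkpoint \
  --traineddata data/foo/foo.traineddata \
  --eval_listfile data/foo/list.validate
```

The same invocation works with a final .traineddata file passed as --model, in which case --traineddata can be omitted.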

stweil commented 3 years ago

@Shreeshrii, according to your plots it looks like there is a difference between "evaluation" and "validation". How does a plot change if you exchange list.eval and list.validate and run the training again?

Shreeshrii commented 3 years ago

I think the first CER value of 9.8 % is the accuracy on the training data while the second CER value of 13.9 % is the accuracy on the evaluation data.

Yes, that's correct.

The reason for adding a third list was that list.eval is evaluated while lstmtraining is running, and its results might be used in the training process to improve results. Having a 3rd list and using it later with lstmeval helps validate that the training CER and evaluation CER are similar to the validation CER on images not seen at all during the training process.

When the training set is very large and representative, the numbers are similar, but in some cases there are wide differences.

I will share some recent graphs.


Shreeshrii commented 3 years ago

How does a plot change if you exchange list.eval and list.validate and run the training again?

@stweil I have not tried that. I will test and report.

Shreeshrii commented 3 years ago

When the training set is very large and representative, then the graphs are similar for both list.eval and list.validate. Hence closing this PR as not needed.