Transcription question - GitHub Issues

alysawyer commented 1 year ago

Hi - as we copy and paste the text, we are noticing numerous issues with the OCR (beyond just the one you mentioned). There are probably a lot more, but here are some that stood out:
m becomes rn
Ō becomes theta
Iūlius becomes liilius
peristylō becomes perist)i'lo or perisrylo
domī becomes d0ml

Is it still worth training on all 35 chapters if the finetune data is kinda messed up?
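A rough way to gauge how widespread errors like the ones listed above are is to flag tokens containing characters that should not occur in macronized Latin. The sketch below is only illustrative: the function name `suspicious_tokens` and the allowed alphabet are assumptions, not something from the repo. It would catch cases like d0ml and perist)i'lo, but not substitutions that are still plain letters (rn for m, perisrylo).

```python
import re
import unicodedata

# Assumption: the textbook text uses only the basic Latin alphabet
# plus vowels with macrons. Anything else inside a word is suspect.
LATIN_WORD = re.compile(r"^[A-Za-zĀĒĪŌŪȲāēīōūȳ]+$")

def suspicious_tokens(text):
    """Return whitespace-separated tokens that contain unexpected characters."""
    # Normalize so that a vowel + combining macron becomes one precomposed character.
    text = unicodedata.normalize("NFC", text)
    flagged = []
    for raw in text.split():
        # Drop punctuation at the edges of the token only.
        token = raw.strip(".,;:!?\"“”‘’()[]")
        if token and not LATIN_WORD.match(token):
            flagged.append(token)
    return flagged
```

Running this per chapter and counting flagged tokens gives only a lower bound on the OCR error rate, since letter-for-letter confusions pass the check.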

mikeizbicki commented 1 year ago

Hmm... that's unfortunate. It's still worth doing the fine-tune since it's low effort on your end, and we can still measure how much it improves performance. My guess is that it will still help.

We can use the gap between the hand-copied fine-tune performance and the OCR fine-tune performance to estimate how big an impact the OCR problems have. The best way to do this would be to do 2 fine-tunes on the OCR text: one on the full book and one on only the first 5 chapters. The full-book fine-tune would be the one to use for the paper, and the first-5-chapters fine-tune would give us the estimate of how much the quality dropped.
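The comparison described above could be sketched roughly as follows. This assumes the hand-copied fine-tune covers the same first 5 chapters as the small OCR fine-tune and that all three models are scored with the same metric, where higher is better; the function and argument names are hypothetical.

```python
def ocr_comparison(hand_copied_score, ocr_5ch_score, ocr_full_score):
    """Summarize the gaps between the three fine-tunes.

    All arguments are scores on the same evaluation (higher is better).
    Assumption: the hand-copied and 5-chapter OCR fine-tunes use the same chapters,
    so their difference isolates the effect of OCR errors.
    """
    return {
        # How much performance is lost to OCR errors on identical chapters.
        "ocr_quality_drop": hand_copied_score - ocr_5ch_score,
        # How much is gained by training on the whole book despite OCR noise.
        "full_book_gain": ocr_full_score - ocr_5ch_score,
    }
```

If `ocr_quality_drop` turns out to be small relative to `full_book_gain`, that would suggest the extra chapters matter more than the OCR noise.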

irajmoradi commented 1 year ago

We found a website that has the text of the textbook here: http://76.23.73.83/llpsi/XXX.html

We managed to get text files from all the chapters and are setting up our fine-tune data now.

mikeizbicki / modulus-magnus-linguae

Transcription question #52