tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

DAWG to WORD List and back to DAWG provide different output #780

Open vdevan opened 7 years ago

vdevan commented 7 years ago

We were testing with eng.traineddata. We use WPF / C# to build our own Windows user interface, with the 3.2.0-alpha2 version downloaded through the NuGet package manager. The problem is easy to reproduce. Unpack eng.traineddata using combine_tessdata.exe -u. Then, using dawg2wordlist.exe and eng.unicharset, convert the eng.punc-dawg or eng.number-dawg file to a word list. (Do not use the eng.word or eng.bigram file; the next step would take a lot of time.) Make a copy of the original eng.punc-dawg or eng.number-dawg file. Now, using wordlist2dawg.exe and the same eng.unicharset, create a new eng.punc-dawg or eng.number-dawg from the word list. Compare the output file with the original: you will find the files are different! Not sure if this problem is already listed here. If it is, my apologies for duplicating.

Shreeshrii commented 7 years ago

You need to use dawg2wordlist to convert the unpacked dawg files to wordlists. You should be able to view the wordlists in a text editor. You need wordlist2dawg to convert wordlists back to dawg files. You seem to have used them the other way round.

ShreeDevi


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

vdevan commented 7 years ago

That was a typo, which I have already corrected in the issue. The problem with converting a DAWG to a word list and the word list back to a DAWG does exist.

Shreeshrii commented 7 years ago

OK. I also get a dawg file of a different size. But when I extract the wordlist from the new dawg, it is the same size as the original wordlist.

$ dawg2wordlist eng.unicharset eng.number-dawg number.txt
Loading word list from eng.number-dawg
Reading squished dawg
Word list loaded.

$ wordlist2dawg number.txt eng.number-dawg-new eng.unicharset
Loading unicharset from 'eng.unicharset'
Reading word list from 'number.txt'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'eng.number-dawg-new'

$ ls -l eng.number*
-rwxrwxrwx 1 root root 6426 Mar 23 13:04 eng.number-dawg
-rwxrwxrwx 1 root root 3954 Mar 23 13:37 eng.number-dawg-new

$ dawg2wordlist eng.unicharset eng.number-dawg-new number-new.txt
Loading word list from eng.number-dawg-new
Reading squished dawg
Word list loaded.

$ ls -l number*
-rwxrwxrwx 1 root root 2534 Mar 23 13:38 number-new.txt
-rwxrwxrwx 1 root root 2534 Mar 23 13:36 number.txt
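Equal file sizes suggest, but do not prove, that the two extracted word lists hold the same entries. A minimal sketch of a content comparison follows, assuming the lists are UTF-8 text with one entry per line. The helper names `load_words` and `compare_wordlists` are hypothetical and not part of Tesseract.

```python
from pathlib import Path


def load_words(path):
    """Read a word list (assumed UTF-8, one entry per line) into a set."""
    text = Path(path).read_text(encoding="utf-8")
    # splitlines() copes with both LF and CRLF endings; blank lines are skipped.
    return {line for line in text.splitlines() if line}


def compare_wordlists(original, regenerated):
    """Return (entries only in original, entries only in regenerated)."""
    a = load_words(original)
    b = load_words(regenerated)
    return sorted(a - b), sorted(b - a)
```

Calling `compare_wordlists("number.txt", "number-new.txt")` on the files from the transcript would return two empty lists if the round trip preserved every entry, regardless of the order in which the entries were written.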

vdevan commented 7 years ago

Not just the size: check the byte at offset 11 (0x0b). It will be different too, even when the size is the same.
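To locate such differences without a hex editor, the two files can be compared byte by byte. This is only a sketch: the function name `differing_offsets` and the `limit` parameter are hypothetical, and it reads both files fully into memory, which is fine for small files such as the punc and number dawgs.

```python
def differing_offsets(path_a, path_b, limit=16):
    """Return (offsets, size_a, size_b).

    `offsets` holds up to `limit` positions, within the common prefix of
    the two files, where the bytes differ. The sizes are returned so the
    caller can also detect a length mismatch.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a = fa.read()
        data_b = fb.read()

    offsets = []
    for index, (byte_a, byte_b) in enumerate(zip(data_a, data_b)):
        if byte_a != byte_b:
            offsets.append(index)
            if len(offsets) >= limit:
                break
    return offsets, len(data_a), len(data_b)
```

If offset 11 is the only position that differs between two files of equal size, the first element of the returned tuple would be `[11]`.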

Not sure if this is significant at this stage, but let us work with our program further and report back if there are issues.

Thanks

Always lovingly Vasu Devan V. God give me strength to love everyone. http://www.kamban.com.au http://brahas.com

Shreeshrii commented 7 years ago

I don't know if the difference is because of running on Windows.

I ran it under Bash on Windows 10.

nkrot commented 7 years ago

I confirm that extracting the wordlist and then converting it back to .word-dawg generates a file whose md5sum differs from that of the original word-dawg file extracted from the officially distributed deu.traineddata. I would like to know what consequences this may have.

I am using Tesseract v3.05, which I compiled from source, and deu.traineddata from https://github.com/tesseract-ocr/tessdata/blob/3.04.00/deu.traineddata
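For anyone repeating the md5sum check on a system without the `md5sum` command, a rough Python equivalent is sketched below. The helper names `md5_of_file` and `same_content` are hypothetical. Matching digests for the extracted word lists, alongside differing digests for the dawg files, would only show that the word content survived the round trip; it would not explain why the binary files differ.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```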