Open vdevan opened 7 years ago
You need to use dawg2wordlist for converting from the unpacked dawg files to wordlists. You should be able to view the wordlists in a text editor. You need wordlist2dawg for converting from wordlists to dawg. You seem to have used the opposite.
ShreeDevi
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
On Thu, Mar 23, 2017 at 3:25 AM, Vijaya Vasudevan notifications@github.com wrote:
We were testing on eng.traineddata. We use WPF / C# to create our own Windows user interface, using the NuGet package manager to download the 3.2.0-alpha2 version. The problem can be easily reproduced. Using combine_tessdata.exe -u, unpack eng.traineddata. Then using wordlist2dawg.exe and the eng.unicharset, convert the eng.punc-dawg file or eng.number-dawg file. (Do not use the eng.word or eng.bigram file; the next step will take a lot of time.) Make a copy of the eng.punc-dawg or eng.number-dawg file (the original). Now using dawg2wordlist.exe and the same eng.unicharset, create an eng.punc-dawg or eng.number-dawg. Compare the output file with the original file. You will find the files are different! Not sure if this problem is already listed here. If it is, my apologies for duplicating.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/780, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o9j9rKZsfNj3JCUe4m_XVu0Q7o4sks5roZjngaJpZM4Ml36q .
That was a typo, which I have already corrected in the issue. The problem with converting a DAWG to a word list and a word list back to a DAWG does exist.
OK. I also get a different-sized dawg file. But when I extract the wordlist from the new dawg, it is the same size as the original wordlist.
$ dawg2wordlist eng.unicharset eng.number-dawg number.txt
Loading word list from eng.number-dawg
Reading squished dawg
Word list loaded.

$ wordlist2dawg number.txt eng.number-dawg-new eng.unicharset
Loading unicharset from 'eng.unicharset'
Reading word list from 'number.txt'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'eng.number-dawg-new'

$ ls -l eng.number*
-rwxrwxrwx 1 root root 6426 Mar 23 13:04 eng.number-dawg
-rwxrwxrwx 1 root root 3954 Mar 23 13:37 eng.number-dawg-new

$ dawg2wordlist eng.unicharset eng.number-dawg-new number-new.txt
Loading word list from eng.number-dawg-new
Reading squished dawg
Word list loaded.

$ ls -l number*
-rwxrwxrwx 1 root root 2534 Mar 23 13:38 number-new.txt
-rwxrwxrwx 1 root root 2534 Mar 23 13:36 number.txt
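Equal file sizes suggest, but do not prove, that the two word lists hold the same words. A minimal sketch of a content comparison follows, assuming a POSIX shell with `sort`, `cmp` and `mktemp` available. The function name `same_words` is hypothetical and is not part of Tesseract.

```shell
# same_words FILE1 FILE2   (hypothetical helper, not a Tesseract tool)
# Exit status 0: both files contain the same set of lines,
#                ignoring line order and duplicate lines.
# Exit status 1: the sets differ.
# Exit status 2: a file could not be read or a temp file could not be made.
same_words() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # LC_ALL=C gives a byte-wise sort that does not depend on the locale.
    if LC_ALL=C sort -u "$1" > "$a" && LC_ALL=C sort -u "$2" > "$b"; then
        if cmp -s "$a" "$b"; then rc=0; else rc=1; fi
    else
        rc=2
    fi
    rm -f "$a" "$b"
    return "$rc"
}
```

With the files from the session above, this could be run as `same_words number.txt number-new.txt && echo "same words"`.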
Not just the size: check the byte at offset 11 (0x0b). It will be different too, even though the size is the same.
Not sure if this is significant at this stage, but let us work with our program further and we will highlight any issues we find.
Thanks
Always lovingly Vasu Devan V. God give me strength to love everyone. http://www.kamban.com.au http://brahas.com
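To see exactly which bytes differ between two dawg files, rather than inspecting one position by hand, the offsets can be listed with `cmp -l`. This is only a sketch, assuming a POSIX shell with `cmp` and `awk`. The function name `diff_offsets` is hypothetical, and the sketch prints 0-based offsets, which is an assumption because the comment above does not say whether position 11 is counted from 0 or from 1.

```shell
# diff_offsets FILE1 FILE2   (hypothetical helper, not a Tesseract tool)
# Prints one line per differing byte: the 0-based offset in decimal and hex.
# Only the common prefix is compared when the files differ in length.
diff_offsets() {
    # cmp -l reports 1-based byte numbers, so subtract 1 for an offset.
    # "|| true" keeps the pipeline usable under "set -e" / pipefail,
    # because cmp exits non-zero whenever the files differ.
    { cmp -l "$1" "$2" 2>/dev/null || true; } |
        awk '{ printf "%d 0x%02x\n", $1 - 1, $1 - 1 }'
}
```

For example, `diff_offsets eng.number-dawg eng.number-dawg-new | head` would show the first few differing offsets of the two files discussed in this thread.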
I don't know if the difference is because of running on Windows.
I ran it under bash on Windows 10.
I confirm that extracting the wordlist and then converting it back to .word-dawg generates a file whose md5sum differs from that of the original word-dawg file extracted from the officially distributed deu.traineddata. I would like to know what consequences this may have.
I am using tesseract v3.05, which I compiled from source, and the deu.traineddata from here: https://github.com/tesseract-ocr/tessdata/blob/3.04.00/deu.traineddata
We were testing on eng.traineddata. We use WPF / C# to create our own Windows user interface, using the NuGet package manager to download the 3.2.0-alpha2 version. The problem can be easily reproduced. Using combine_tessdata.exe -u, unpack eng.traineddata. Then using dawg2wordlist.exe and the eng.unicharset, convert the eng.punc-dawg file or eng.number-dawg file. (Do not use the eng.word or eng.bigram file; the next step will take a lot of time.) Make a copy of the eng.punc-dawg or eng.number-dawg file (the original). Now using wordlist2dawg.exe and the same eng.unicharset, create an eng.punc-dawg or eng.number-dawg. Compare the output file with the original file. You will find the files are different! Not sure if this problem is already listed here. If it is, my apologies for duplicating.
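The reproduction steps above can be sketched as one shell function that runs the round trip and reports separately on the DAWG files and on the word lists. This is a rough sketch, not an official test: the argument order for `dawg2wordlist` and `wordlist2dawg` is taken from the commands shown earlier in this thread, while the function name `roundtrip_check` and the override variables `DAWG2WORDLIST` and `WORDLIST2DAWG` are hypothetical and exist only so the tool paths can be changed.

```shell
# roundtrip_check UNICHARSET DAWG   (hypothetical helper)
# DAWG -> word list -> rebuilt DAWG -> word list, then compare both pairs.
# Exit status 0: word lists identical; 1: word lists differ; 2: a tool failed.
roundtrip_check() {
    unicharset=$1
    dawg=$2
    # Assumed override variables; they are not Tesseract settings.
    d2w=${DAWG2WORDLIST:-dawg2wordlist}
    w2d=${WORDLIST2DAWG:-wordlist2dawg}
    tmp=$(mktemp -d) || return 2
    if "$d2w" "$unicharset" "$dawg" "$tmp/words.txt" >/dev/null &&
       "$w2d" "$tmp/words.txt" "$tmp/rebuilt-dawg" "$unicharset" >/dev/null &&
       "$d2w" "$unicharset" "$tmp/rebuilt-dawg" "$tmp/words-new.txt" >/dev/null
    then
        # Byte-for-byte comparison of the binary DAWG files.
        if cmp -s "$dawg" "$tmp/rebuilt-dawg"; then
            echo "DAWG files: identical"
        else
            echo "DAWG files: differ"
        fi
        # Byte-for-byte comparison of the extracted word lists.
        if cmp -s "$tmp/words.txt" "$tmp/words-new.txt"; then
            echo "word lists: identical"
            rc=0
        else
            echo "word lists: differ"
            rc=1
        fi
    else
        echo "conversion failed" >&2
        rc=2
    fi
    rm -rf "$tmp"
    return "$rc"
}
```

A call such as `roundtrip_check eng.unicharset eng.number-dawg` would then show whether a differing DAWG file still decodes to the same word list, which is the distinction the replies in this thread turn on.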