Open jerinphilip opened 7 years ago
Evaluation metrics for these experiments are:
1) Character Error Rate (%) = sum of the edit distances between each predicted word and its ground truth (GT), divided by the sum of all GT lengths, multiplied by 100
2) Word Error (%) = number of incorrect words divided by total number of words, multiplied by 100
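For reference, here is a minimal sketch of how the two metrics could be computed. It is not the code in this repository: the function names are hypothetical, the edit distance is plain Levenshtein, and lengths are counted in Unicode code points, which may or may not match how the experiments count characters in Malayalam or Hindi. The word-level part reports both the share of correct words and its complement, since either may be the number that gets reported.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predictions, truths):
    """Sum of word-level edit distances over the sum of GT lengths, in percent."""
    total_edits = sum(edit_distance(p, t) for p, t in zip(predictions, truths))
    total_length = sum(len(t) for t in truths)
    return 100.0 * total_edits / total_length


def word_accuracy(predictions, truths):
    """Share of predicted words that exactly match their GT, in percent."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return 100.0 * correct / len(truths)


def word_error_rate(predictions, truths):
    """Share of predicted words that differ from their GT, in percent."""
    return 100.0 - word_accuracy(predictions, truths)
```

Both functions assume `predictions` and `truths` are aligned lists of the same, non-zero length.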
@Deepayan137: I've modified the code to use a config file, which is supplied as an argument to the script. I've also added the ability to spawn a process for each book; I'll enhance it further in a while. Can you create a config file for Hindi, modeled on the one for Malayalam in configs/malayalam.json? I want to make sure it's usable for other languages as well before proceeding further.
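As a rough illustration of the config-as-argument pattern, a script could read the JSON file like this. This is only a sketch: the `-c/--config` flag and the function names are assumptions, not the actual interface of the script, and the keys inside the config are not shown because they are not listed in this thread.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command line; the flag name here is hypothetical."""
    parser = argparse.ArgumentParser(description="Run experiments for one language.")
    parser.add_argument("-c", "--config", required=True,
                        help="path to a JSON config, e.g. configs/malayalam.json")
    return parser.parse_args(argv)


def load_config(path):
    """Read a JSON config file into a dictionary."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# Typical use inside the script's entry point:
#   args = parse_args()
#   config = load_config(args.config)
```

Keeping everything language-specific in the JSON file means a Hindi run should only need a different path on the command line.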
Let's start running experiments on other books as soon as possible.
I'm planning to incorporate leaving out a fraction of the vocabulary: n experiments, leaving out k/n of the vocabulary. Sir insisted on leave-one-out in the last meeting.
We'll leave out fractions like 0.2, 0.4, 0.6, 0.8 for a start and then present the inferences in the next meeting.
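One way the fractional leave-out could be set up is sketched below. The function names, the fixed seed and the random choice of which words to drop are all assumptions made for illustration; the actual experiments may select the left-out words differently.

```python
import random


def split_vocabulary(vocabulary, fraction, seed=0):
    """Return (kept, left_out), leaving out about `fraction` of the unique words."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = sorted(set(vocabulary))      # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(words)
    n_out = round(fraction * len(words))
    return words[n_out:], words[:n_out]


def leave_out_experiments(vocabulary, fractions=(0.2, 0.4, 0.6, 0.8), seed=0):
    """Yield one (fraction, kept, left_out) split per experiment."""
    for fraction in fractions:
        kept, left_out = split_vocabulary(vocabulary, fraction, seed)
        yield fraction, kept, left_out
```

Because the same seed is reused, the words left out at 0.2 are a subset of those left out at 0.4, and so on, which keeps the experiments comparable.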
Sure, will do it ASAP.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jerinphilip/ocr-retrain/issues/4#issuecomment-303625582, or mute the thread https://github.com/notifications/unsubscribe-auth/AUKZLvEeB7NjrE0CfKjLtFpd1MfBjdNsks5r88crgaJpZM4NhOQ4 .
Added Hindi.json to the configs directory, but I'm unable to push it due to some permission issues.
@Deepayan137: I'm getting the same issues; we'll do the pushes later. I'm trying to implement parallel processing of the jobs for a speedup. I should be able to get it up and running by evening.
Can you take up one of the pending things in the checklist above and get started on it, meanwhile?
Sure, I am currently working on word suggestions using bigrams. Let's see if it can make our life easier :p
@Deepayan137: Can you use aux/generate_alphabet on the list of words and get the required alphabet? That entry was left empty in the Hindi.json configuration file. It's required to generate words at edit distance less than two for the dictionary suggestions.
As soon as that's done I can run experiments for Hindi as well and produce stats. Thanks.
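To show why the alphabet is needed, here is a minimal sketch of generating every string within one edit of a word and filtering it against a dictionary. The function names are hypothetical and this is not the repository's implementation; it uses plain Levenshtein edits (deletion, substitution, insertion) and does not treat transposition as a single edit.

```python
def within_one_edit(word, alphabet):
    """All strings at edit distance 0 or 1 from `word`, using characters of `alphabet`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right
                     for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return {word} | deletes | substitutions | inserts


def dictionary_suggestions(word, vocabulary, alphabet):
    """Dictionary words that are at most one edit away from `word`."""
    return sorted(within_one_edit(word, alphabet) & set(vocabulary))
```

Insertions and substitutions have to try every character of the alphabet, so an empty alphabet entry leaves only deletions as candidates.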
Sure, will do the necessary.
Hi Jerin, I added the alphabet trie structure for Hindi and also the n-gram approach for error correction.
Presently my error correction code takes the vocabulary as a list. I tried it on an English corpus of 24k words and the results are OK. The suggestions might improve with a larger dataset.
I will also try to add a context-based module which will look at the preceding words and come up with a probability for each of the suggested words.
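A context-based ranking of this kind could look roughly like the sketch below, which scores each suggested word by a bigram probability given the single preceding word. It is an illustration only, not the code in suggestions.py: the function names, whitespace tokenization and add-one smoothing are all assumptions.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams, the contexts they start from, and the set of known words."""
    bigrams, contexts, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.split()
        vocabulary.update(words)
        contexts.update(words[:-1])           # words that are followed by another word
        bigrams.update(zip(words, words[1:]))
    return bigrams, contexts, vocabulary


def rank_candidates(previous_word, candidates, bigrams, contexts, vocabulary):
    """Sort candidates by add-one smoothed P(candidate | previous_word), best first."""
    def probability(candidate):
        return ((bigrams[(previous_word, candidate)] + 1)
                / (contexts[previous_word] + len(vocabulary)))
    scored = [(candidate, probability(candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The smoothing keeps unseen word pairs from getting probability zero, so a candidate is only pushed down the list rather than removed.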
@Deepayan137: Where is this? Can you push the files over here, once it's presentable?
The alphabet is in parameters/error/Hindi and the error correction is in SRC/error_module/suggestions.py.
I tried pushing it but was facing the same permission issues.
@Deepayan137: Use your local repo to commit and push. Then pull on the /OCRData2 repo. It's a bit of overhead, but that way whatever you commit is in your name.
Okay, will do that.
@Deepayan137: It's not supposed to be a trie. The alphabet is a file containing a single line, which consists of all the characters of the language's alphabet together with any symbols and other characters that might appear.
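To make the expected format concrete, here is a small sketch that collects the characters from a word list and writes them as one line. It is not the aux/generate_alphabet script itself; the function names and the `extra_symbols` parameter are hypothetical.

```python
def build_alphabet(words, extra_symbols=""):
    """Every distinct character in `words`, plus any extra symbols, as one sorted string."""
    characters = set(extra_symbols)
    for word in words:
        characters.update(word)   # adds each code point of the word
    return "".join(sorted(characters))


def write_alphabet(path, alphabet):
    """Write the alphabet as a single line of text."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(alphabet + "\n")


def read_alphabet(path):
    """Read the single-line alphabet back, dropping only the trailing newline."""
    with open(path, encoding="utf-8") as fp:
        return fp.readline().rstrip("\n")
```

Characters are taken per Unicode code point, so vowel signs and other combining marks end up as separate entries in the line.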
Oh sorry, my bad. I'll send you the text file containing all the characters in a single line.
Alphabet file sent. Please check.