Experiments: Post ARL

jerinphilip commented 7 years ago

- [x] Leave one out, enable on a page.
  - Vocabulary to be modified at runtime.
  - To be run separately from the consolidated corpus.
- [ ] All the weird metrics You-Know-Who wants.
  - [x] Character Error Rate
  - [x] Word Error Rate
  - [x] Correct - Error Module = OCR = Ground Truth
  - [x] Correct - OCR Matching Ground Truth, unvalidated by Error Module
  - [x] Correct - Error Module = OCR != Ground Truth
  - [x] Correctable - OCR Output != Ground Truth, Ground Truth in dictionary.
  - [x] Uncorrectable - OCR Output != Ground Truth, Ground Truth not in dictionary.
  - [ ] Time Metrics
- [x] Best case scenario: Add ground truths in the vocabulary.
- [ ] Fix the models. Get word level models.
- [ ] Fix line mapping function, in order to run this on line level.
- [ ] Consolidate corpus for multiple languages
  - [ ] Gurumukhi, Hindi first.
  - [ ] Malayalam, Telugu last.
- [x] Run on India Today, with newspaper dataset.
- [ ] Estimate Error Correction Time
- [x] Estimate Word Error, Dictionary Overlap Estimate.
- [ ] LSH: Approximate Edit Distance to speed things up.
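The word-level buckets in the checklist can be sketched roughly as follows. This is only an illustration: it assumes the error module accepts a word exactly when the word is in the vocabulary, and the names `categorize`, `tally` and the bucket labels are made up here, not taken from the repository.

```python
from collections import Counter


def categorize(ocr_word, gt_word, vocabulary):
    """Return the checklist buckets for one (OCR word, ground-truth word) pair.

    Assumption: the error module accepts a word iff it is in `vocabulary`.
    """
    accepted = ocr_word in vocabulary
    buckets = []
    if ocr_word == gt_word:
        # OCR is right; the only question is whether the error module confirms it.
        buckets.append("correct_validated" if accepted else "correct_unvalidated")
    else:
        if accepted:
            # The error module lets through a word that is actually wrong.
            buckets.append("accepted_but_wrong")
        # A wrong word can only be fixed from the dictionary if the GT is in it.
        buckets.append("correctable" if gt_word in vocabulary else "uncorrectable")
    return buckets


def tally(pairs, vocabulary):
    """Count bucket occurrences over an iterable of (OCR word, GT word) pairs."""
    counts = Counter()
    for ocr_word, gt_word in pairs:
        counts.update(categorize(ocr_word, gt_word, vocabulary))
    return counts
```

Note that a wrong word accepted by the error module lands in two buckets here (accepted but wrong, plus correctable or uncorrectable); whether those should be counted separately is a choice this sketch makes, not something the checklist states.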

mineshmathew commented 7 years ago

Evaluation metrics for these experiments are:

1) Character Error Rate (%) = (sum of all word<>GT edit distances) divided by (sum of all GT lengths), multiplied by 100

2) Word Error Rate (%) = #incorrectwords divided by #totalwords, multiplied by 100
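A minimal sketch of the two metrics, assuming the OCR output and the GT are already aligned word by word as `(ocr, gt)` pairs. The function names are illustrative. The word-level figure is computed here as the share of words that differ from the GT; the share of matching words is 100 minus that value.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """100 * (sum of word<>GT edit distances) / (sum of GT lengths)."""
    total_edits = sum(edit_distance(ocr, gt) for ocr, gt in pairs)
    total_chars = sum(len(gt) for _, gt in pairs)
    return 100.0 * total_edits / total_chars


def word_error_rate(pairs):
    """100 * (number of words that differ from the GT) / (number of words)."""
    wrong = sum(1 for ocr, gt in pairs if ocr != gt)
    return 100.0 * wrong / len(pairs)
```

One caveat: `len()` and the edit distance above count Unicode code points, so in Indic scripts a vowel sign or other combining mark counts as its own character.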

jerinphilip commented 7 years ago

@Deepayan137: I've modified the code to use a config file, which is supplied as an argument to the script. I've also added the ability to spawn multiple processes for each book; I'll enhance it further in a while. Can you create a config file for Hindi, looking at the one for Malayalam in configs/malayalam.json? I want to make sure it's usable in other languages also, before proceeding further.
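Roughly, the config-as-argument plus process-per-book setup could look like the sketch below. The config keys `language` and `books`, and the function names, are hypothetical placeholders; the actual layout of configs/malayalam.json is not shown in this thread.

```python
import argparse
import json
from multiprocessing import Pool


def load_config(path):
    """Read a JSON config file (UTF-8) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run_book(job):
    """Placeholder for the per-book experiment; returns the book and a dummy result."""
    book, language = job
    return book, f"processed {book} ({language})"


def run_all(config, workers=2):
    """Run one job per book in a pool of worker processes."""
    # "language" and "books" are assumed keys, used only for illustration.
    jobs = [(book, config["language"]) for book in config["books"]]
    with Pool(processes=workers) as pool:
        return dict(pool.map(run_book, jobs))


def main(argv=None):
    """Parse the config path from the command line and run all books."""
    parser = argparse.ArgumentParser(description="Run experiments from a JSON config.")
    parser.add_argument("config", help="path to a JSON config file")
    parser.add_argument("--workers", type=int, default=2)
    args = parser.parse_args(argv)
    return run_all(load_config(args.config), workers=args.workers)
```

With this shape, adding a language only means adding another JSON file with the same keys and passing its path to `main`.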

Let's start running experiments on other books as soon as possible.

I'm planning to incorporate leave-out by fraction: n experiments, each leaving out k/n of the vocabulary. Sir insisted on leave-one-out in the last meeting.

We'll leave out fractions like 0.2, 0.4, 0.6 and 0.8 for a start, and then present the inferences in the next meeting.
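A sketch of the fractional leave-out, under the assumption that the held-out words are chosen by a seeded random shuffle (the thread does not say how they are picked). `leave_out` and `leave_out_schedule` are illustrative names.

```python
import random


def leave_out(vocabulary, fraction, seed=0):
    """Split the vocabulary into (kept, held_out), holding out roughly `fraction` of it."""
    words = sorted(vocabulary)            # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(words)
    cut = round(len(words) * fraction)
    return set(words[cut:]), set(words[:cut])


def leave_out_schedule(vocabulary, fractions=(0.2, 0.4, 0.6, 0.8), seed=0):
    """Map each fraction to its (kept, held_out) split."""
    return {fraction: leave_out(vocabulary, fraction, seed) for fraction in fractions}
```

Because every fraction reuses the same seed, the held-out sets are nested: everything left out at 0.2 is also left out at 0.4, which makes the runs easier to compare.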

Deepayan137 commented 7 years ago

Sure, will do it ASAP.

Deepayan137 commented 7 years ago

Added Hindi.json in the configs directory. Unable to push it, though, due to some permission issues.

jerinphilip commented 7 years ago

@Deepayan137: I'm getting issues too; we'll do the pushes later. I'm trying to implement parallel processing for the jobs for speedups. Should be able to get it up and running by evening.

Can you take up one of the pending things in the checklist above and get started on it, meanwhile?

Deepayan137 commented 7 years ago

Sure, I am currently working on word suggestions using bigrams. Let's see if it can make our life easier :p

jerinphilip commented 7 years ago

@Deepayan137: Can you use aux/generate_alphabet on the list of words and get the required alphabet? That entry was left empty in the Hindi.json configuration file. It's required to generate words at edit distance less than two for the dictionary suggestions.

As soon as that's done, I can run experiments for Hindi as well and produce stats. Thanks.
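To illustrate why the alphabet is needed: words at edit distance less than two are generated by deleting, replacing or inserting one alphabet character, then filtered against the dictionary. The sketch below assumes plain Levenshtein edits (no transpositions); `edits1` and `suggestions` are illustrative names, not the repository's functions.

```python
def edits1(word, alphabet):
    """All strings exactly one delete, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + ch + right[1:]
                for left, right in splits if right
                for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    # Replacing a character with itself reproduces the word, so drop it.
    return (deletes | replaces | inserts) - {word}


def suggestions(word, alphabet, vocabulary):
    """Dictionary words one edit away from `word`, in sorted order."""
    return sorted(edits1(word, alphabet) & set(vocabulary))
```

The candidate set grows with the alphabet size (about `(2 * len(word) + 1) * len(alphabet)` strings), so an empty alphabet entry leaves only deletions as candidates.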

Deepayan137 commented 7 years ago

Sure, will do the necessary.

Deepayan137 commented 7 years ago

Hi Jerin, I added the alphabet's trie structure for Hindi, and also the n-gram approach for error correction.

Presently, my error correction code takes the vocabulary as a list. I tried it on an English corpus of 24k words and the results are OK. The suggestions might improve with a larger data set.

I will also try to add a context-based module, which will look at the preceding words and come up with a probability for each of the suggested words.
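The thread does not show how the n-gram suggestions work, so the following is only one plausible reading: rank vocabulary words by how many character bigrams they share with the misspelled word (Jaccard overlap). The names `char_bigrams` and `suggest`, and the `^`/`$` boundary markers, are assumptions made for this sketch.

```python
def char_bigrams(word):
    """Set of character bigrams, with ^ and $ marking the word boundaries.

    Assumes ^ and $ never occur inside real words.
    """
    padded = f"^{word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def suggest(word, vocabulary, top_k=3):
    """Return the `top_k` vocabulary words with the highest bigram overlap with `word`."""
    target = char_bigrams(word)

    def score(candidate):
        grams = char_bigrams(candidate)
        return len(target & grams) / len(target | grams)  # Jaccard similarity

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(vocabulary, key=lambda candidate: (-score(candidate), candidate))
    return ranked[:top_k]
```

This version scans the whole vocabulary list for every query, which is fine for a 24k-word list but would need an inverted index from bigram to words to scale much further.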

jerinphilip commented 7 years ago

@Deepayan137: Where is this? Can you push the files over here once they're presentable?

Deepayan137 commented 7 years ago

The alphabet is in parameters/error/Hindi, and the error correction is in SRC/error_module/suggestions.py.

I tried pushing it but faced the same permission issues.

jerinphilip commented 7 years ago

@Deepayan137: Use your local repo to commit and push. Then pull on the /OCRData2 repo. It's a bit of overhead, but whatever you commit comes in your name.

Deepayan137 commented 7 years ago

Okay, will do that.

jerinphilip commented 7 years ago

@Deepayan137: It's not supposed to be a trie. The alphabet is a file containing a single line, consisting of all the characters of the language's alphabet, combined with any symbols and other characters that might appear.
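For reference, a single-line alphabet file of that shape can be produced from a word list along these lines. This is a sketch, not what aux/generate_alphabet actually does (its behaviour is not shown in this thread); the function names and the `extra_symbols` parameter are made up for illustration.

```python
def build_alphabet(words, extra_symbols=""):
    """Collect every distinct character in `words`, plus any extra symbols, as one string."""
    chars = set(extra_symbols)
    for word in words:
        chars.update(word)
    return "".join(sorted(chars))   # sorted by code point for a stable file


def write_alphabet(path, alphabet):
    """Write the alphabet as a single UTF-8 line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(alphabet + "\n")


def read_alphabet(path):
    """Read the single-line alphabet back, without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")
```

For Devanagari and other Indic scripts, vowel signs and the virama are separate code points, so they show up as their own entries in the line.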

Deepayan137 commented 7 years ago

Oh sorry, my bad. I'll send you the text file containing all the characters in a single line.

Deepayan137 commented 7 years ago

Alphabet file sent. Please check.

jerinphilip / ocr-retrain

Experiments: Post ARL #4