nepalihackers / ideas

Gathering ideas to implement
Apache License 2.0

nepali ocr #1

Open techgaun opened 7 years ago

techgaun commented 7 years ago

@bravegurkha brought this up on Slack and I've always wanted to do so too.

prabinzz commented 7 years ago

A website scanner would be a nice project. Something like: grabbing all links, grabbing image links only, scanning forms, detecting GET and POST requests, and other stuff.
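A minimal sketch of the link, image, and form collection described above, using only the Python standard library. The names `PageScanner` and `scan` are hypothetical and chosen for illustration; fetching pages and crawling are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects links, image links, and forms from a single HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self.forms = []
        self._in_form = False  # True while we are between <form> and </form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "form":
            self._in_form = True
            self.forms.append({
                # An empty action submits back to the page itself.
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                # HTML forms default to GET when no method is given.
                "method": (attrs.get("method") or "get").upper(),
                "fields": [],
            })
        elif tag in ("input", "textarea", "select"):
            if self._in_form and attrs.get("name"):
                self.forms[-1]["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def scan(html, base_url):
    """Parse an HTML string and return the populated scanner."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    return scanner
```

Each entry in `forms` records the resolved action URL, the method (GET or POST), and the named fields, which covers the "form scanner" and "GET and POST request detection" parts of the idea for static HTML.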

techgaun commented 7 years ago

@prabinzz can you create a new issue with some more details? Is it mostly for scraping, or more for recon work on the security side? An extension would be to have a plugin system to run further exploitation tests on each feature. For example, if we detect GET request params, we could check them for potential XSS. Just a thought. Also, this GitHub org needs a revival, so we could discuss on Slack and actually spend time hacking on code.
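One way the plugin idea could look, as a rough sketch: each plugin is a function that takes a URL and returns a list of notes. The registry `CHECKS`, the `check` decorator, and `run_checks` are hypothetical names, and the sample plugin only flags GET parameters as candidates for a later XSS test; it does not send any requests.

```python
from urllib.parse import parse_qsl, urlsplit

CHECKS = []  # hypothetical plugin registry: functions taking a URL, returning notes


def check(func):
    """Decorator that registers a function as a plugin."""
    CHECKS.append(func)
    return func


@check
def get_param_candidates(url):
    """Flag every GET parameter in the URL as worth a later XSS test."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return [f"GET param '{name}' may be worth testing for XSS" for name, _ in params]


def run_checks(urls):
    """Run every registered plugin on every URL; keep only URLs with findings."""
    findings = {}
    for url in urls:
        notes = [note for plugin in CHECKS for note in plugin(url)]
        if notes:
            findings[url] = notes
    return findings
```

New checks would be added by decorating another function with `@check`, so the scanner core does not need to change.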

prabinzz commented 7 years ago

@techgaun Ok. I'll try to explain, though my English is messed up.

techgaun commented 7 years ago

@prabinzz don't worry about it. We can communicate more to be clear, so it's not a problem at all.

prabinzz commented 7 years ago

@techgaun yeah.. I hope so. ☺️

swornim00 commented 7 years ago

Should we start working on it? I think Python would be good for this project.

yalu commented 7 years ago

check this out!

https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md

It has Hindi support already.

just jotting down what I've seen:

techgaun commented 7 years ago

@yalu I have used Tesseract in the past (quite a while ago, maybe 4 years) for some tests and never found it accurate enough for Nepali text. I tried it again today with a couple of images. I see that they also have lang data for Nepali (in addition to Hindi): https://github.com/tesseract-ocr/langdata/tree/master/nep

I just tried both hin and nep and saw mixed success with each, so there is definitely a lot of room for improvement. We could either build from scratch or contribute to improving the lang data and traineddata for Nepali in the Tesseract project itself. I am leaning more towards our own implementation, primarily focused on Nepali (with reasonable base support for any Devanagari-script language), mostly out of curiosity: I'd like to explore the details that Tesseract and similar OCR engines already abstract away for us.
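For anyone who wants to repeat the hin/nep comparison, here is a small sketch that shells out to the `tesseract` command-line tool. It assumes Tesseract is installed with the `nep` and `hin` traineddata available; the helper names `tesseract_command` and `ocr` are hypothetical.

```python
import shutil
import subprocess


def tesseract_command(image_path, langs=("nep", "hin")):
    """Build the CLI call; `stdout` as the output base prints the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs)]


def ocr(image_path, langs=("nep", "hin")):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, langs),
        capture_output=True,
        encoding="utf-8",  # Devanagari output needs an explicit UTF-8 decode
        check=True,
    )
    return result.stdout
```

Calling `ocr("page.png", ["nep"])` and `ocr("page.png", ["hin"])` on the same image (the file name is a placeholder) gives two outputs that can be compared side by side.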

I think I've come across the chrome extension you've mentioned before but never looked into how it worked.

Also, do you happen to know if the Ncell mobile app winner project was based on tesseract or similar engines? I will try to find the source code (in case they open sourced it) or paper, maybe.

yalu commented 7 years ago

I like the way you think, @techgaun. I agree that existing solutions not specifically focused on Nepali are "lacking", to say the least. Going to the root of the problem and starting there is almost always a great idea. I too am curious about how it all works.

I have no idea if the app camp winner's code is open; I don't think so. The YIPL folks probably know more, since they were one of the two organisers.