Open techgaun opened 7 years ago
A website scanner would be a nice project. Something like: grabbing all links, grabbing image links only, a form scanner, GET and POST request detection, and other stuff.
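To make the idea concrete, here is a minimal sketch of the link, image, and form collection using only the Python standard library. The names `PageScanner` and `scan_html` are hypothetical, and fetching pages over the network is left out on purpose; this only parses HTML that has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects links, image sources, and forms from one HTML page.

    Hypothetical helper for illustration; not an existing project API.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # absolute URLs from <a href=...>
        self.images = []   # absolute URLs from <img src=...>
        self.forms = []    # one dict per <form>: action, method, inputs
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "form":
            self._in_form = True
            self.forms.append({
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                # HTML forms default to GET when no method is given
                "method": (attrs.get("method") or "get").upper(),
                "inputs": [],
            })
        elif tag in ("input", "textarea", "select"):
            # Only named fields inside a form are submitted with it
            if self._in_form and attrs.get("name"):
                self.forms[-1]["inputs"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def scan_html(html, base_url):
    """Parse an HTML string and return the populated scanner."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner
```

Each entry in `forms` records whether the form submits via GET or POST, which covers the request detection part for plain HTML forms. Requests made from JavaScript would need a different approach.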
@prabinzz can you create a new issue with some more details? Is it mostly for scraping, or more of recon work for security stuff? An extension would be to have a plugin system to run further exploitation tests on each feature. For example, if we detect the possibility of GET request params, we could check for potential XSS.. just a thought. Also, this GitHub org needs a revival so we can discuss on Slack and actually spend time hacking on code.
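A rough sketch of how such a plugin system could look, assuming plugins are plain functions that take a URL and return a list of notes. The registry `PLUGINS`, the `plugin` decorator, and `flag_get_params` are all hypothetical names. The example check only flags URLs that carry GET parameters as candidates for later XSS testing; it does not send any requests or payloads.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical plugin registry: each plugin takes a URL and returns notes.
PLUGINS = []


def plugin(func):
    """Register a check function so run_plugins() will call it."""
    PLUGINS.append(func)
    return func


@plugin
def flag_get_params(url):
    """Flag every query-string parameter as a candidate for XSS testing."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return [
        f"GET parameter '{name}' may be worth testing for reflected XSS"
        for name in sorted(params)
    ]


def run_plugins(urls):
    """Run every registered plugin on every URL; keep URLs with findings."""
    findings = {}
    for url in urls:
        notes = [note for check in PLUGINS for note in check(url)]
        if notes:
            findings[url] = notes
    return findings
```

New checks would be added by decorating another function with `@plugin`, so the scanner core does not need to know about them in advance.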
@techgaun Ok, I'll try to explain, though my English is messed up.
@prabinzz don't worry about it.. we can communicate more to be clear, so it's not a problem at all
@techgaun yeah.. I hope so. ☺️
Should we start working on it? I think Python would be good for this project.
check this out!
https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md
It has Hindi support already.
just jotting down what I've seen:
@yalu I have used tesseract in the past (quite a while ago, maybe 4 years) for some tests and never found it accurate enough for Nepali text. It has been a while, so I tried it again today with a couple of images. I see that they also have lang data for Nepali (in addition to Hindi): https://github.com/tesseract-ocr/langdata/tree/master/nep
I just tried the hin and nep language data and saw a mixed rate of success with both, so there is definitely a lot of room for improvement. We could either build from scratch or contribute to improving the lang data and traineddata for Nepali in the tesseract project itself. I am leaning towards our own implementation, primarily focused on Nepali (with reasonable base support for any Devanagari-script-based language), mostly out of curiosity to explore this side of things and implement the details that tesseract or similar OCR engines already abstract away for us.
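For anyone who wants to repeat that hin vs nep comparison, here is a small sketch that shells out to the tesseract command line tool. It assumes the `tesseract` binary is on PATH with the `hin` and `nep` traineddata installed; the helper names and `sample.png` are hypothetical.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang):
    """Build the CLI call: 'stdout' as the output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr(image_path, lang):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract CLI not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # 'sample.png' is a placeholder; point this at a scanned Nepali page.
    for lang in ("hin", "nep"):
        print(f"--- {lang} ---")
        print(ocr("sample.png", lang))
```

Comparing the two outputs side by side against the known text of the page gives a quick, informal feel for where each language model fails.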
I think I've come across the chrome extension you've mentioned before but never looked into how it worked.
Also, do you happen to know if the Ncell mobile app winner project was based on tesseract or similar engines? I will try to find the source code (in case they open sourced it) or paper, maybe.
I like the way you think, @techgaun. I agree that any existing solutions out there that don't specifically focus on Nepali are "lacking", to say the least. Going to the root of the problem and starting there is almost always a great idea. I too am curious about how it all works.
I have no idea if the app camp winner's code is open. I don't think so. I think the YIPL folks probably know more, since they were one of the two organisers.
@bravegurkha brought this up on Slack and I've always wanted to do so too.