Open funderburkjim opened 7 years ago
Daring venture. That would be something very useful for proofreading / autosuggestions for corrections in Sanskrit fields.
Orangoo and Hunspell are some of the programs.
More generic help here:
http://office.microsoft.com/en-in/word-help/word-features-for-indic-languages-HP001036692.aspx
Database-related progress (an old status) is reported here, along with a request for volunteers:
http://www.slideshare.net/shantanuo/spell-check-in-indian-languages
Silpa is an interesting project. Related links:
http://silpa.org.in/SpellCheck
http://lists.smc.org.in/pipermail/student-projects-smc.org.in/2014-February/000033.html
http://thottingal.in/projects/spellchecker/
There are probably more such initiatives elsewhere. Other members of the list may be able to add more.
Regards,
Nagaraj
Hunspell seems to be the most recent
So it is. And the most widely used.
That would be something very useful for proofreading / autosuggestions for corrections in Sanskrit fields.
And it gives better line breaks as well, even in MS Word.
Now you know my website. Can't find all of the documentation; only one Google Docs file located.
Found it, https://docs.google.com/document/d/1Ktm-rMjZnOGFdwN7u7gE1WzkhohPcPMQA3XrRSvn56o/edit# from 2013. Sanskrit Hyphenation was developed for my Reverse Dictionary of Sanskrit, please ask - main document in Russian.
If we work to develop a sanskrit hunspell dictionary, the Python interface to use such dictionaries is probably pyenchant. I have used this to do English and German spell checking. Web search shows there is a php 'wrapper' also, but I haven't tried it.
@drdhaval2785 Namaste sir, For my research, I want to build a fully functional spelling and grammar checker like Grammarly and LanguageTool for Sanskrit. I am thinking of starting with the spell checker using Hunspell. Has there been any progress made in this? Or are there any other implementations?
No progress on hunspell.
There were trials to use bigrams / trigrams frequency to find out potentially erroneous entries.
You may benefit from reading the following for spellchecker.
I had started a spell check exercise, but it was centred on the Cologne Sanskrit dictionaries rather than on general spellchecking. https://github.com/drdhaval2785/SanskritSpellCheck#logic may give some thoughts which you may extrapolate.
https://github.com/sanskrit-lexicon/COLOGNE/issues/91#issuecomment-258378657 is regarding Sanskrit and Hindi hunspell.
I remember that there was a hunspell dictionary generator, which can generate grammar from a corpus, without manual intervention. I am unable to locate that software link now. You may find it by googling, @vipranarayan14 .
It would tend to overgenerate the grammar, but if we take some balanced text like the Bhagavadgita or the Ramayana etc., which are quite representative of poetic literature, an automatic generator should work just fine.
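The bigram / trigram frequency idea mentioned above can be illustrated with a small sketch. This is only an assumption about how such a check might look, not the method used in the earlier trials: it counts character n-grams in a trusted word list and flags candidate words containing an n-gram never seen there. The function names, the boundary markers `^` and `$`, and the toy word list (in an SLP1-like ASCII transliteration) are all hypothetical.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_counts(words, n=2):
    """Count how often each character n-gram occurs in a trusted word list."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def suspicious_words(candidates, counts, n=2, min_count=1):
    """Return (word, rare_ngrams) for candidates with any n-gram seen fewer than min_count times."""
    flagged = []
    for word in candidates:
        rare = [g for g in char_ngrams(word, n) if counts[g] < min_count]
        if rare:
            flagged.append((word, rare))
    return flagged


# Toy reference list, illustrative only; a real run would use a large corpus.
reference = ["deva", "devaH", "devam", "devena", "rAma",
             "rAmaH", "rAmam", "rAmeRa", "vana", "vanam"]
counts = build_ngram_counts(reference, n=2)
flagged = suspicious_words(["devam", "dvema", "rAmaH"], counts, n=2)
print(flagged)
```

With a realistic corpus, `min_count` would need tuning, and a flagged word is only a candidate for manual review, since rare but valid forms will also be reported.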
a hunspell dictionary generator, which can generate grammar from corpus, without manual intervention
Never seen such a thing. Sounds like pure magic.
I think that pyenchant under the hood uses hunspell.
@vipranarayan14 -- please let us know if you make progress in this area.
@drdhaval2785
Namaste sir,
No progress on hunspell.
Thank you for the info.
You may benefit from reading the following for spellchecker.
I think they will be useful.
@gasyoun Sir, was this sanskrit-hunspell developed by you? If so, kindly let me know if there has been any progress.
@drdhaval2785
a hunspell dictionary generator, which can generate grammar from corpus, without manual intervention
Sir, is it the affixcompress tool in Hunspell? The tool, according to the README file, can generate a Hunspell dictionary from a wordlist.
@funderburkjim Pyenchant under the hood uses Enchant. Enchant is a C++ library which wraps many spellchecking libraries such as Hunspell, Nuspell, etc.
Sir, is it the affixcompress tool in Hunspell?
I have not used it. So no chance of remembering, alas. You will have to try and see.
There is a discussion here also reg. making Hunspell for Sanskrit: https://github.com/Shreeshrii/hindi-hunspell/issues/1
Sure, sir. I will update the results here when I try it.
@vipranarayan14 Any update on this front?
I am currently working on it, sir.
So far, I have completed adding support for subantas and a major part of tiṅantas.
The dictionary currently supports spellchecking for:
@vipranarayan14 can we ask you to share the draft file, please?
@gasyoun May I know the purpose, sir?
Have played with it myself 10 years ago.
@gasyoun Sorry, sir, for the huge delay in reply.
I believe I can get valuable insights from sharing the draft with you all. But since it is being developed as a part of my PhD research, I discussed this with my supervisor and they recommend I do not share the file at this stage. So, I may be able to share it later.
However, I'll keep you all posted on my progress.
Best wishes for your Ph.D.
@drdhaval2785 Thank you, sir.
Any update @vipranarayan14 ?
Due to personal commitments, I wasn't able to make significant progress for the last couple of months, sir. However, I'm going to complete it in a month or so.
Anyway, I have created a web interface for the dictionary. You can check it out here.
Please note, some of the correct words may also be marked as incorrect since they have not been added to the dictionary yet.
Also, sir, please don't share the link to any other forums or the like. Once the dictionary is complete I'll myself make it publicly accessible. I request the same from the other participants of the thread.
So be it. What are the known mistakes?
@gasyoun By 'known mistakes', do you mean false positives (misspelt words being recognised as correct) and false negatives (correctly spelt words being recognised as misspelt)?
A common form for dictionaries used for spell-checking by open source projects is Hunspell (there are variants such as Aspell, but Hunspell seems to be the most recent).
The following link may give enough details to actually make a dictionary for a new language (such as Sanskrit): hunspell format.
I'm mentioning this here so that this idea can be revisited when time permits.
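To make the Hunspell format a little more concrete, here is a rough sketch of how a `.dic` word list and the `SFX` rules of an `.aff` file combine to describe inflected forms. It is a simplified reading of the format, not Hunspell's own implementation: it handles only suffix rules, single-character flags, and conditions that are either `.` or a literal ending (real Hunspell conditions allow character classes). The toy entries, the flag `A`, and the helper names `parse_aff` and `expand` are assumptions made for illustration.

```python
def parse_aff(aff_text):
    """Collect SFX rules per flag as flag -> list of (strip, add, condition)."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields: SFX flag strip add condition.
        # Header lines (SFX flag Y/N count) have four fields and are skipped.
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts
            rules.setdefault(flag, []).append(
                ("" if strip == "0" else strip, "" if add == "0" else add, cond)
            )
    return rules


def expand(dic_text, rules):
    """Return the set of word forms described by the dictionary and suffix rules."""
    forms = set()
    # The first line of a .dic file is an (approximate) entry count.
    for entry in dic_text.splitlines()[1:]:
        entry = entry.strip()
        if not entry:
            continue
        stem, _, flags = entry.partition("/")
        forms.add(stem)
        for flag in flags:
            for strip, add, cond in rules.get(flag, []):
                # Simplification: the condition is "." or a literal ending.
                if cond != "." and not stem.endswith(cond):
                    continue
                if strip and not stem.endswith(strip):
                    continue
                base = stem[:len(stem) - len(strip)] if strip else stem
                forms.add(base + add)
    return forms


# Toy files: a few singular forms of an a-stem noun (illustrative only).
AFF = """SET UTF-8
SFX A Y 3
SFX A 0 ḥ a
SFX A 0 m a
SFX A a ena a
"""
DIC = """2
deva/A
vana
"""

forms = expand(DIC, parse_aff(AFF))
print(sorted(forms))
```

A real Sanskrit dictionary would need far more rules per flag and would have to deal with sandhi and similar sound changes, which this sketch does not attempt.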