sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Hunspell for Sanskrit? #91

Open funderburkjim opened 7 years ago

funderburkjim commented 7 years ago

A common dictionary format used for spell-checking by open-source projects is Hunspell (there are alternatives such as Aspell, but Hunspell seems to be the most recent).

The following link may give enough details to actually make a dictionary for a new language (such as Sanskrit): hunspell format.
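For orientation, a Hunspell dictionary is a pair of plain-text files: a `.dic` word list whose first line is an approximate entry count, and an `.aff` file holding the affix rules. The sketch below keeps a toy pair in memory and expands one suffix class in pure Python. The two stems, the flag `A`, and the two endings are illustrative assumptions rather than a proposed Sanskrit design, and the expander only understands simple `SFX` rules with `0` stripping and a `.` condition.

```python
# Toy Hunspell-style files held as strings (illustrative content only).
AFF = """SET UTF-8
SFX A Y 2
SFX A 0 ḥ .
SFX A 0 m .
"""

DIC = """2
rāma/A
deva/A
"""

def parse_suffixes(aff_text):
    """Collect {flag: [suffix, ...]} from simple SFX rule lines."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have 5 fields: SFX flag strip add condition.
        # The 4-field header line (SFX flag Y count) is skipped.
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts
            if strip == "0" and cond == ".":
                rules.setdefault(flag, []).append(add)
    return rules

def expand(dic_text, rules):
    """Return every surface form licensed by the toy dictionary."""
    forms = set()
    for entry in dic_text.splitlines()[1:]:  # skip the count line
        stem, _, flags = entry.partition("/")
        forms.add(stem)  # the bare stem is itself an accepted word
        for flag in flags:
            for suffix in rules.get(flag, []):
                forms.add(stem + suffix)
    return forms

words = expand(DIC, parse_suffixes(AFF))
print(sorted(words))
```

A real Sanskrit affix file would need stripping and conditions (for sandhi and stem changes), which this toy expander deliberately ignores.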

I'm mentioning this here so that this idea can be revisited when time permits.

drdhaval2785 commented 7 years ago

Daring venture. That would be something very useful for proofreading / autosuggestions for corrections in Sanskrit fields.

drdhaval2785 commented 7 years ago

Some resources

http://sanskritdocuments.org/hindi/hunspell

https://github.com/Shreeshrii/hindi-hunspell

http://samskrtam.ru/sanskrit-hunspell/

drdhaval2785 commented 7 years ago

http://sanskrit.uohyd.ernet.in/scl/spell_checker/

drdhaval2785 commented 7 years ago

Orangoo and Hunspell are some of the programs.

  1. Microsoft has included plugins in Word 2003 onwards. The databases are called 'Dictionary'. For example, one needs to enable 'Hindi Dictionary' in the configuration settings. A help question answered for Bangla is as follows:

http://answers.microsoft.com/en-us/office/forum/office_2013_release-word/a-typical-problem-with-typing-and-spellcheck-in/f69ad8cf-ea77-489c-ae77-3ca0a29cc3f0

More generic help here:

http://office.microsoft.com/en-in/word-help/word-features-for-indic-languages-HP001036692.aspx

Database-related progress (an old status) is reported here, along with a request for volunteers:

http://www.slideshare.net/shantanuo/spell-check-in-indian-languages

Shilpa is an interesting project. Related links:

http://silpa.org.in/SpellCheck

http://lists.smc.org.in/pipermail/student-projects-smc.org.in/2014-February/000033.html

http://thottingal.in/projects/spellchecker/

There are probably more such initiatives elsewhere. Other members of the list may be able to add more.

Regards,

Nagaraj

gasyoun commented 7 years ago

Hunspell seems to be the most recent

So it is. And the most widely used.

That would be something very useful for proofreading / autosuggestions for corrections in Sanskrit fields.

And it gives better line breaks as well, even in MS Word.

http://samskrtam.ru/sanskrit-hunspell/

Now you know my website. I can't find all of the documentation; I have located only one Google Docs file.

Found it, https://docs.google.com/document/d/1Ktm-rMjZnOGFdwN7u7gE1WzkhohPcPMQA3XrRSvn56o/edit# from 2013. Sanskrit Hyphenation was developed for my Reverse Dictionary of Sanskrit, please ask - main document in Russian.

funderburkjim commented 7 years ago

If we work to develop a Sanskrit Hunspell dictionary, the Python interface to use such dictionaries is probably pyenchant. I have used this to do English and German spell checking. A web search shows there is a PHP 'wrapper' also, but I haven't tried it.
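A minimal sketch of what checking words through pyenchant could look like. The language tag `sa_IN` is a hypothetical name for a Sanskrit dictionary that would first have to be built and installed where Enchant can find it; the code returns `None` when pyenchant or that dictionary is missing, so it runs either way.

```python
try:
    import enchant  # pyenchant; also needs the Enchant C library
except ImportError:
    enchant = None

def check_words(words, tag="sa_IN"):
    """Map each word to True/False, or return None if `tag` is unavailable."""
    # "sa_IN" is an assumed tag; no such dictionary is known to ship by default.
    if enchant is None or not enchant.dict_exists(tag):
        return None
    checker = enchant.Dict(tag)
    return {w: checker.check(w) for w in words}

result = check_words(["rāmaḥ", "rāmaḥḥ"])
print(result if result is not None else "no dictionary installed for this tag")
```

Calling `check_words(["hello", "helo"], tag="en_US")` exercises the same path with an English dictionary, if one is installed.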

gasyoun commented 7 years ago

http://pythonhosted.org/pyenchant/tutorial.html indeed.

vipranarayan14 commented 3 years ago

@drdhaval2785 Namaste sir. For my research, I want to build a fully functional spelling and grammar checker for Sanskrit, like Grammarly and LanguageTool. I am thinking of starting with the spell checker using Hunspell. Has there been any progress on this? Or are there any other implementations?

drdhaval2785 commented 3 years ago

No progress on hunspell.

There were trials using bigram / trigram frequencies to find potentially erroneous entries.
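As a rough illustration of the n-gram idea (not the code used in those trials), the sketch below counts character trigrams over a token list and flags any token containing a trigram rarer than a threshold. The token list, the boundary markers `^` and `$`, and the threshold of 2 are all assumptions chosen for the example.

```python
from collections import Counter

def trigrams(word):
    """Character trigrams of a word padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def suspicious(words, min_count=2):
    """Return words containing a trigram seen fewer than `min_count` times."""
    counts = Counter(t for w in words for t in trigrams(w))
    return [w for w in words if any(counts[t] < min_count for t in trigrams(w))]

# Toy token list in an ASCII transliteration; "dveaH" is a planted typo.
tokens = ["deva", "devaH", "deva", "devaH", "dveaH"]
print(suspicious(tokens))
```

On a real headword list the threshold would need tuning, since rare but valid letter sequences are flagged as well.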

You may benefit from reading the following for spellchecker.

  1. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/46
  2. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/151
  3. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/178
  4. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/185
  5. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/198
  6. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/293

I had started a spell-check exercise, but it was specific to the Cologne Sanskrit dictionaries rather than general spell checking. https://github.com/drdhaval2785/SanskritSpellCheck#logic may give some ideas that you can extrapolate from.

https://github.com/sanskrit-lexicon/COLOGNE/issues/91#issuecomment-258378657 is regarding Sanskrit and Hindi hunspell.

drdhaval2785 commented 3 years ago

I remember that there was a hunspell dictionary generator, which can generate grammar from corpus, without manual intervention. I am unable to locate the link to that software now. You may find it by googling, @vipranarayan14.

It would tend to overgenerate the grammar, but if we take some balanced text like the Bhagavadgita or the Ramayana, which are quite representative of poetic literature, an automatic generator should work just fine.
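The simplest corpus-driven starting point is a flat word list with no affix rules at all. The sketch below, which is not the generator mentioned above, tokenises a text on whitespace and a small assumed set of punctuation marks (including the danda) and emits `.dic`-style text: the entry count followed by one unique word per line. The sample is the opening verse of the Bhagavadgita in IAST.

```python
import re

# Assumed separators: whitespace, danda, double danda, pipe, common punctuation.
SEPARATORS = r"[\s।॥|.,;:!?()-]+"

def build_dic(corpus_text):
    """Return .dic-style text: entry count, then one unique word per line."""
    tokens = [t for t in re.split(SEPARATORS, corpus_text) if t]
    words = sorted(set(tokens))
    return "\n".join([str(len(words)), *words]) + "\n"

corpus = (
    "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ | "
    "māmakāḥ pāṇḍavāś caiva kim akurvata sañjaya ||"
)
dic_text = build_dic(corpus)
print(dic_text)
```

Every inflected or sandhi-affected form (for example `pāṇḍavāś`) becomes its own entry here, so the list only covers forms that actually occur in the corpus.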

gasyoun commented 3 years ago

a hunspell dictionary generator, which can generate grammar from corpus, without manual intervention

Never seen such a thing. Sounds like pure magic.

funderburkjim commented 3 years ago

I think that pyenchant under the hood uses hunspell.

@vipranarayan14 -- please let us know if you make progress in this area.

vipranarayan14 commented 3 years ago

@drdhaval2785

Namaste sir,

No progress on hunspell.

Thank you for the info.

You may benefit from reading the following for spellchecker.

I think they will be useful.

vipranarayan14 commented 3 years ago

@gasyoun Sir, was this sanskrit-hunspell developed by you? If so, kindly let me know if there has been any progress.

vipranarayan14 commented 3 years ago

@drdhaval2785

a hunspell dictionary generator, which can generate grammar from corpus, without manual intervention

Sir, is it the affixcompress tool in Hunspell? The tool, according to the Readme file, can generate a Hunspell dictionary from a wordlist.

vipranarayan14 commented 3 years ago

@funderburkjim Pyenchant under the hood uses Enchant. Enchant is a C++ library which wraps many spellchecking libraries such as Hunspell, Nuspell, etc.
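One way to see which backends Enchant has available on a given machine is to ask its broker. This is a small sketch that returns an empty list when pyenchant (or the underlying Enchant library) is not installed; which providers show up depends entirely on the local installation.

```python
try:
    import enchant  # pyenchant; also needs the Enchant C library
except ImportError:
    enchant = None

def provider_names():
    """Names of the backends Enchant can use here, or [] if unavailable."""
    if enchant is None:
        return []
    return [provider.name for provider in enchant.Broker().describe()]

print(provider_names())
```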

drdhaval2785 commented 3 years ago

Sir, is it the affixcompress tool in Hunspell?

I have not used it. So no chance of remembering, alas. You will have to try and see.

vipranarayan14 commented 3 years ago

There is also a discussion here regarding making Hunspell for Sanskrit: https://github.com/Shreeshrii/hindi-hunspell/issues/1

vipranarayan14 commented 3 years ago

Sir, is it the affixcompress tool in Hunspell?

I have not used it. So no chance of remembering, alas. You will have to try and see.

Sure, sir. I will update the results here when I try it.

drdhaval2785 commented 3 years ago

@vipranarayan14 Any update on this front?

vipranarayan14 commented 3 years ago

I am currently working on it, sir.

So far, I have completed adding support for subantas and a major part of tiṅantas.

The dictionary currently supports spellchecking for:

gasyoun commented 3 years ago

@vipranarayan14 can we ask you to share the draft file, please?

vipranarayan14 commented 3 years ago

@gasyoun May I know the purpose, sir?

gasyoun commented 3 years ago

Have played with it myself 10 years ago.

vipranarayan14 commented 2 years ago

@gasyoun Sorry, sir, for the huge delay in reply.

I believe I can get valuable insights from sharing the draft with you all. But since it is being developed as a part of my PhD research, I discussed this with my supervisor, and they recommend that I do not share the file at this stage. So, I may be able to share it later.

However, I'll keep you all posted on my progress.

drdhaval2785 commented 2 years ago

Best wishes for your Ph.D.

vipranarayan14 commented 2 years ago

@drdhaval2785 Thank you, sir.

drdhaval2785 commented 2 years ago

Any update @vipranarayan14 ?

vipranarayan14 commented 2 years ago

Due to personal commitments, I wasn't able to make significant progress for the last couple of months, sir. However, I'm going to complete it in a month or so.

Anyway, I have created a web interface for the dictionary. You can check it out here.

Please note, some of the correct words may also be marked as incorrect since they have not been added to the dictionary yet.

Also, sir, please don't share the link to any other forums or the like. Once the dictionary is complete I'll myself make it publicly accessible. I request the same from the other participants of the thread.

gasyoun commented 2 years ago

Also, sir, please don't share the link to any other forums or the like. Once the dictionary is complete I'll myself make it publicly accessible. I request the same from the other participants of the thread.

So be it. What are the known mistakes?

vipranarayan14 commented 2 years ago

@gasyoun By 'known mistakes', do you mean false positives (misspelt words being recognised as correct) and false negatives (correctly spelt words being recognised as misspelt)?
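Using the definitions above, both kinds of error can be listed by running a checker over words whose correctness is already known. The sketch below is only an illustration: the toy lexicon (which deliberately contains one bad entry), the labelled test words, and the use of set membership as the "checker" are all assumptions.

```python
def evaluate(is_accepted, labelled_words):
    """Split a checker's errors into false positives and false negatives.

    `labelled_words` is a list of (word, is_correct) pairs.
    """
    # False positive: a misspelt word that the checker accepts.
    false_pos = [w for w, ok in labelled_words if not ok and is_accepted(w)]
    # False negative: a correctly spelt word that the checker rejects.
    false_neg = [w for w, ok in labelled_words if ok and not is_accepted(w)]
    return false_pos, false_neg

# Toy lexicon standing in for a real dictionary; "rāmaḥḥ" is a planted bad entry.
lexicon = {"rāmaḥ", "devaḥ", "rāmaḥḥ"}
labelled = [("rāmaḥ", True), ("phalam", True), ("rāmaḥḥ", False), ("dveaḥ", False)]

fp, fn = evaluate(lexicon.__contains__, labelled)
print("false positives:", fp)
print("false negatives:", fn)
```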