languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

prepare a new version of en_gb dictionary and add word frequency information #73

Closed milekpl closed 10 years ago

milekpl commented 10 years ago

Use the file:

https://addons.mozilla.org/en-US/firefox/addon/british-english-dictionary-2

and include the frequency information.

The process of creating binary spelling dictionaries is described here:

http://wiki.languagetool.org/hunspell-support

Mailaender commented 10 years ago

I tried it in https://github.com/languagetool-org/languagetool/pull/108 although http://marcoagpinto.cidadevirtual.pt/en_GB_README.html looks spurious.

marcoagpinto commented 10 years ago

Hi!

The URL is: http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html

At the top right of that page there is a link to Mozilla and another to Apache OpenOffice.

The best option is to click the Mozilla link, download the .XPI, rename it to .ZIP and then extract the .AFF and .DIC files.
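The rename-and-extract step described above could also be scripted. The following is only a minimal sketch, assuming the .XPI is an ordinary ZIP archive that contains the .AFF and .DIC files somewhere inside it; the function name and the output folder are hypothetical and not part of any existing tool.

```python
import zipfile
from pathlib import Path


def extract_hunspell_files(xpi_path, out_dir):
    """Copy every .aff and .dic file out of an .xpi archive into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # An .xpi is a ZIP archive, so zipfile can open it without renaming it.
    with zipfile.ZipFile(xpi_path) as archive:
        for info in archive.infolist():
            name = Path(info.filename).name
            if info.is_dir() or not name.lower().endswith((".aff", ".dic")):
                continue
            target = out_dir / name
            # Write under the bare file name, ignoring folders inside the archive.
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return sorted(extracted)
```

Calling `extract_hunspell_files("british_english_dictionary.xpi", "en_GB")` (the file name here is made up) would return the paths of the extracted files.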

If you have any questions, please ask.

PS: Please note that within a couple of days I am going to release V2.12 with some 300 or 400 new words.

Kind regards,

Marco A.G.Pinto

Mailaender commented 10 years ago

I extracted the .XPI and got the AFF/DIC from there. Thanks. However, when I read "forked from" I found https://addons.mozilla.org/de/firefox/addon/british-english-dictionary-/ linked below, which is yet another unmaintained package that improves on a repackage of an update. Isn't there a central site at @mozilla or @libreoffice where everyone contributes new words, instead of the project being handed from one person to the next? Also, upstream at http://hunspell.sourceforge.net/ seems pretty dead.

marcoagpinto commented 10 years ago

Well,

I am the official English dictionaries maintainer at OpenOffice: http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice

You see "forked by Marco Pinto" (me) there, because the original people in charge vanished long ago.

Lucas something grabbed the original Mozilla extension and added "(updated)" to it... but all he did was change the Firefox and Thunderbird version numbers so that people could keep using the old extension.

So, I grabbed the project myself and called it a "forked version". So far I believe I have added 3000+ words since I first got my hands on it.

The base of my words is: http://www.oxforddictionaries.com

Every time I want to add a new word I go to the link above to check if the word exists.

Using the LibreOffice en_GB dictionary is not a good idea. I know they have 600,000 words or so in their .DIC, but they unmunched the original .DIC and it got all corrupted. If you open the LO file in an editor you will see all kinds of garbage.

:-)

Kind regards,

Marco A.G.Pinto

Mailaender commented 10 years ago

You may want to get your latest version linked at https://addons.mozilla.org/de/firefox/language-tools/ because otherwise it is pretty hidden on that huge site. Thanks for your work. I know first hand that maintaining these files is a lot of effort, and editing can be annoying, especially if you validate each new entry by hand. See also https://git.eclipse.org/r/#/c/17076/ as @Eclipse also maintains its own spell checker files. I believe all those projects might want a common upstream where they can collaborate on high-quality, up-to-date dictionaries. Maybe with a wiki- or Transifex-like web interface so the barrier to contributing is low.

danielnaber commented 10 years ago

Marco, where do you maintain the dictionary? Is it on GitHub? I think it should be, as this makes following the changes easy.

marcoagpinto commented 10 years ago

My dear Daniel,

I have the files stored on my hard disk.

I use my tool "Proofing Tool GUI" to edit.

Every month I release a new version of the dictionary with some 200 or 300 new words.

Then I upload to Mozilla and someone there reviews and validates the add-on.

The list of changes appears on my Mozilla add-on page and on my homepage, via the links at the top right of the window.

In OpenOffice things take longer since I was told to only update the dictionaries in June, so that after OpenOffice 4.1 is released, people won't have to immediately download new dictionaries.

In OpenOffice I will also update the en_US and en_CA dictionaries in June, since Kevin Atkinson from Aspell sent me the most recent files in January.

I also noticed that Kevin's files have thousands fewer words than my en_GB, and weeks ago I sent Kevin an e-mail with my list of words so that he could check whether they exist in US and CA, but so far there has been no reply. I will annoy Kevin again in a week or so.

Kind regards,

Marco A.G.Pinto

danielnaber commented 10 years ago

To help establish your version as the new standard, I suggest keeping it on GitHub. You can still edit it with your tools. GitHub is the place where people will look for these kinds of things. The README.md file could also contain information about the history of the dictionary and the criteria for inclusion of words.

marcoagpinto commented 10 years ago

How, Daniel?

:)

I am kind of naive with GitHub and SVN.

I only know the basic stuff: checkout, update and commit.

:)

danielnaber commented 10 years ago

Hi Marco, you can create a new repository at https://github.com/marcoagpinto?tab=repositories by clicking "New". The rest works just as with LanguageTool; only the address of the first checkout will differ (GitHub will show the address once the repo has been created).

Mailaender commented 10 years ago

https://help.github.com/ is always a good read. There are also several detailed tutorials available:

vogella commented 10 years ago

I think this bug is about updating languagetool with the latest version of https://github.com/marcoagpinto/aoo-mozilla-en-dict. If @marcoagpinto would use a fixed order structure (see https://github.com/marcoagpinto/aoo-mozilla-en-dict/issues/9) you could use Git submodules to include the latest version of his directories directly.

As of Git 1.8.2 you can set up submodules to track the master branch; see for example http://www.vogella.com/tutorials/Git/article.html#submodules_trackbranch
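As a rough illustration of the branch-tracking setup mentioned above, the two Git commands involved can be wrapped in a small Python helper. This is only a sketch: the helper names and the target path are hypothetical, and it assumes `git` is on the PATH and that the script is run from inside the superproject.

```python
import subprocess


def submodule_commands(repo_url, path, branch="master"):
    """Return the git commands that add a branch-tracking submodule and update it."""
    return [
        # Record the submodule together with the branch it should follow.
        ["git", "submodule", "add", "-b", branch, repo_url, path],
        # Move the submodule to the newest commit on that branch.
        ["git", "submodule", "update", "--remote", "--", path],
    ]


def run_all(commands, cwd="."):
    """Run each command in order inside cwd, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

For example, `run_all(submodule_commands("https://github.com/marcoagpinto/aoo-mozilla-en-dict", "en-dict"))` would add the dictionary repository under a made-up `en-dict` folder; running only the second command again later would pull in a newer release.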

Mailaender commented 10 years ago

I did it manually in https://github.com/languagetool-org/languagetool/pull/108 for now, but some kind of automation, be it Git submodules or simple fetch-and-unzip scripts, may make sense, as @marcoagpinto is doing regular monthly releases.

danielnaber commented 10 years ago

This (word frequency information) is fixed now; #108 is still open.