Add ignore regex for text, not only words

carlos-jenkins commented 12 years ago

I would like the library to be able to ignore links and other regexes that span several words. For example:

 # This doesn't work: the spellcheck library uses Pango's word-breaking
 # algorithms, and thus a link is never evaluated as a whole, so
 # the regex will never match :S FIXME
 #regex_bank = txt2tags.getRegexes()
 #self.spell_checking.append_ignore_regex(regex_bank['link'].pattern)

Jeje, I didn't know how to fix it without breaking everything :P And that was an excuse to tell you:

Gtkspell really sucks, so thanks: this is a library that actually does the job. I backported it to PyGtk, but documented very well what is needed to run it on PyGObject again (I can't migrate to PyGObject right now, but will in the future). Just 6 lines need to be uncommented to make it work on PyGObject.

I added several things, like the ability to disable spell checking (so I can have a button to toggle it) and to set an alternate path for finding dictionaries (useful on MS Windows).

You can find it here: http://sourceforge.net/p/nestededitor/code/235/tree/trunk/nested/modules/ (textviews/spellcheck and locales).

Have a look at oxt_import in case you want it; it allows using OpenOffice/LibreOffice .oxt dictionaries. An example of how I'm using it is here: http://sourceforge.net/p/nestededitor/code/235/tree/trunk/nested/nested_gui.py , lines 210 to 258

Also check locales: I've changed the code to use gettext and removed all the .mo and .po files, because on an Ubuntu/Debian system they are shipped in the iso-codes package, so they are not required.

Kind regards

koehlma commented 12 years ago

I am happy that this is useful for some people :)

Word breaking is really complicated because, for example, a slash should sometimes be the end of a word and sometimes not. Maybe the regexes should be applied to the whole line, so that the algorithm knows which ranges to ignore.
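
A rough way to see the difference in plain Python 3, with no GTK involved. The `split_words` helper is a hypothetical stand-in for Pango's word breaking (it simply keeps alphanumeric runs), not the library's actual code, and the sample line is made up:

```python
import re

LINK = re.compile(r'http://[a-zA-Z.0-9/]*')

def split_words(line):
    # Hypothetical stand-in for Pango word breaking: alphanumeric runs only.
    return re.findall(r'[A-Za-z0-9]+', line)

line = 'see http://someurl.com/something for details'

# Word by word: the link has been cut into pieces, so no piece matches.
words = split_words(line)
per_word_hits = [w for w in words if LINK.fullmatch(w)]

# Whole line: the regex gives a character range that can be ignored.
match = LINK.search(line)
ignore_range = match.span()
```

Here `per_word_hits` stays empty, while `ignore_range` covers the complete link.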

In general this library is more or less a copy of the original GtkspellCheck in C, so I have copied all the mistakes they made. Because of that I think I have to rewrite the whole library. Documentation is also needed.

Nested is a specialized editor focused on creating structured documents such
as reports, publications, presentations, books, etc. It is designed to help the user
concentrate on writing content without being distracted by format or markup. It
offers a rich WYSIWYM interface where the user writes plain text with a
lightweight markup language.

This is in fact the kind of program I wrote this for. I need such a program at school, but also with some WYSIWYG features for equations, because this LaTeX stuff is too complicated for quick school usage.

koehlma commented 12 years ago

Fixed... :)

Use something like:

spellchecker.append_line_regex('http://[a-zA-Z.0-9/]*')

Do not use something like:

spellchecker.append_line_regex('http://.*')

This will match the rest of the line, so everything after the link will be ignored.
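
The difference between the two patterns can be checked with Python's `re` module alone. This is only an illustration of the greedy `.*` behaviour on a made-up line, not of what the library does internally:

```python
import re

line = 'get it from http://someurl.com/something or wryte to me'

narrow = re.search(r'http://[a-zA-Z.0-9/]*', line)
greedy = re.search(r'http://.*', line)

# The character class stops at the first character it does not allow,
# here the space after the link.
narrow_text = narrow.group(0)

# '.*' keeps going to the end of the line, so the misspelled
# word 'wryte' ends up inside the ignored range.
greedy_text = greedy.group(0)
```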

Have a look at the commit changes to backport it to PyGTK.

Nevertheless the whole thing needs a rewrite.

carlos-jenkins commented 12 years ago

Ash, I don't have permission to reopen this... Anyway, the problem is that I think this will be too hacky and in practice not very useful. I can still have:

"Please dounlod this documen fronm [http://someurl.com/something]" or wryte to [somebody@whatever.com]"

And with the current implementation, more will be removed than is actually necessary. Think about this approach (I'm not very handy yet with marks and tags, but hey):

1. Create a new tag, let's call it "dont-highlight-me", jeje.
2. For each line, try to match each line_regex and get iters for the match boundaries (iter_match_start, iter_match_end).
3. Apply the new tag between those iters.
4. Now, when scanning for spelling errors, ignore any word that has the tag and jump to the end of that tag.
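
The idea can be sketched without GTK, as a rough illustration only. Plain `(start, end)` character spans stand in for the tag, a simple word regex stands in for Pango's word breaking, and the helper names and the simplified email pattern are made up for this example:

```python
import re

def ignored_spans(line, patterns):
    """Return the (start, end) span of every regex match in the line."""
    spans = []
    for pattern in patterns:
        spans.extend(m.span() for m in re.finditer(pattern, line))
    return spans

def words_to_check(line, patterns):
    """Yield only the words that do not overlap an ignored span."""
    spans = ignored_spans(line, patterns)
    # Simplified word breaking (an assumption, not Pango's algorithm).
    for m in re.finditer(r"[A-Za-z']+", line):
        start, end = m.span()
        if not any(start < s_end and end > s_start
                   for s_start, s_end in spans):
            yield m.group(0)

patterns = [
    r'http://[a-zA-Z.0-9/]*',            # link pattern from this thread
    r'[A-Za-z0-9_.-]+@[A-Za-z0-9.-]+',   # simplified email (assumption)
]
line = ('Please dounlod this documen fronm '
        '[http://someurl.com/something] or wryte to [somebody@whatever.com]')
checked = list(words_to_check(line, patterns))
```

Only the words outside the link and the address reach the spell checker, so the misspellings around them are still reported.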

If that doesn't work, you can still do this (really ugly):

1. Do what you did for spell checking, word by word, using Pango's word-breaking algorithm.
2. For each line, try to match each line_regex and get iters for the match boundaries (iter_match_start, iter_match_end).
3. Remove the spell checking tags between those iters.

I already backported the new "swap textbuffers" thing you committed. I'll wait to see how we solve this. Also check this: http://sourceforge.net/p/nestededitor/code/236/tree/trunk/nested/modules/buffers/markup_buffer.py And an example that uses it: http://sourceforge.net/p/nestededitor/code/236/tree/trunk/nested/modules/bibmm/bibtexbuffer.py It has really cool logic for matching regexes in gtk.TextBuffers; maybe it can help.

I'll try to sketch something tomorrow too.

Kind regards

koehlma commented 12 years ago

In the new version everything between the regex boundaries will be ignored, and there are also predefined regexes for URLs, emails and numbers. The new version also allows multiline ignore regexes.

This buffer markup logic is indeed very cool, but I think it's a bit overkill for a spell checking library. Maybe I could add a "no-spell-check" tag, so that anyone who wants to implement their own ignore logic could use it.

The "swap textbuffers" thing could be done automatically on a textbuffer change event, but the problem is that this event is also fired when the current buffer is destroyed.

carlos-jenkins commented 12 years ago

Great, you rewrote the best spell checking library in a day :P E-mail and link recognition is a black art. Consider the regexes used by the txt2tags library:

>>> import txt2tags
>>> regexes = txt2tags.getRegexes()
>>> print regexes['link'].pattern
\b((https?|ftp|news|telnet|gopher|wais)://([A-Za-z0-9_.-]+(:[^ @]*)?@)?|(www[23]?|ftp)\.)[A-Za-z0-9%._/~:,=$@&+-]+\b/*(\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?(#[A-Za-z0-9%._-]*)?|\b[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\.)+[A-Za-z]{2,4}\b(\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?
>>> print regexes['email'].pattern
\b[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\.)+[A-Za-z]{2,4}\b(\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?
>>> regexes['link'].pattern
'\\b((https?|ftp|news|telnet|gopher|wais)://([A-Za-z0-9_.-]+(:[^ @]*)?@)?|(www[23]?|ftp)\\.)[A-Za-z0-9%._/~:,=$@&+-]+\\b/*(\\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?(#[A-Za-z0-9%._-]*)?|\\b[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\\.)+[A-Za-z]{2,4}\\b(\\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?'
>>> regexes['email'].pattern
'\\b[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\\.)+[A-Za-z]{2,4}\\b(\\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?'
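
As a quick sanity check, the email pattern printed above can be tried with the standard `re` module alone, so txt2tags does not need to be installed. This is a Python 3 sketch; the test strings are made up, with the first one taken from the example earlier in this thread:

```python
import re

# Pattern copied from the regexes['email'] output above.
EMAIL = re.compile(
    r'\b[A-Za-z0-9_.-]+@([A-Za-z0-9_-]+\.)+[A-Za-z]{2,4}\b'
    r'(\?[A-Za-z0-9/%&=+:;.,$@*_-]+)?'
)

line = 'or wryte to [somebody@whatever.com]'
found = EMAIL.search(line).group(0)

# The {2,4} limit on the last label means longer top-level
# domains are not recognised by this pattern.
long_tld = EMAIL.search('mail me at someone@example.museum')
```

The brackets around the address are left out of the match, and the second search returns `None`, which is one of the reasons this stays a black art.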

Kind regards

koehlma commented 12 years ago

And I also passed my theoretical driving license exam... Yeah, a really good day and happy backporting :D

koehlma / pygtkspellcheck

Add ignore regex for text, not only words #2