openSUSE / suse-doc-style-checker

Style Checker for SUSE Documentation

Research better tokenization/sentence segmentation/adding PoS tagging #5

Open tomschr opened 8 years ago

tomschr commented 8 years ago

It seems the Natural Language Toolkit (NLTK) could be a helpful addition to SDSC.

Some things to watch for:

ghost commented 8 years ago

Also, we should make sure NLTK solves a problem. :)

From my pov, SDSC's language department has room for improvement in these areas:

And it would be nice to add a solution for this area:

Finally, we need to check if there are license issues with NLTK (especially if we are using one of their corpora for anything).

tomschr commented 8 years ago

Also, we should make sure NLTK solves a problem. :)

Wouldn't that be the case? ;) I'm still learning NLTK but what I've read so far makes me believe it can be helpful in those areas you've mentioned before.

Especially where regular expressions (which make up the main part of SDSC) reach their limits, NLTK could give more accurate results. Of course, it boils down to how serious we are about style checking and whether we want to refactor some parts.
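To illustrate the kind of limit meant here, below is a minimal sketch of regex-based sentence splitting. It is not SDSC's actual code: the function names, the pattern, and the abbreviation list are all hypothetical and only serve the example.

```python
import re

# Naive boundary: whitespace that follows ".", "!" or "?".
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Hypothetical abbreviation list; a real checker would need a far longer one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "vs."}


def split_naive(text):
    """Split after every sentence-final punctuation mark followed by whitespace."""
    return [piece for piece in NAIVE_BOUNDARY.split(text.strip()) if piece]


def split_with_abbreviations(text):
    """Like split_naive, but re-join pieces that end in a known abbreviation.

    Pieces are re-joined with a single space, so original spacing is not kept.
    """
    sentences = []
    carry = ""
    for piece in split_naive(text):
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            carry = candidate  # probably not a real boundary, keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


sample = "Use a package manager, e.g. zypper. Then reboot."
print(split_naive(sample))
# ['Use a package manager, e.g.', 'zypper.', 'Then reboot.']
print(split_with_abbreviations(sample))
# ['Use a package manager, e.g. zypper.', 'Then reboot.']
```

The abbreviation list fixes this one case, but it still gets things wrong when a sentence really does end in "etc.", and the list has to be maintained by hand. That is the sort of problem where a dedicated sentence segmenter might give more accurate results.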

ghost commented 8 years ago

Also, we should make sure NLTK solves a problem. :)

Wouldn't that be the case? ;)

I made this remark mostly because the initial report sounds a little like "let's add NLTK because that sounds fun". Which obviously is not what you wanted to say, hence my mentioning of areas where it might be helpful.

In other words, the initial bug report put the cart in front of the horse. The main focus should be on solving issues, not on adding tools. That is all.

Also, I think it might be helpful to make this a bit more generic. NLTK is probably not the only way to process natural language and it certainly is anything but lightweight. So, if there are alternatives, it would be good to explore those too.

(E.g. what is LanguageTool using?/Can we adapt that without gaining a Java dependency?)

tomschr commented 8 years ago

In other words, the initial bug report put the cart in front of the horse. The main focus should be on solving issues, not on adding tools. That is all.

Right, I agree.

NLTK is probably not the only way to process natural language and it certainly is anything but lightweight. So, if there are alternatives, it would be good to explore those too.

I found TextBlob (http://textblob.readthedocs.org / https://github.com/sloria/textblob). It looks quite good and seems easier to use than NLTK.

what is LanguageTool using?/Can we adapt that without gaining a Java dependency

Not sure, I haven't investigated much. It seems to be a pure Java library.

ghost commented 8 years ago
Vogtinator commented 8 years ago

Another alternative might be segtok (https://github.com/fnl/segtok). It's a Python-only library without any dependencies.

ghost commented 8 years ago

More tools to look at:

ghost commented 7 years ago

Two more things: