Open tomschr opened 8 years ago
Also, we should make sure NLTK solves a problem. :)
From my pov, SDSC's language department has room for improvement in these areas:
And it would be nice to add a solution for this area:
Finally, we need to check if there are license issues with NLTK (especially if we are using one of their corpora for anything).
> Also, we should make sure NLTK solves a problem. :)
Wouldn't that be the case? ;) I'm still learning NLTK but what I've read so far makes me believe it can be helpful in those areas you've mentioned before.
Especially where regular expressions (which are the main part) reach their limits, NLTK could give more accurate results. Of course, it boils down to how serious we are about style checking and whether we want to refactor some parts.
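To make the point about regular expressions concrete, here is a minimal sketch of a naive regex-based sentence splitter tripping over an abbreviation. The function name, the pattern, and the sample text are hypothetical illustrations and are not taken from SDSC's code.

```python
import re

# Naive rule (an assumption for illustration): a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
NAIVE_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    """Split text into sentences using only the regular expression above."""
    return [part for part in NAIVE_SENTENCE_END.split(text.strip()) if part]


# Hypothetical sample: two sentences, the first one contains "e.g."
sample = "Use a package manager, e.g. zypper. Then reboot."
sentences = naive_split(sample)

for sentence in sentences:
    print(repr(sentence))
# The pattern also splits after "e.g.", so three pieces come back instead of two.
```

The pattern could be patched with a list of known abbreviations, but that list keeps growing. Trained sentence tokenizers, such as the Punkt tokenizer behind NLTK's `sent_tokenize`, are designed to cope with abbreviations like this; whether that helps on real SDSC input would still need to be checked.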
> > Also, we should make sure NLTK solves a problem. :)
>
> Wouldn't that be the case? ;)
I made this remark mostly because the initial report sounds a little like "let's add NLTK because that sounds fun". Which obviously is not what you wanted to say, hence my mentioning of areas where it might be helpful.
In other words, the initial bug report put the cart in front of the horse. The main focus should be on solving issues, not on adding tools. That is all.
Also, I think it might be helpful to make this a bit more generic. NLTK is probably not the only way to process natural language, and it certainly is anything but lightweight. So, if there are alternatives, it would be good to explore those too.
(E.g., what is LanguageTool using? Can we adapt that without gaining a Java dependency?)
> In other words, the initial bug report put the cart in front of the horse. The main focus should be on solving issues, not on adding tools. That is all.
Right, I agree.
> NLTK is probably not the only way to process natural language, and it certainly is anything but lightweight. So, if there are alternatives, it would be good to explore those too.
I found TextBlob (http://textblob.readthedocs.org, https://github.com/sloria/textblob). It looks quite good and seems easier to use than NLTK.
> What is LanguageTool using? Can we adapt that without gaining a Java dependency?
Not sure, I haven't investigated much. It seems to be a pure Java library.
Another alternative might be segtok (https://github.com/fnl/segtok). It's a Python-only library without any dependencies.
More tools to look at:
Two more things:
It seems the Natural Language Toolkit (NLTK) could be a helpful addition to SDSC.
Some things to watch for: