nrcpp / NltkNet

NLTK library wrapper for .NET
MIT License

NltkNet


C# wrapper for NLTK library (http://nltk.org)

Frameworks supported

Dependencies

Pre-Requirements

Before using the NltkNet wrapper, download and install the latest IronPython binaries from the official site. You will need the IronPython standard libraries, which NLTK depends on, and the NLTK library itself installed for IronPython. The IronPython interpreter is also useful for testing Python scripts interactively from Visual Studio or the command line.

Most developers are expected to already have experience with NLTK in Python and to be looking for a way to use it from C#. The guides in this section are mainly for developers who have just started learning NLP with NLTK and do not have much experience with Python.

Installing IronPython

Add IronPython environment to Visual Studio

Install NLTK library for IronPython

There are several ways to install the NLTK library. If you already have experience installing Python packages, the process will be familiar.

Install NLTK corpuses

A corpus (plural: corpora), or text corpus, is a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory.

The NLTK library includes many ready-to-use corpora, usually stored as sets of text files. When studying NLP with NLTK, it is convenient to load an existing corpus instead of creating one from scratch.
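
To make the idea concrete, here is a minimal sketch in plain Python (standard library only; it does not use the NLTK or NltkNet API). The file names and texts in `toy_corpus` are invented for illustration, and the tokenizer is deliberately naive. It shows the kind of "checking occurrences" a corpus is used for:

```python
import re
from collections import Counter

# Hypothetical miniature corpus: file id -> raw text (invented for illustration).
toy_corpus = {
    "speech1.txt": "The people of the nation trust the people.",
    "speech2.txt": "A nation is built by its people.",
}

def words(text):
    # Lowercase the text and keep only runs of letters as word tokens.
    return re.findall(r"[a-z]+", text.lower())

def word_counts(corpus):
    # Count how often each word occurs across every file in the corpus.
    counts = Counter()
    for text in corpus.values():
        counts.update(words(text))
    return counts

counts = word_counts(toy_corpus)
print(counts["people"])  # prints 3
print(counts["nation"])  # prints 2
```

A real NLTK corpus works on the same principle, but reads its texts from files on disk and ships with proper tokenization.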

If you are using NLTK to learn NLP, download the corpora and linguistic data used by the NLTK book.

To do so, run the following script from either the Visual Studio Python Interactive Window or the IronPython command line:

import nltk 
import nltk.corpus
nltk.download('book')
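
After the download finishes, it can help to confirm that the data landed where you expect. The sketch below uses only the standard library and assumes the per-user default location mentioned in the corpus sample of this README (`%appdata%\nltk_data\corpora`). The helper names `corpus_dir` and `is_corpus_installed` are hypothetical, not part of NLTK or NltkNet:

```python
import os

def corpus_dir(name, data_root=None):
    # Assumption: NLTK data was downloaded to the per-user default,
    # <APPDATA>/nltk_data on Windows; pass data_root if yours differs.
    if data_root is None:
        appdata = os.environ.get("APPDATA", os.path.expanduser("~"))
        data_root = os.path.join(appdata, "nltk_data")
    return os.path.join(data_root, "corpora", name)

def is_corpus_installed(name, data_root=None):
    # A corpus may be stored unpacked (a folder) or as a .zip archive.
    path = corpus_dir(name, data_root)
    return os.path.isdir(path) or os.path.isfile(path + ".zip")

for name in ("brown", "inaugural", "stopwords"):
    print(name + ": " + str(is_corpus_installed(name)))
```

If a corpus is reported as missing but the download succeeded, the data was probably saved to a different folder; pass that folder as `data_root`.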

Getting Started

Once all third-party components are in place, we are ready to test NltkNet. Install the NltkNet NuGet package in your usual way, for example by pasting the following into the Package Manager Console:

Install-Package NltkNet

Use the following code to initialize the paths to the IronPython standard and third-party libraries:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using NltkNet;

namespace TestApp
{
    class Program
    {        
        static void Main(string[] args)
        {
            Nltk.Init(new List<string>
            {
                @"C:\IronPython27\Lib",                 // Path to IronPython standard libraries
                @"C:\IronPython27\Lib\site-packages",   // Path to IronPython third-party libraries
            });           
        }
    }
}

More practical samples

using System;
using System.Collections.Generic;
using System.Dynamic;
using System.IO;
using System.Linq;
using NltkNet;

namespace TestApp
{
    class Program
    {
        static string text = "This IronPython script works fine when I run it by itself.";

        private static void TestNltkResultClass()
        {
            var corpus = new Nltk.Corpus.Inaugural();

            // example of using NltkResult class
            var fileidsResult = corpus.FileIds();

            // Get .NET List<string>
            List<string> fileidsNet = fileidsResult.AsNet;                          

            // Get IronPython.Runtime.List
            IronPython.Runtime.List fileidsPython = fileidsResult.AsPython;         

            // Cast to Dynamic to access object fields in Python-like style
            dynamic fileids = fileidsResult;                                        

            // using DynamicObject
            Console.WriteLine(fileids[0]);
            Console.WriteLine(fileids.__len__());

            // access sentences (list of list of strings)
            var sentencesResult = corpus.Sents();
            dynamic sentences = sentencesResult;

            // Working with the Python object: first word of the first sentence
            Console.WriteLine(sentences[0][0]);        
            List<List<string>> netSentences = sentencesResult.AsNet;

            Console.WriteLine(netSentences[0][0]);              // the same with .NET object
            Console.WriteLine(netSentences.First().First());     // using LINQ
        }

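        static void TestTokenize()
        {
            // Character spans of the tokens obtained by splitting the text on whitespace
            var tuples = Nltk.Tokenize.Util.RegexpSpanTokenize(text, "\\s");

            // Split the text into sentences and print them from the .NET list
            var list = Nltk.Tokenize.SentTokenize(text).AsNet;
            foreach (var item in list)
                Console.Write(item + ", ");
        }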

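        static void TestProbability()
        {
            // Build a frequency distribution over the word tokens of the text
            var words = Nltk.Tokenize.WordTokenize(text);
            var fd = new Nltk.Probability.FreqDist(words.AsPython);

            // null is passed as the count (assumed to mirror NLTK's most_common(None), i.e. no limit)
            var result = fd.MostCommon(null).AsNet;
            foreach (var item in result)
                Console.WriteLine(item.Key + ": " + item.Value);
        }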

        static void TestStem()
        {
            var stemmer = new Nltk.Stem.PorterStemmer();
            var words = new List<string>() { "program", "programs", "programmer", "programming", "programmers" };

            // Stem a single word, then every word in the list
            var stem = stemmer.Stem("girls");
            Console.WriteLine("Stem: " + stem);
            foreach (var word in words)
                Console.WriteLine(word + " -> " + stemmer.Stem(word));

            // Lemmatize a word using WordNet
            var lemmatizer = new Nltk.Stem.WordNetLemmatizer();
            Console.WriteLine("Lemmatize: " + lemmatizer.Lemmatize("best"));
        }

        private static void TestCorpus()
        {
            // NOTE: the Brown corpus has to be installed, by default in %appdata%\nltk_data\corpora\brown
            // See https://github.com/nrcpp/NltkNet/blob/master/NltkNet/Nltk/Nltk.Corpus.cs for more corpora
            var corpus = new Nltk.Corpus.Brown();

            var fileidsResult = corpus.FileIds();
            List<string> fileidsNet = fileidsResult.AsNet;
            dynamic fileids = fileidsResult;

            Console.WriteLine(fileids[0]);

            var words = corpus.Words(fileidsNet.First());
            var sentences = corpus.Sents(fileidsNet.First());
            var paragraphs = corpus.Paras(fileidsNet.First());
            string text = corpus.Raw(fileidsNet.First());
            var taggedWords = corpus.TaggedWords(fileidsNet.First());

            var stopWordsCorpus = new Nltk.Corpus.StopWords();
            var stopWords = stopWordsCorpus.Words(null);

            // Print the stop words and the words read from the Brown corpus
            Console.WriteLine("Stopwords: \r\n" + string.Join(", ", stopWords));
            Console.WriteLine("Words from Brown corpus: \r\n" + string.Join(", ", words));
        }

        static void Main(string[] args)
        {
            Nltk.Init(new List<string>
            {
                @"C:\IronPython27\Lib",
                @"C:\IronPython27\Lib\site-packages",
            });

            TestNltkResultClass();
            TestCorpus();
            TestTokenize();
            TestProbability();
            TestStem();
        }
    }
}

What if the wrapper lacks an NLTK feature you need?

NltkNet can be considered a starting point for learning NLP with C# and Visual Studio. The current version does not cover many features of the original NLTK library written in Python. You can use workarounds to execute Python code that has not been wrapped yet. The first is direct access to the Nltk.Py property, which lets you execute any IronPython script, including wrappers around method calls and object creation. The example below illustrates using unwrapped NLTK features:

using NltkNet;
using System;
using System.Collections.Generic;
namespace TestApp
{
    class Workarounds
    {
        public static void TestPurePython()
        {
            // Initialization required
            Nltk.Init(new List<string>
            {
                @"C:\IronPython27\Lib",
                @"C:\IronPython27\Lib\site-packages",
            });

            // Imports NLTK corpus module
            Nltk.Py.ImportModule("nltk.corpus");

            // Import 'names' object to access corpus content
            Nltk.Py.ExecuteScript("from nltk.corpus import names");

            // Get object by name
            dynamic namesObj = Nltk.Py.GetObject("names");

            // Call object's method 'names.words()'
            dynamic namesList = Nltk.Py.CallMethod(namesObj, "words");

            foreach (var name in namesList)
                Console.Write(name + ", ");
        }
    }
}

Another example, this time using IronPython built-in functions:

using NltkNet;

namespace TestApp
{
    using System;
    using System.Collections.Generic;

    // Use the NltkNet.BuiltIns static class to access Print, Str, Len, List, Sorted, Import and other IronPython built-in functions.
    // See https://ironpython-test.readthedocs.io/en/latest/library/functions.html for details
    using static NltkNet.BuiltIns;

    static class TestBuiltIn
    {        
        public static void OverallTest()
        {
            Nltk.Init(new List<string>()
            {
                @"C:\IronPython27\Lib",
                @"C:\IronPython27\Lib\site-packages",
            });

            TestImport();
            TestStandard();
        }

        // Using '__import__' built-in to import 'nltk.corpus.brown' and 'nltk.corpus.wordnet'
        public static void TestImport()
        {            
            dynamic corpuses = ImportNames("nltk.corpus", "brown", "wordnet");
            dynamic brown = corpuses.brown;
            dynamic wordnet = corpuses.wordnet;

            Print(brown.words(brown.fileids()[0]));
        }

        // Test standard Python 2.7 functions (exposed with PascalCase names): Len, Str, List, Sorted, Range, Zip, etc.
        public static void TestStandard()
        {
            var lst1 = new List<int> { 5, 4, 3, 2, 1, 5, 4, 3, 2, 1 };
            var lst2 = new List<int> { 10, 20, 30, 40, 50 };
            var lst3 = new List<string>() { "A", "B", "C", "D" };

            Print("Len: " + Len(lst1));
            Print("Sorted: " + Str(Sorted(lst1)));
            var tuple = (1, 2, "str");

            Print("Tuple2List: " + tuple);
            Print(List(tuple));

            Print("Set: " + Str(Set(lst1)));
            var range = Range(0, 30, 3);
            Print("Range: " + Str(List(range)));

            Print("Zip: " + Str(Zip(lst1, lst2)));

            Print(Globals());
            Console.ReadLine();            
        }
    }
}
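
For comparison, the sketch below runs the Python built-ins that the wrapper functions in `TestStandard` are named after, on the same inputs. It is plain Python, independent of NltkNet, and is only a reference for what the underlying built-ins return; the exact text printed by the C# wrappers may differ. `list(...)` is wrapped around `zip` and `range` so the results are lists on any Python version:

```python
lst1 = [5, 4, 3, 2, 1, 5, 4, 3, 2, 1]
lst2 = [10, 20, 30, 40, 50]

length = len(lst1)
sorted_lst = sorted(lst1)
unique_sorted = sorted(set(lst1))   # a set has no fixed order, so sort it for display
from_tuple = list((1, 2, "str"))
range_lst = list(range(0, 30, 3))
zipped = list(zip(lst1, lst2))      # zip stops at the shorter input

print("Len: " + str(length))             # Len: 10
print("Sorted: " + str(sorted_lst))      # Sorted: [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print("Set: " + str(unique_sorted))      # Set: [1, 2, 3, 4, 5]
print("Tuple2List: " + str(from_tuple))  # Tuple2List: [1, 2, 'str']
print("Range: " + str(range_lst))        # Range: [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
print("Zip: " + str(zipped))             # Zip: [(5, 10), (4, 20), (3, 30), (2, 40), (1, 50)]
```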