How to include nested list features ?

AbhishekSen commented 8 years ago

I am trying to do Named Entity Recognition, and one of my features is the list of bigrams around each token. For example, given the sentence "a b c d e f g" and a window size of 5, the bigrams I consider for token "c" are [["a","b"], ["b","c"], ["c","d"], ["d","e"]]. Similarly, for token "g" they are [["e","f"], ["f","g"]]. How should I represent this as a feature in python-crfsuite? I also have many boolean features alongside it.

kmike commented 8 years ago

Hey @AbhishekSen,

If relative bigram position doesn't matter, you can generate "bigram:a b", "bigram:b c", "bigram:c d" features. If you think position matters, you can name them like "bigram[-2]:a b", "bigram[-1]:b c", "bigram[+1]:c d". It is just a matter of coming up with a naming scheme.
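A minimal sketch of the naming-scheme idea, in plain Python. The helper name `window_bigram_features`, its parameters, and the offset convention (the offset is the position of the bigram's first token relative to the current token) are assumptions made for illustration; they are not part of python-crfsuite, and the offsets may not match the labels in the comment above exactly.

```python
def window_bigram_features(tokens, i, window=5, positional=False):
    """Return a feature dict with one entry per bigram in the window around tokens[i].

    Hypothetical helper: the bigram text is folded into the feature name,
    so every bigram gets its own dict key.
    """
    half = window // 2
    start = max(0, i - half)
    end = min(len(tokens), i + half + 1)
    features = {}
    for j in range(start, end - 1):
        bigram = "%s %s" % (tokens[j], tokens[j + 1])
        if positional:
            # Assumed convention: offset of the bigram's first token from token i.
            name = "bigram[%+d]:%s" % (j - i, bigram)
        else:
            name = "bigram:%s" % bigram
        features[name] = True
    return features


tokens = "a b c d e f g".split()
c_features = window_bigram_features(tokens, 2)                   # token "c"
g_features = window_bigram_features(tokens, 6, positional=True)  # token "g"
```

Because each bigram ends up in a distinct feature name, these entries can sit in the same dict as any other boolean features for the token.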

kaushikacharya commented 6 years ago

@kmike I have a question about the bigram features (without relative bigram position) built from consecutive words. In the NER example you mention that features should be given as a dict, but a dict key can appear only once.

Take the Wikipedia example: "Jim bought 300 shares of Acme Corp. in 2006." The bigrams would be a) Jim bought, b) bought 300, c) 300 shares, and so on.

For the case where relative bigram position doesn't matter, can you show how you would pass these bigram features as input to your code?

Thanks in advance

kmike commented 6 years ago

@kaushikacharya in this NER example it'd be something like "bigram": "%s %s" % (word, word1) in the word2features function.
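A rough sketch of how that line might fit into a `word2features`-style function. This is a simplified stand-in, not the function from the NER example: it assumes `sent` is a plain list of token strings, that `word1` is the next token, and it keeps only a few illustrative features.

```python
def word2features(sent, i):
    """Simplified sketch: sent is assumed to be a list of token strings."""
    word = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
    }
    if i == 0:
        features["BOS"] = True
    if i < len(sent) - 1:
        word1 = sent[i + 1]  # assumed: word1 is the next token
        # One "bigram" key per token; its value is this token's bigram.
        features["bigram"] = "%s %s" % (word, word1)
    else:
        features["EOS"] = True
    return features


sent = "Jim bought 300 shares of Acme Corp. in 2006".split()
first = word2features(sent, 0)
last = word2features(sent, len(sent) - 1)
```

Each token gets its own dict, so the single "bigram" key is not duplicated: the token "Jim" carries "Jim bought", the token "bought" carries "bought 300", and so on.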

kaushikacharya commented 6 years ago

@kmike Thanks for the above suggestion.

I have one more question. The problem I am trying to solve with crfsuite is similar, though not identical, to the one in the paper "Automatic classification of sentences to support Evidence Based Medicine".

Although my classes are different from the paper's, I am also trying to classify each statement. The statements have a structure: the first few lines belong to, say, class 1, the next few lines to class 2, and so on. The number of lines per class varies, and the order of the classes need not be the same.

Page 4 of the paper lists the types of features they used. I am trying to start with bag-of-words and bigrams. That's where I am confused about how to proceed, and I need your help.

The CRF API says X should be a list of lists of dicts and y should be a list of lists of strings. In the case of NER, each element of the outer list represents a sentence, and the inner list holds the features (in dict format) of each word in that sentence (e.g. the suffix, the word itself, the POS of the previous/next word).

My understanding is that for my problem each element of the outer list represents a document, and the inner list holds the features (in dict format) of each sentence. I am able to represent sentence-level features such as isUpper, contains date, etc.

But how should I represent these features?

- the words of the sentence as a bag of words
- the bigrams of the sentence

Should I use

- CountVectorizer from sklearn (as used in the newsgroup classification problem), OR
- features['word1']=True, features['word2']=True, etc. for each word present in the sentence (the way you used features['BOS'] and features['EOS'] in the NER example)?
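For illustration only, the second option could be sketched as below. The helper names `sentence2features` and `doc2features`, the regex tokenizer, and the toy document and labels are all made up for this example. It marks each word and bigram as present (True) rather than counting occurrences.

```python
import re


def sentence2features(sentence):
    """Hypothetical helper: bag-of-words and bigram presence features for one sentence."""
    tokens = re.findall(r"\w+", sentence.lower())  # naive tokenizer, an assumption
    features = {"bias": 1.0}
    for tok in tokens:
        features["word:" + tok] = True
    for left, right in zip(tokens, tokens[1:]):
        features["bigram:%s %s" % (left, right)] = True
    return features


def doc2features(sentences):
    """One feature dict per sentence; a document is the sequence being labelled."""
    return [sentence2features(s) for s in sentences]


# Toy data, invented for this sketch: one document with two sentences.
docs = [["Aspirin reduces pain.", "We studied forty patients."]]
X = [doc2features(d) for d in docs]  # list of lists of dicts
y = [["class1", "class2"]]           # list of lists of strings
```

The word or bigram is part of the feature name, so a sentence with many distinct words simply yields many keys in one dict. If counts matter more than presence, a float value per key may be an alternative to True.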

scrapinghub / python-crfsuite

How to include nested list features ? #30