AbhishekSen closed this issue 7 years ago
Hey @AbhishekSen,
If relative bigram position doesn't matter you can generate "bigram:a b", "bigram:b c", "bigram:c d" features. If you think position matters you can name them like "bigram[-2]:a b", "bigram[-1]:b c", "bigram[+1]:c d". It is just a matter of coming up with a naming scheme.
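To make the two naming schemes concrete, here is a minimal sketch. The helper name `bigram_features`, the fixed window of two tokens on each side, and the rule used to number the offsets are assumptions made for illustration; they are not part of python-crfsuite.

```python
def bigram_features(tokens, i, positional=False):
    """Bigram features for tokens[i], using two tokens of context on each side."""
    lo, hi = max(0, i - 2), min(len(tokens), i + 3)
    features = {}
    for start in range(lo, hi - 1):
        bigram = "%s %s" % (tokens[start], tokens[start + 1])
        if positional:
            # Bigrams that start before the token get negative offsets,
            # the others get positive ones (there is no offset 0).
            offset = start - i if start < i else start - i + 1
            features["bigram[%+d]:%s" % (offset, bigram)] = True
        else:
            # Position-free scheme: the bigram itself is part of the key,
            # so different bigrams never collide in the dict.
            features["bigram:%s" % bigram] = True
    return features


tokens = "a b c d e f g".split()
plain = bigram_features(tokens, 2)
positional = bigram_features(tokens, 2, positional=True)
```

Because the bigram text is part of the key, each bigram becomes its own feature, which is why a dict with unique keys is enough.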
@kmike I have a question about bigram features (without relative bigram position) built from consecutive words. In the NER example you mention that features should be given as a dict, but a dict key can appear only once.
Say we take the Wikipedia example:
Jim bought 300 shares of Acme Corp. in 2006.
the bigrams would be
a) Jim bought
b) bought 300
c) 300 shares
and so on
Can you show, for the case where relative bigram position doesn't matter, how you would pass these bigram features as input to your code?
Thanks in advance
@kaushikacharya in this NER example it'd be something like "bigram": "%s %s" % (word, word1) in the word2features function.
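As a rough sketch of where that line could go, here is a trimmed-down `word2features`. The `(word, postag)` tuple layout and the other feature names are assumptions modeled loosely on the NER example, and the POS tags in the sample sentence are only illustrative.

```python
def word2features(sent, i):
    """Features for the i-th token of sent, a list of (word, postag) tuples."""
    word = sent[i][0]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
    }
    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        # The suggested feature: the current word joined with the next one.
        features["bigram"] = "%s %s" % (word, word1)
    else:
        # Last token: there is no following word to pair with.
        features["EOS"] = True
    return features


sent = [("Jim", "NNP"), ("bought", "VBD"), ("300", "CD"), ("shares", "NNS")]
xseq = [word2features(sent, i) for i in range(len(sent))]
```

Each token has its own dict, so the "bigram" key appears only once per token even though the sentence contains many bigrams. A second bigram for the same token (for example with the previous word) would need a different key.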
@kmike Thanks for the above suggestion.
I have one more question. Here's the problem (not exactly the same, but similar) that I am trying to solve using crfsuite: Automatic classification of sentences to support Evidence Based Medicine.
The classes in my problem are different from those in the paper, but I am also trying to classify each statement. The statements have a structure: the first few lines belong to, say, class 1, the next few lines to class 2, and so on. How many lines each class covers varies, and the order of the classes need not be the same.
Page 4 of the paper lists the various types of features they used. I am also trying to use bag-of-words and bigrams to start with. That's where I am confused about how to do it and need your help.
The CRF API says X should be a list of lists of dicts and y should be a list of lists of strings. In the case of NER, each element of the outer list represents a sentence, and the inner list holds the features (in dict format) of each word in that sentence (e.g. suffix, the word itself, POS of the previous/next word).
My understanding is that, for my problem, each element of the outer list represents a document and the inner list holds the features (in dict format) of each sentence. I am able to represent sentence features like isUpper, contains date, etc.
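The shape described above can be spelled out with a tiny example. The feature names, the labels, and the `check_shapes` helper are all made up for illustration; only the list-of-lists layout comes from the description.

```python
def check_shapes(X, y):
    """Return True if X and y line up sequence by sequence and item by item."""
    if len(X) != len(y):
        return False
    for xseq, yseq in zip(X, y):
        if len(xseq) != len(yseq):
            return False
        if not all(isinstance(item, dict) for item in xseq):
            return False
        if not all(isinstance(label, str) for label in yseq):
            return False
    return True


sentence = ["Jim", "bought", "300", "shares"]

# Outer list: sentences. Inner list: one feature dict per word.
X = [[{"word": w.lower(), "suffix": w[-3:], "is_digit": w.isdigit()} for w in sentence]]

# Same nesting for the labels: one string per word (illustrative tags).
y = [["B-PER", "O", "O", "O"]]
```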
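Under that reading, the nesting could look roughly like the sketch below. The example sentences, the year regex used for "contains date", and the `is_first`/`is_last` features are hypothetical; the class names are placeholders.

```python
import re

# Hypothetical notion of "contains date": a four-digit year starting with 19 or 20.
YEAR = re.compile(r"\b(19|20)\d{2}\b")


def sentence2features(sentence, index, total):
    """One feature dict per sentence, playing the role a word plays in NER."""
    return {
        "bias": 1.0,
        "is_upper": sentence.isupper(),
        "contains_date": bool(YEAR.search(sentence)),
        "is_first": index == 0,
        "is_last": index == total - 1,
    }


def doc2features(sentences):
    return [sentence2features(s, i, len(sentences)) for i, s in enumerate(sentences)]


# Made-up document: outer list = documents, inner list = sentences.
docs = [["BACKGROUND", "The trial ran from 2004 to 2006.", "Outcomes improved."]]
X = [doc2features(doc) for doc in docs]
y = [["class 1", "class 2", "class 2"]]
```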
But how to represent these features:
Should I use
I am trying to do Named Entity Recognition, and one of my features is the list of bigrams for each token. For example, in the sentence "a b c d e f g", the bigrams (with window size 5) that I'm considering for the token "c" are [["a","b"], ["b","c"], ["c","d"], ["d","e"]]. Similarly, for the token "g" they are [["e","f"], ["f","g"]]. My question is how I should represent this as a feature in python-crfsuite. I also have many boolean features alongside it.
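One possible way to do this, sketched under some assumptions: the helper names `window_bigrams` and `token2features` and the two boolean features are invented for illustration, and the window is taken to be centered on the token. Each bigram becomes its own key, with a count as the value in case the same bigram occurs twice in the window, and the boolean features sit in the same dict.

```python
from collections import Counter


def window_bigrams(tokens, i, window=5):
    """Consecutive word pairs inside a window of `window` tokens centered on i."""
    half = window // 2
    lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
    span = tokens[lo:hi]
    return list(zip(span, span[1:]))


def token2features(tokens, i, window=5):
    token = tokens[i]
    # Boolean features and bigram features share one dict per token.
    features = {
        "bias": 1.0,
        "is_upper": token.isupper(),
        "is_digit": token.isdigit(),
    }
    counts = Counter("bigram:%s %s" % pair for pair in window_bigrams(tokens, i, window))
    for name, count in counts.items():
        features[name] = float(count)
    return features


tokens = "a b c d e f g".split()
xseq = [token2features(tokens, i) for i in range(len(tokens))]
```

Here `xseq` is the list of dicts for one sentence; a list of such sequences, with matching label lists, is the input format discussed in this thread.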