iit-cs585 / assignments

Assignments for IIT CS585

feature dictionary #20

Closed ranjeetkumar closed 7 years ago

ranjeetkumar commented 7 years ago

Hi professor/TA, all a3_test.py tests pass, but when I run a3.py I get results that differ from log.txt. Even the shapes of the training and testing data are different.

My output: training data shape: (27858, 18724) testing data shape: (28028, 18724)
Expected output: training data shape: (27858, 18287) testing data shape: (28028, 18287)

All of the remaining output is affected by the same ripple effect, so I suspect something is wrong in the feature dictionary. Suppose we have the following data:

[[('EU', 'NNP', 'I-NP', 'I-ORG'), ('rejects', 'VBZ', 'I-VP', 'O'), ('German', 'JJ', 'I-NP', 'I-MISC')]]

My code generates the following feature dictionaries:

[{'chunk=I-NP': 1, 'pos=NNP': 1, 'next_chunk=I-VP': 1, 'tok=eu': 1, 'next_tok=rejects': 1, 'is_caps': 1, 'next_pos=VBZ': 1},
 {'pos=VBZ': 1, 'next_chunk=I-NP': 1, 'prev_tok=eu': 1, 'next_pos=JJ': 1, 'tok=rejects': 1, 'chunk=I-VP': 1, 'prev_chunk=I-NP': 1, 'next_tok=german': 1, 'prev_pos=NNP': 1},
 {'chunk=I-NP': 1, 'next_tok=call': 1, 'prev_chunk=I-VP': 1, 'is_caps': 1, 'prev_tok=rejects': 1, 'next_chunk=I-NP': 1, 'tok=german': 1, 'next_pos=NN': 1, 'prev_pos=VBZ': 1, 'pos=JJ': 1}]

Is this correct? I have double-checked my code but still cannot figure out where I am going wrong. Any hints on why I am getting a different data shape? I am ignoring the "-DOCSTART- -X- -X- O" lines and blank lines in the train and test data.
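For reference, a `read_data` that skips "-DOCSTART-" lines and treats blank lines as sentence separators could be sketched like this (the function name comes from the a3.py doctest quoted later in the thread; this particular implementation is an assumption, not the assignment's code):

```python
def read_data(filename):
    """Read CoNLL-style data: one (token, POS, chunk, NER) tuple per line,
    blank lines separating sentences, '-DOCSTART-' lines skipped."""
    sentences, current = [], []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:                             # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith('-DOCSTART-'):  # skip document markers
                current.append(tuple(line.split()))
    if current:                                      # flush the final sentence
        sentences.append(current)
    return sentences
```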

theTechie commented 7 years ago

I believe we should not be ignoring "-DOCSTART- -X- -X- O".

ranjeetkumar commented 7 years ago

Thanks Gagan. After including "-DOCSTART- -X- -X- O", the training and testing shapes still do not match; now even the row counts differ from the expected values: training data shape: (27989, 18733) testing data shape: (28138, 18733)

Based on the doctest in a3.py, I am assuming we have to ignore it:

>>> train_data = read_data('train.txt')
>>> train_data[:2]
[[('EU', 'NNP', 'I-NP', 'I-ORG'), ('rejects', 'VBZ', 'I-VP', 'O'), ('German', 'JJ', 'I-NP', 'I-MISC'), ('call', 'NN', 'I-NP', 'O'), ('to', 'TO', 'I-VP', 'O'), ('boycott', 'VB', 'I-VP', 'O'), ('British', 'JJ', 'I-NP', 'I-MISC'), ('lamb', 'NN', 'I-NP', 'O'), ('.', '.', 'O', 'O')], [('Peter', 'NNP', 'I-NP', 'I-PER'), ('Blackburn', 'NNP', 'I-NP', 'I-PER')]]

I suspect something is wrong in the feature dictionary. The feature-dictionary output above was produced with all the parameters set to True.

benoit0192 commented 7 years ago

How do you get the count 27858? Removing empty lines and 'DOCSTART...' lines, I count 27867 entries.

For the second dimension, I believe you should check the make_features_dicts function. Maybe double-check the 'context' flag.

ranjeetkumar commented 7 years ago

I have ignored "DOCSTART..." and the empty lines. I think 27858 is correct; log.txt also shows 27858: https://github.com/iit-cs585/assignments/blob/master/a3/Log.txt

I am checking the context flag but still cannot find anything. Here is a slice of the first three entries of the feature dictionary built from train.txt.

[{'chunk=I-NP': 1, 'pos=NNP': 1, 'next_chunk=I-VP': 1, 'tok=eu': 1, 'next_tok=rejects': 1, 'is_caps': 1, 'next_pos=VBZ': 1},
 {'pos=VBZ': 1, 'next_chunk=I-NP': 1, 'prev_tok=eu': 1, 'next_pos=JJ': 1, 'tok=rejects': 1, 'chunk=I-VP': 1, 'prev_chunk=I-NP': 1, 'next_tok=german': 1, 'prev_pos=NNP': 1},
 {'chunk=I-NP': 1, 'next_tok=call': 1, 'prev_chunk=I-VP': 1, 'is_caps': 1, 'prev_tok=rejects': 1, 'next_chunk=I-NP': 1, 'tok=german': 1, 'next_pos=NN': 1, 'prev_pos=VBZ': 1, 'pos=JJ': 1}]

Do you find anything wrong in this feature dictionary?

aronwc commented 7 years ago

@ranjeetkumar prev_is_caps=1 is missing from the second dict.

Also, I was missing the final sentence in my Log.txt. I've pushed an update.
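The missing prev_is_caps suggests that context features must copy all of a neighbor's base features, including binary ones like is_caps, not only tok/pos/chunk. A sketch of that idea (make_features_dicts is named in the thread, but this signature and the token_features helper are assumptions, not the assignment's actual code):

```python
def token_features(tok, pos, chunk):
    # Base features for one (token, POS, chunk) triple.
    feats = {'tok=' + tok.lower(): 1, 'pos=' + pos: 1, 'chunk=' + chunk: 1}
    if tok[0].isupper():  # capitalised first letter
        feats['is_caps'] = 1
    return feats

def make_features_dicts(sentence):
    # sentence: list of (token, POS, chunk, NER) tuples.
    base = [token_features(t, p, c) for t, p, c, _ in sentence]
    dicts = []
    for i, feats in enumerate(base):
        d = dict(feats)
        if i > 0:              # copy EVERY previous-token feature, is_caps included
            d.update({'prev_' + k: v for k, v in base[i - 1].items()})
        if i + 1 < len(base):  # likewise for the next token
            d.update({'next_' + k: v for k, v in base[i + 1].items()})
        dicts.append(d)
    return dicts
```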

benoit0192 commented 7 years ago

Yes everything's matching now.

ranjeetkumar commented 7 years ago

Thanks @aronwc @benoit0192. I too had missed the final sentence, and I had put an extra "=" in is_caps; that was why the problem was occurring. Still, I am not able to reproduce the exact answer from log.txt. My remaining doubt is in the calculation of the average F1 score.

Given the evaluation matrix, how do we calculate the average F1 score? From the log, the evaluation matrix is:

          I-LOC     I-MISC    I-ORG     I-PER     O
precision 0.745115  0.865672  0.673111  0.705882  0.972336
recall    0.729565  0.407733  0.377340  0.856041  0.990355
f1        0.737258  0.554361  0.483586  0.773744  0.981263

average f1s: 0.591735

Average(0.737258, 0.554361 , 0.483586, 0.773744) != 0.591

Is there another method used for calculating the average F1?
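Checking the arithmetic on the logged per-class f1 values: neither the macro-average over all five classes nor the macro-average excluding 'O' reproduces 0.591735, so the mismatch is not simply a different averaging convention (the numbers below are copied from the log excerpt above):

```python
f1 = {'I-LOC': 0.737258, 'I-MISC': 0.554361, 'I-ORG': 0.483586,
      'I-PER': 0.773744, 'O': 0.981263}

macro_all = sum(f1.values()) / len(f1)                      # ~0.7060
macro_no_o = sum(v for k, v in f1.items() if k != 'O') / 4  # ~0.6372
# Neither is anywhere near the logged 0.591735.
```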

aronwc commented 7 years ago

My fault - fixed.


benoit0192 commented 7 years ago

@ranjeetkumar Have you been able to match the confusion matrix with the log file? I have one misclassified instance compared to the log...

        I-LOC  I-MISC  I-ORG  I-PER      O
I-LOC     839      10     76    119    106
I-MISC     46     232     32     43    216
I-ORG     136      17    383    259    220
I-PER      57       3     37   1332    127
O          48       6   40(!)   134  23515

@aronwc According to the confusion matrix values, it appears that isupper() has been applied to the whole string instead of just the first letter. As a result, 'The' is not counted as is_caps. Is that intended?

# Here is what I mean
s = 'Hello'
s.isupper()       # False
s[0].isupper()    # True

Because of that, the F1 values are lower than they could be.

ranjeetkumar commented 7 years ago

@benoit0192 So far I have not been able to get the correct result. I am checking case by case: out of the sixteen configurations, I get the right result in two cases, i.e. pos=True or chunk=True along with token=True and the rest False. In the all-True case my confusion matrix is different. I am still debugging it.

ranjeetkumar commented 7 years ago

One more observation: when I check whether the whole token is uppercase, the f1 matches the log file:

f1        nparams  iscap  pos    chunk  context
0.367003  30920    True   False  False  False

When I check only whether the first character of the token is uppercase, I get:

f1        nparams  iscap  pos    chunk  context
0.637879  30920    True   False  False  False

I guess the question is asking whether the first character of the token is uppercase.

ranjeetkumar commented 7 years ago

@aronwc @benoit0192 When I run the simplest configuration with context enabled, i.e. token=True, caps=False, pos=False, chunk=False, context=True:

The first 5 dicts of the training data are:

{'tok=eu': 1, 'next_tok=rejects': 1}
{'next_tok=german': 1, 'tok=rejects': 1, 'prev_tok=eu': 1}
{'next_tok=call': 1, 'tok=german': 1, 'prev_tok=rejects': 1}
{'prev_tok=german': 1, 'tok=call': 1, 'next_tok=to': 1}
{'tok=to': 1, 'next_tok=boycott': 1, 'prev_tok=call': 1}

The last 5 dicts of the training data are:

{'prev_tok=belgian': 1, 'next_tok=prix': 1, 'tok=grand': 1}
{'tok=prix': 1, 'prev_tok=grand': 1, 'next_tok=practice': 1}
{'next_tok=times': 1, 'prev_tok=prix': 1, 'tok=practice': 1}
{'next_tok=.': 1, 'tok=times': 1, 'prev_tok=practice': 1}
{'prev_tok=times': 1, 'tok=.': 1}

This looks okay to me. If it is correct, then I suspect the results in the log where context is enabled are wrong. For f1 I get 0.467399 with nparams 92745, whereas the log has 0.467874 and 90550 respectively.

aronwc commented 7 years ago

@ranjeetkumar Right - I fixed the is_caps feature and reran.

I get the same dicts as in your example above.

ranjeetkumar commented 7 years ago

@aronwc Since the dicts go into the same DictVectorizer, the results should come out the same.

From another angle: my results match except in the cases where context is True. Taking these rows from log.txt:

   f1        n_params  caps   pos    chunk  context
0  0.330491  30915     False  False  False  False
1  0.467874  90550     False  False  False  True

Assuming row 0 is correct:

   f1        n_params  caps   pos    chunk  context
0  0.330491  30915     False  False  False  False

the total number of unique features is 30915 / 5 = 6183.

Now suppose context and token are both True and the rest are False, so we also add the previous and next tokens. In the worst case (all tokens unique), the minimum number of unique features is 6183 x 3 - 2 = 18547, and the maximum is 6183 x 3 = 18549 (some overlap among features). The 3 comes from (tok=, prev_tok=, next_tok=), and the 2 comes from the first and last dictionaries, which have length 2 and so contain only prev_tok or next_tok along with tok.

The range of n_params when context and token are both True and the rest are False is therefore: min = 18547 x 5 = 92735, max = 18549 x 5 = 92745 (5 is the number of classes).

The value 90550 does not fall into this range. This is my intuition, and my own result does fall within these bounds. Please correct me if I am going wrong.
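The bound above can be checked in a couple of lines (this just restates the arithmetic of the argument in this comment):

```python
base = 30915 // 5           # 6183 unique features in the token-only run
n_min = (base * 3 - 2) * 5  # tok=, prev_tok=, next_tok= per feature, minus the
                            # two entries missing at the first/last positions
n_max = base * 3 * 5
print(n_min, n_max)
```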

aronwc commented 7 years ago

@ranjeetkumar To confirm: the context features should not cross sentence boundaries.

E.g., if sentence one is "A brown dog" and sentence two is "The black cat", the context features for the token "The" should not include "prev_tok=dog"
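Building the context features one sentence at a time enforces this automatically. A minimal token-only sketch (the helper name is assumed) illustrating the example above:

```python
def context_dicts(sentences):
    # sentences: list of token lists; prev_/next_ features never cross a
    # sentence boundary because each sentence is processed independently.
    dicts = []
    for sent in sentences:
        toks = [t.lower() for t in sent]
        for i, tok in enumerate(toks):
            d = {'tok=' + tok: 1}
            if i > 0:
                d['prev_tok=' + toks[i - 1]] = 1
            if i + 1 < len(toks):
                d['next_tok=' + toks[i + 1]] = 1
            dicts.append(d)
    return dicts

ds = context_dicts([['A', 'brown', 'dog'], ['The', 'black', 'cat']])
# The dict for 'The' gets next_tok=black but no prev_tok at all.
```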

ranjeetkumar commented 7 years ago

@aronwc Thanks a lot, professor! I had included context features across sentence boundaries. I have removed that, and now my result matches log.txt perfectly.

changediyasunny commented 7 years ago

Hey guys, I rechecked all the parameters again and still get the same feature shapes as mentioned before. What is missing to get 27867?

total labels: 27858
training data shape: (27858, 18726)

[{'is_caps': 1, 'chunk=I-NP': 1, 'tok=eu': 1, 'next_pos=VBZ': 1, 'next_chunk=I-VP': 1, 'next_tok=rejects': 1, 'pos=NNP': 1},
 {'pos=VBZ': 1, 'prev_tok=eu': 1, 'next_tok=german': 1, 'tok=rejects': 1, 'next_pos=JJ': 1, 'next_chunk=I-NP': 1, 'next_is_caps': 1, 'prev_chunk=I-NP': 1, 'prev_is_caps': 1, 'prev_pos=NNP': 1, 'chunk=I-VP': 1},
 {'next_pos=NN': 1, 'is_caps': 1, 'next_chunk=I-NP': 1, 'tok=german': 1, 'chunk=I-NP': 1, 'prev_chunk=I-VP': 1, 'pos=JJ': 1, 'prev_pos=VBZ': 1, 'next_tok=call': 1, 'prev_tok=rejects': 1}]

Regards, Sunny