kanerv / TuukkaSaanaKanerva

Course project for building NLP applications
0 stars 0 forks source link

[M] Fix verbs "other 0%" bug. #65

Closed TuukkaOT closed 3 years ago

TuukkaOT commented 3 years ago

So, I tried a bit of debugging by adding some print commands in the plot functions. I found that the "verbs = []" list in the verb plot function after searching for "act" has only 4 elements but it shows 9 verbs + "other 0%". I tried to figure out why, but because I didn't write any of the code in that function, I have no idea how to pull it apart. debugging

kanerv commented 3 years ago

The listverbs is only used inside the first for loop that goes through the instances in graph_matches. Therefore, the content of verbsis rewritten for every document in graph_mathces. When you print the content of verbsat the end of the whole function, it only contains the verbs that appear in the last document (i.e. content) of graph_matches

So, the four verbs (see, continue, create, hit) printed in your experiment, do actually match the last document! The same goes for the adjectives.

Screenshot 2021-03-06 at 13 17 40

Unfortunately, I don't know what's wrong either. If we cannot figure this out by Monday, I might just contact Raul and attend the last extra session 😅

kanerv commented 3 years ago

I was just going to quickly fix the debugging printing before my next meeting, but I found the bug! (and a few other little bugs) and I think I fixed them!

The problem with Other 0% was a typo in for adj in ranked_adjectives[9:]: remaining_adj_count += dist_dict[adj]

I had a 10 as the index for the ranked_adjectives, but indexing starts at zero, so the 10th item is at the index 9, not 10 😅

I also realized that the Other wasn't counted properly. The +=was the wrong way around =+, when it just assigned the last value of the last ranked adjective as the sum of all Other adjectives. This is now fixed as well for both adjectives and verbs.

Please do test this if you have time! My next meeting starts in 1min 😅

TuukkaOT commented 3 years ago

Awesome! I did some testing too and it seems to be working.

kanerv commented 3 years ago

Well... Now that I also fixed the count of Other variable the pie plots get messy when there are more hits 😢

Screenshot 2021-03-06 at 16 08 08
kanerv commented 3 years ago
Screenshot 2021-03-06 at 16 37 44

I experimented if we would turn the pie plots into bar plots when there are over 50 different tags (i.e., need to decide a point when the pie labels turn unreadable), but I don't really like the look of this so I haven't pushed the changes... I wonder if there is an easy way to detach the pie labels from the pie and just have them on the side to refer to colors so something. An easy way out would just be excluding the Other variable all together, but I wonder if that's cheating the system...

kanerv commented 3 years ago

Ok, I need your help.

To avoid the situation, where adjective and verb plots turn unreadable, I did two possible solutions. The problem appears when there are a lot of different adjectives that the Other section overgrows the rest of the pieplot.

Version 1: In the adjective plot I've just left the Other variable out of the plot when there are more than 10 different adjectives. I've added a notation in the title of the plot that indicates that only 10 out of x total is showed. However, the shape of the round pie plot is misleading in this situation. Also, there are a lot of adjectives that are mentioned only once. Some of them make it to the plot and some don't. The adjectives are in order of appearance in the texts sorted by relavance_search(), so I guess the first ones could be considered somewhat more relevant?

Version 2: In the verb plot is more like the previous version, where the Other variable is shown when there are more than 10 different adjectives. However, if there are more than 30 adjectives, the pie plot is not readable anymore, so the figure will be generated as a bar plot, which then also shows the Other variable.

Here's a screenshot to demonstrate:

Screenshot 2021-03-06 at 18 46 04

Which one do we like more @saanahyttinen @TuukkaOT? Or do you have better suggestions? I'm really gutted this turned out to be such a problem at the last meters. 😞

kanerv commented 3 years ago

OR should we just ignore the whole situation of labels getting messy? That would be the most truthful representation I guess...

saanahyttinen commented 3 years ago

Awesome that you found the bug and fixed it!! I also think the pie plots look nicer than the bar plot in this case, so good choice. :)