This commit implements idea 1.
While working on this, we learned one potential reason why fixing the threshold at 0.5 might not be optimal.
Given the cross-entropy loss we use, the cost gets really bad when a prediction is really far from the correct answer, and it shrinks as the prediction gets closer. Thus, gradient descent will always be focused on pulling the really bad guesses closer, not on making the okay guesses closer.
How low the costs get doesn't really matter when it comes time to discretize the answer, since a low cost doesn't guarantee that the prediction exceeds the threshold.
Let's say our model predicts a bunch of values close to 0.5. If there are some egregious predictions that are WAY OFF, gradient descent will focus on those egregious examples and their losses will decrease dramatically, while the close-to-0.5 examples get ignored. At the end of the day, neither group ends up on the correct side of the threshold.
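To make that concrete, here's a toy illustration (an assumed setup, not code from this repo) of how binary cross-entropy weights a borderline prediction against an egregiously wrong one:

```python
import torch
import torch.nn.functional as F

# Two predictions whose true label is 1: one just below the 0.5 threshold,
# one egregiously wrong.
labels = torch.tensor([1.0, 1.0])
predictions = torch.tensor([0.45, 0.01])

losses = F.binary_cross_entropy(predictions, labels, reduction='none')
print(losses)  # tensor([0.7985, 4.6052])
# The egregious example dominates the total loss, so gradient descent mostly
# works on it, even though the 0.45 prediction is the one sitting just below
# the threshold.
```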
We picked the threshold for each topic / class to be the midpoint between the arithmetic mean of our model's outputs on the intended-to-be-positive examples and the arithmetic mean of its outputs on the intended-to-be-negative examples.
This yielded approximately a 0.025 improvement in our F1 score.
We were able to get a better F1 score while having a bigger loss than before, which strongly hints that the issue we posed above really was hurting us.
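For reference, a minimal sketch of what that per-class midpoint threshold might look like (the function name and tensor shapes are my assumptions, not the repo's actual code):

```python
import torch

def midpoint_thresholds(predictions: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # predictions, labels: (num_examples, num_classes); labels are 0/1.
    positive_mask = labels.bool()
    # Mean model output over the intended-to-be-positive examples, per class.
    positive_mean = (predictions * positive_mask).sum(dim=0) / positive_mask.sum(dim=0).clamp(min=1)
    # Mean model output over the intended-to-be-negative examples, per class.
    negative_mean = (predictions * ~positive_mask).sum(dim=0) / (~positive_mask).sum(dim=0).clamp(min=1)
    # Midpoint between the two per-class means.
    return (positive_mean + negative_mean) / 2
```

Discretizing then becomes `predictions > midpoint_thresholds(...)` instead of `torch.round(predictions)`.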
FUTURE WORK:
Our results are below.
Our F1 metric was not implemented correctly. We got the definitions of recall and precision switched (this leads to the same F1 behavior though; this isn't the egregious part), and we also summed along the wrong dimension (this is the egregious part). The F1 numbers we've been using to measure everything so far were off by two orders of magnitude. Our actual F1 scores are TERRIBLE.
Our F1 threshold optimization was implemented incorrectly too, since it summed along the wrong dimension as well.
Luckily, the threshold optimization still proved to improve performance! However, an F1 increase of about 0.00025 on an F1 score around 0.004 (which is about what we're getting) isn't a great improvement; it might fall within the noise. It still seems like a good idea, though.
We added a bunch of assert statements to make the dimension checking more explicit.
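Here's a minimal sketch of the corrected per-class F1 with explicit dimension asserts (names and shapes are assumed for illustration; this isn't the repo's exact code):

```python
import torch

def per_class_f1(predictions: torch.Tensor, labels: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # predictions, labels: (batch_size, num_classes); labels are 0/1 floats.
    assert predictions.dim() == 2
    assert predictions.shape == labels.shape
    predicted_positive = (predictions > threshold).float()
    # Sum over dim=0 (the batch) so precision and recall are per class;
    # summing over dim=1 was the bug described above.
    true_positive = (predicted_positive * labels).sum(dim=0)
    precision = true_positive / predicted_positive.sum(dim=0).clamp(min=1)
    recall = true_positive / labels.sum(dim=0).clamp(min=1)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)
    assert f1.shape == (labels.shape[1],)
    return f1
```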
Here are the new results we got:
We just reinvestigated this with the convolutional network.
These were our results:
It performs ever so slightly worse.
BCE + Soft F1 doesn't seem to give a noticeable improvement either.
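For context, "BCE + Soft F1" here presumably means something like the following sketch (the exact formulation and the equal weighting of the two terms are my assumptions):

```python
import torch
import torch.nn.functional as F

def bce_plus_soft_f1_loss(predictions: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Soft F1 uses the raw probabilities in place of hard 0/1 decisions,
    # which makes the F1 term differentiable.
    true_positive = (predictions * labels).sum(dim=0)
    false_positive = (predictions * (1 - labels)).sum(dim=0)
    false_negative = ((1 - predictions) * labels).sum(dim=0)
    soft_f1 = 2 * true_positive / (2 * true_positive + false_positive + false_negative + 1e-8)
    # Add the mean per-class soft F1 shortfall to the usual BCE term.
    return F.binary_cross_entropy(predictions, labels) + (1 - soft_f1).mean()
```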
Let's just hold off on this idea for now.
This was discovered while working on #19.
We blindly assume that a 0.5 threshold (which we get from `torch.round`) is a good threshold. Let's not assume that (since it's quite presumptuous) and see if we can change the threshold to get an optimal result.
The two ideas we had so far:

1. Pick the threshold based on the model's outputs instead of blindly using 0.5.
2. Add a penalty like `(1+(raw_result-0.5))**2` or `exp(1+(raw_result-0.5))` to the cost (see the sketch below). Let's try this idea first if we go this route.

TODO:
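If we go the penalty route, a literal reading of the first formula might look like this sketch (assuming `raw_result` is the model's sigmoid output and that the penalty is simply added to the existing cost; neither detail is spelled out above):

```python
import torch
import torch.nn.functional as F

def bce_with_penalty(raw_result: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    bce = F.binary_cross_entropy(raw_result, labels)
    # First variant: add (1 + (raw_result - 0.5))**2 to the cost;
    # exp(1 + (raw_result - 0.5)) is the other variant mentioned above.
    penalty = ((1 + (raw_result - 0.5)) ** 2).mean()
    return bce + penalty
```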