greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Uploading final disc results #85

Closed danich1 closed 5 years ago

danich1 commented 5 years ago

This PR contains the final results for the discriminator model. Not much code review needed. Just take a look at the figures generated. Let me know what you think.

Side Note: Data Files for this pull request will show up in a new one. Don't want to overwhelm you with files.

danich1 commented 5 years ago
  1. The discriminator model uses the output of the generative model to make sentence classifications. In other words, the generative model gives each sentence a confidence score: the likelihood that it mentions a relationship. The discriminator model then adds sentence features on top of the generative model's output to theoretically improve the final confidence score. In a perfect world the discriminator should outperform the generator, but as you can see this process is quite messy.
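To make that two-stage idea concrete, here is a minimal sketch of the pattern (not this repo's actual training code): the generative model's per-sentence confidence scores act as noisy labels, and a feature-based classifier is trained on top of them, weighting each sentence by how confident the generative model was. The data, scores, and the use of scikit-learn are all stand-in assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 200 sentences with 50 text-derived features,
# plus a generative-model confidence score per sentence (the probability
# that the sentence mentions a relationship).
X = rng.normal(size=(200, 50))
gen_marginals = np.clip(rng.beta(2, 2, size=200) + 0.3 * (X[:, 0] > 0), 0, 1)

# Discriminative step: round marginals into provisional labels, but weight
# each sentence by how far the generative model's score is from 0.5, so
# sentences the generative model was unsure about contribute less.
labels = (gen_marginals > 0.5).astype(int)
confidence = np.abs(2 * gen_marginals - 1)
disc = LogisticRegression(max_iter=1000).fit(X, labels, sample_weight=confidence)

# New confidence scores that combine the generative signal with features.
disc_probs = disc.predict_proba(X)[:, 1]
```

The hope is that `disc_probs` improves on `gen_marginals` by exploiting features the generative model never saw; the messy reality described above is that this does not always happen.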

  2. That's a really great question. I'm not sure of the concrete answer, but my main hunch is that class imbalance plays an important role in performance. CbG and GiG have less than 10% of their sentences labeled positive, while DaG has about a 50-50 split. Another pitfall is that the evaluation sets could use more labeled examples. Ideally I should have about 1k sentences labeled for each relationship, but given the circumstances that is not a trivial thing to do. Lastly, some relationships are easier to predict than others: it could be that DaG and CtD are easier to detect than GiG and CbG.
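The class-imbalance hunch can be illustrated with a quick back-of-the-envelope calculation: for a classifier with a fixed sensitivity and specificity, precision drops sharply as the positive rate falls from 50% toward 10%. The 0.8 sensitivity/specificity figures below are illustrative assumptions, not measured values from these models.

```python
# Precision at a fixed operating point under different base rates,
# showing why a ~10% positive rate (CbG, GiG) is harder than a
# 50-50 split (DaG), all else being equal.
def precision(prevalence, sensitivity=0.8, specificity=0.8):
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

balanced = precision(0.5)    # ~0.80 precision at a 50-50 split
imbalanced = precision(0.1)  # ~0.31 precision at 10% positives
```

Same classifier, very different precision, purely because false positives pile up when negatives dominate the pool.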