sameersingh / mf

Apache License 2.0

Compute embeddings for TAC test data #3

Open lvilnis opened 10 years ago

lvilnis commented 10 years ago

This should be pretty much the same as getting embeddings for train data - is the issue that train data is distant supervision-y and aggregated across a bunch of instances?

lvilnis commented 10 years ago

Or is it that patterns aren't features and we have patterns for the test data?

sameersingh commented 10 years ago

The problem is that we're working in the inductive setting, so we don't have access to test data when we're learning the model. Thus, for "test entities", we don't have an embedding, and therefore, cannot predict for them directly.

What we do have at test time are the features and patterns. So, just so that we can compare against a fair universal schema baseline, we need to predict test relations. One option is to compute an embedding for each test entity somehow:

  1. Assume everything else is fixed, solve universal schema objective on each test entity
  2. Take average of the test pattern embeddings as the test entity embedding (same as above?)
  3. Take the test patterns, and see which relations are closest to their embedding:
    • find relation that is embedded closest to the mean of the test pattern embeddings
    • each pattern votes for one or two relations, and we pick the label with the most votes
  4. Others...?
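A minimal sketch of options 2 and 3, assuming the learned pattern and relation embeddings are available as dense vectors. All names here (`relation_emb`, `pattern_emb`, the toy patterns) are illustrative stand-ins, not APIs from this repo:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy stand-ins for learned embeddings; in practice these come from
# the trained universal schema model.
relation_emb = {"per:title": rng.normal(size=dim),
                "org:founded_by": rng.normal(size=dim)}
pattern_emb = {"X , chief executive of Y": rng.normal(size=dim),
               "X , founder of Y": rng.normal(size=dim)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_by_mean(test_patterns):
    """Options 2/3a: average the test pattern embeddings (a proxy for the
    test entity embedding), then pick the closest relation."""
    mean = np.mean([pattern_emb[p] for p in test_patterns], axis=0)
    return max(relation_emb, key=lambda r: cosine(relation_emb[r], mean))

def predict_by_vote(test_patterns, k=2):
    """Option 3b: each pattern votes for its k closest relations; return
    the label with the most votes."""
    votes = {}
    for p in test_patterns:
        ranked = sorted(relation_emb,
                        key=lambda r: cosine(relation_emb[r], pattern_emb[p]),
                        reverse=True)
        for r in ranked[:k]:
            votes[r] = votes.get(r, 0) + 1
    return max(votes, key=votes.get)
```

Option 1 (re-solving the objective per test entity, with everything else fixed) would instead run a few gradient steps on a fresh entity vector, which is more faithful but also more expensive.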

Once we have the data loaded in and the train embeddings learned, implementing the above approaches will be straightforward. We can then stream in the test sentences/entities and evaluate which of these works best.