openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021

testing model on a single sentence #15

Closed karimvision closed 6 years ago

karimvision commented 6 years ago

Hello! It would be great if you could post example code that runs the model on the example mentioned in the blog post (see below), both with and without types. That would be really helpful, thanks! I already have the model trained :)

The man saw a Jaguar.

with types: Jaguar Cars 0.70, jaguar 0.12

without types: Jaguar Cars 0.60, jaguar 0.29

karimvision commented 6 years ago

The SequenceTagger class in train_type seems to provide the probabilities, but example code from you would still be really helpful :)

JonathanRaiman commented 6 years ago

Super supportive of this idea :) I will try to put together a Jupyter notebook example to make this easier to do. Happy to accept a PR that can run the model you've trained, too!

karimvision commented 6 years ago

@JonathanRaiman hey, thanks! i'm working on the PR right now :)

karimvision commented 6 years ago

@JonathanRaiman The model is too big to upload here, so I created PR https://github.com/openai/deeptype/pull/18, which has the code to get type probabilities for an example sentence. I would really appreciate it if you could try it with your model and let me know your feedback and suggested improvements. Thanks!

JonathanRaiman commented 6 years ago

Ok will do

karimvision commented 6 years ago

@JonathanRaiman Some more questions, sorry for the trouble. It would be useful to know how you implemented the "playing 20 questions" idea mentioned in the paper. Also, regarding the types in the examples (figure 1 of the paper), it would be great to know how you came up with them. The types I'm using came from evolve_type_system and the classifiers in the extraction/classifiers folder. Apologies if I'm being repetitive, thanks!

JonathanRaiman commented 6 years ago

@karimvision Thank you for your interest. Just confirmed the notebook works for me too.

Concerning the types that are used: there are a couple of human-designed classes that are chosen manually (e.g. those in the classifiers folder), the same ones you loaded in the notebook. There is an additional dimension focused on topic, which you can create by choosing coarse-granularity domains like sports, entertainment, politics, etc.

You can also separately run evolve_type_classifiers and use the resulting binary types to do disambiguation. (I didn't try mixing and matching human + evolved types, but hopefully with tuning they could be complementary. :)

The scoring process is found in equation (6) from https://arxiv.org/abs/1802.01021, and can be written in Python as:

import marisa_trie
import numpy as np
from os.path import join

from wikidata_linker_utils.offset_array import OffsetArray

language_path = "/path/to/en_trie"

trie_index2indices_values = OffsetArray.load(
    join(language_path, "trie_index2indices")
)
trie_index2indices_counts = OffsetArray(
    np.load(join(language_path, "trie_index2indices_counts.npy")),
    trie_index2indices_values.offsets
)

trie = marisa_trie.Trie().load(
    join(language_path, "trie.marisa")
)

# now do scoring of an entity using the intra-wiki links:

# keep only items that are more than 1% likely
min_prob = 0.01
anchor = trie.get("jaguar")
indices = trie_index2indices_values[anchor]
link_probs = trie_index2indices_counts[anchor]
link_probs = link_probs / link_probs.sum()
mask = link_probs > min_prob
indices = indices[mask]
# these probs contain likely guesses, but do not reflect the context
link_probs = link_probs[mask]

# now pick a smoothing value (e.g. bayesian view that the type_belief model should be ignored)
alpha_type_belief = 0.5
# pick a smoothing value that all types should be ignored:
beta = 0.99
# this is the get_probs function from the notebook you added:
model_probs = model(sentence)
# extract the probs at the location of interest, e.g.
token_location = sentence.find("jaguar")
# something like this for each type dimension,
# where model_probs is a dict with numpy arrays as values that
# have shape Time x Batch x Number of type classes:
type_belief = model_probs["type"][token_location, 0, :]

# recover the assignment for each index you care about for type
type_oracle = load_oracle_classification("/path/to/exported/type_classification")
assignments = type_oracle.classify(indices)
type_probs = type_belief[assignments]
type_probs = alpha_type_belief * type_probs + (1.0 - type_probs)

# repeat this process for all other type dimensions you have access to, e.g. location, time, etc.
full_score = link_probs * (1.0 - beta + beta * type_probs)

index = full_score.argmax()
top_pick = indices[index]
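If you want to inspect the full candidate ranking rather than just the argmax, the last two lines can be swapped for a sort over full_score. A self-contained sketch with made-up numbers (in the real pipeline, indices, link_probs and type_probs come from the snippet above):

```python
import numpy as np

# made-up stand-ins for the arrays computed above
indices = np.array([101, 202, 303])       # candidate entity ids
link_probs = np.array([0.60, 0.29, 0.11]) # context-free link priors
type_probs = np.array([0.95, 0.55, 0.75]) # smoothed type beliefs
beta = 0.99

# equation (6): rescale the link prior by the (smoothed) type belief
full_score = link_probs * (1.0 - beta + beta * type_probs)

order = np.argsort(-full_score)  # descending by score
ranked = list(zip(indices[order], full_score[order]))
top_pick = ranked[0][0]
```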

karimvision commented 6 years ago

@JonathanRaiman Thank you so much for the feedback and the example code. I will try it and let you know :)

karimvision commented 6 years ago

@JonathanRaiman I tried the code above and I think there is a small bug. I fixed it by changing

type_probs = alpha_type_belief * type_probs + (1.0 - type_probs)

to

type_probs = alpha_type_belief * type_probs + (1.0 - alpha_type_belief)

so that it correctly matches equation (6) in the paper.
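A quick numeric check of the two variants (using the same alpha_type_belief = 0.5 as in the snippet above) shows why the corrected line is the right one:

```python
import numpy as np

alpha = 0.5                    # alpha_type_belief
p = np.array([0.9, 0.1])       # model's type belief for two candidate entities

buggy = alpha * p + (1.0 - p)      # original line
fixed = alpha * p + (1.0 - alpha)  # corrected line

# the buggy form *decreases* as the belief increases, inverting the ranking;
# the fixed form interpolates between the belief and 1 with weight alpha
print(buggy)  # [0.55 0.95]
print(fixed)  # [0.95 0.55]
```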

(screenshot of equation (6) from the paper)

Please let me know if I'm wrong.

And just to confirm: the example code you gave takes the mention "jaguar" and its candidate entities, but not the context. So the final entity-linking process also takes the context words (in the sentence) into account, right? Is the logic in the example code the final step of entity linking?

Thanks!

JonathanRaiman commented 6 years ago

Yep, you are correct.

Concerning the example code: you take the full sentence/paragraph/document containing "jaguar", run the model on it, and extract the probabilities at the location of the token(s) you want to disambiguate. (Context words are only useful insofar as they inform the RNN of the context; the belief over types at other points in the document does not influence the prediction for "jaguar".)
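As a toy illustration of that extraction step (the model output here is fabricated random data, just matching the Time x Batch x Number-of-classes layout from the snippet above):

```python
import numpy as np

# fabricated model output: Time x Batch x Number of type classes
T, B, C = 5, 1, 3
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, B, C))
model_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

token_location = 3  # position of "jaguar" in the (tokenized) sentence
type_belief = model_probs[token_location, 0, :]  # only this slice is used

# beliefs at other positions never enter the score for "jaguar"
assert type_belief.shape == (C,)
assert np.isclose(type_belief.sum(), 1.0)
```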

ghost commented 6 years ago

Thanks @JonathanRaiman! I was able to compute full_score, but I still don't understand what indices refers to. How do I find the entity name (such as "Jaguar_Cars") from indices and top_pick?

ghost commented 6 years ago

I solved it myself. To find the entity name from indices, I used data/wikidata/wikidata_wikititle2wikidata.tsv like this:

from collections import defaultdict
from tqdm import tqdm

# map each wikidata index (second column) to the wiki titles (first column)
out = defaultdict(list)
with open("../data/wikidata/wikidata_wikititle2wikidata.tsv") as f:
    for line in tqdm(f):
        it = line.replace('\n', '').split('\t')
        out[int(it[1])].append(it[0])
out[top_pick]

Note: indices are defined by the second column of wikidata_wikititle2wikidata.tsv. I couldn't use wikititle2wikidata.marisa because of an encoding error.
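If you also want the reverse lookup (title → index), the same file can populate both maps in one pass. A sketch with a few fabricated rows standing in for the real TSV (the actual title strings may be formatted differently):

```python
from collections import defaultdict

# fabricated rows in the same "title<TAB>index" layout as the real file
rows = [
    "enwiki/Jaguar_Cars\t12345",
    "enwiki/Jaguar\t67890",
    "frwiki/Jaguar\t67890",
]

index2titles = defaultdict(list)  # index -> all titles pointing at it
title2index = {}                  # title -> index
for line in rows:
    title, idx = line.rstrip("\n").split("\t")
    index2titles[int(idx)].append(title)
    title2index[title] = int(idx)
```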

JonathanRaiman commented 6 years ago

Great! Happy to help with any other index related weirdness. Closing for now :)