Train in a custom dataset

openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"

https://arxiv.org/abs/1802.01021

Train in a custom dataset #55

Closed iuria21 closed 4 years ago

iuria21 commented 4 years ago

Hi, first of all, thanks for the repo and the docs. I want to use the library for the following case and would like to know whether you think it could work: I have legal text with labeled references to laws, along with the codes for those laws. I want to train the model to link each reference to a law with its code.

I understand that I'll have to create a dataset like:

{"id": "doc1",
 "text": ".. as the Mortgage Law says ...",
 "links": [{"start": 6,
            "stop": 17,
            "target": "LW101"}]}

Would it be possible to train with only these features?

Thanks again!
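For reference, a record in the shape proposed above could be built programmatically so that the offsets always agree with the text. This is only a minimal sketch: the field names (`id`, `text`, `links`, `start`, `stop`, `target`) come from the example in this comment, not from DeepType's own training format. The helper `make_record` is hypothetical, and treating `stop` as an exclusive offset is an assumption.

```python
import json


def make_record(doc_id, text, mentions):
    """Build one record; `mentions` is a list of (surface, target) pairs.

    Hypothetical helper: offsets are computed by searching the text left to
    right, and `stop` is assumed to be exclusive (text[start:stop] == surface).
    """
    links = []
    cursor = 0
    for surface, target in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"mention {surface!r} not found in text")
        stop = start + len(surface)
        links.append({"start": start, "stop": stop, "target": target})
        cursor = stop  # keep searching after the previous mention
    return {"id": doc_id, "text": text, "links": links}


record = make_record(
    "doc1",
    ".. as the Mortgage Law says ...",
    [("Mortgage Law", "LW101")],
)
print(json.dumps(record))
```

Computing the offsets from the surface string, rather than typing them by hand, avoids spans that cut a mention short.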

StudyExchange commented 4 years ago

Can this project now be trained on the standard datasets, CoNLL (YAGO) and the TAC KBP 2010 challenge?

JonathanRaiman commented 4 years ago

@basque21 are you trying to create a model trained just for a subset of the possible entities? Currently DeepType first classifies mentions among "types" (e.g. person, animal, etc.), and you can swap that out for a different family of types. Disambiguation then uses the strings in the text to find matches in a KB (stored as a trie). If you only want to match against "Mortgage Law", for instance, then making sure the KB contains just Mortgage Law is the way to go.
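To illustrate the lookup step described above, here is a minimal sketch of a character trie holding a restricted KB, which is then scanned against a text. It is not DeepType's actual implementation or API: the class `MentionTrie` and its methods are hypothetical names, matching is exact and case-sensitive, and the real system also applies the type classifier, which is omitted here.

```python
_END = object()  # sentinel key marking the end of a KB surface string


class MentionTrie:
    """Hypothetical character trie mapping surface strings to entity ids."""

    def __init__(self):
        self.root = {}

    def add(self, surface, entity_id):
        node = self.root
        for ch in surface:
            node = node.setdefault(ch, {})
        node.setdefault(_END, set()).add(entity_id)

    def find_all(self, text):
        """Return (start, stop, entity_ids) for every KB string found in text."""
        matches = []
        for start in range(len(text)):
            node = self.root
            for pos in range(start, len(text)):
                node = node.get(text[pos])
                if node is None:
                    break  # no KB string continues with this character
                if _END in node:
                    matches.append((start, pos + 1, sorted(node[_END])))
        return matches


# A KB restricted to a single entity, as suggested in the comment above.
trie = MentionTrie()
trie.add("Mortgage Law", "LW101")
matches = trie.find_all(".. as the Mortgage Law says ...")
print(matches)
```

Because only strings present in the trie can match, restricting the KB restricts the set of entities that can ever be linked.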

JonathanRaiman commented 4 years ago

Can this project now be trained on the standard datasets, CoNLL (YAGO) and the TAC KBP 2010 challenge?

Currently DeepType is trained without a particular dataset (just using Wikipedia), and then you run it on the evaluation sets for TAC KBP and others. This is how the paper results were obtained. Not sure if that was your question?

iuria21 commented 4 years ago

@JonathanRaiman hi, I just realized that this model doesn't fit my problem, since my disambiguation can't be done by looking up strings from the text in the KB, so I've changed my approach.