tuh8888 / Dep2Rel

Relation extraction using word embeddings and dependency paths
GNU General Public License v3.0

Coding: BioCreative 2017 Track 5 relation extraction and evaluation #3

Closed tuh8888 closed 4 years ago

tuh8888 commented 5 years ago

BioCreative 2017 had a subtask (Track 5) involving extracting chemical-protein interactions from text.

Needed Scripts: Write a script that does the following:

- Inputs:
- Action:
- Output: what is the desired output?

tuh8888 commented 5 years ago

I'm having a lot of trouble just reading in my data. After running the articles through the dependency parser, I need to read them back in; currently I am using Knowtator for this. The dependency parser (SyntaxNet trained on CRAFT) produces output in CoNLL-U format, which has no token character offsets, so it has to be aligned with the reference text. The challenge is that the parser also replaces some tokens with different characters. Examples are below, and one possible alignment approach is sketched after the table.

| Original | Replacement | Reason |
|----------|-------------|--------|
| ( | -LRB- | Brace |
| ) | -RRB- | Brace |
| { | -LRB- | Brace |
| } | -RRB- | Brace |
| [ | -LRB- | Brace |
| ] | -RRB- | Brace |
| … | ... | Unicode ellipsis turned into three characters |
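One way to handle this (a minimal sketch with hypothetical helper names, not the reader code actually used here): map each replacement string back to its possible originals, then greedily scan the reference text for the next occurrence of any candidate to recover character offsets.

```clojure
;; Map the parser's replacement strings back to candidate originals.
(def replacement->originals
  {"-LRB-" ["(" "{" "["]
   "-RRB-" [")" "}" "]"]
   "..."   ["…" "..."]})

(defn align-tokens
  "Given the reference text and a seq of CoNLL-U token strings,
   return [token start end] triples by greedy left-to-right search."
  [text tokens]
  (loop [offset 0, tokens tokens, spans []]
    (if-let [token (first tokens)]
      (let [candidates (get replacement->originals token [token])
            ;; pick the candidate that occurs earliest at or after the current offset
            [s start]  (->> candidates
                            (keep (fn [c]
                                    (let [i (.indexOf ^String text ^String c (int offset))]
                                      (when-not (neg? i) [c i]))))
                            (sort-by second)
                            first)]
        (if s
          (recur (+ start (count s)) (rest tokens)
                 (conj spans [token start (+ start (count s))]))
          ;; token not found; skip it rather than fail the whole alignment
          (recur offset (rest tokens) spans)))
      spans)))
```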
tuh8888 commented 5 years ago

I believe that I have resolved the issues with reading in the format described in https://github.com/tuh8888/Dep2Rel/issues/3#issuecomment-490941987

tuh8888 commented 5 years ago

Results

Overall

:precision 0.2680434319778582 :recall 0.3679135008766803 :f1 0.3101367163443774

Per group

| :property | :precision | :recall | :f1 | :tp | :fp | :tn | :fn |
|-----------|------------|---------|-----|-----|-----|-----|-----|
| CPR:4 | 0.34740707 | 0.4590661 | 0.3955068 | 757 | 1422 | 12693 | 892 |
| CPR:9 | 0.23076923 | 0.32659408 | 0.2704443 | 210 | 700 | 14421 | 433 |
| CPR:5 | 0.17153284 | 0.25268817 | 0.20434782 | 47 | 227 | 15351 | 139 |
| NONE | 0.84648055 | 0.7590342 | 0.80037594 | 9368 | 1699 | 1723 | 2974 |
| CPR:6 | 0.23551401 | 0.44210526 | 0.30731708 | 126 | 409 | 15070 | 159 |
| CPR:3 | 0.14893617 | 0.18057664 | 0.16323732 | 119 | 680 | 14425 | 540 |
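For reference, the per-group numbers follow the standard confusion-matrix definitions; a minimal sketch (not the project's evaluation code) that reproduces a row from its tp/fp/fn counts:

```clojure
(defn metrics
  "Precision, recall, and F1 from confusion-matrix counts."
  [{tp :tp fp :fp fn* :fn}]
  (let [precision (/ tp (+ tp fp))
        recall    (/ tp (+ tp fn*))
        f1        (/ (* 2 precision recall) (+ precision recall))]
    {:precision (double precision) :recall (double recall) :f1 (double f1)}))

;; e.g. the CPR:4 row above:
(metrics {:tp 757 :fp 1422 :fn 892})
;; => {:precision ≈0.34740707, :recall ≈0.4590661, :f1 ≈0.3955068}
```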

Params

{:context-path-length-cap 100 :context-thresh 0.9 :cluster-thresh 0.95 :min-match-support 0 :max-iterations 100 :max-matches 3000 :re-clustering? true}

tuh8888 commented 5 years ago

Note that the best baseline results from the competition (found here, on page 86) are: :precision 0.4544 :recall 0.5387 :f1 0.3729

tuh8888 commented 5 years ago

@LEHunter

tuh8888 commented 5 years ago

Results

With only 100 samples

Overall

:precision 0.14155157012299868, :recall 0.6019871420222093, :f1 0.22920723226703754

Params

{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :min-pattern-support 1, :max-iterations 100, :cluster-thresh 0.75, :max-matches 5000, :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 100 :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x77c68044 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@77c68044"]}

tuh8888 commented 5 years ago

EDIT: This actually doesn't count, because sentences with context paths out of range were excluded from evaluation, which artificially boosts the score (see the sketch after the params below).

Results

Beat the SotA!

Overall

:precision 0.31506377111340916 :recall 0.4913978494623656 :f1 0.38395295106070154

Params

{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :context-path-length-min 3, :min-pattern-support 0, :max-iterations 0, :cluster-thresh 0.7, :max-matches 5000, :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x254c8de8 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@254c8de8"], :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 6}
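A rough illustration of the caveat in the EDIT above (hypothetical data shapes and helper names, not the repo's evaluation code): when sentences whose context paths fall outside the [:context-path-length-min, :context-path-length-cap] window are dropped before scoring, their gold relations disappear from the recall denominator, which inflates recall and F1.

```clojure
(defn in-range?
  "True when the sentence's context path length is within the configured window."
  [{:keys [context-path-length-min context-path-length-cap]} sentence]
  (<= context-path-length-min (count (:context sentence)) context-path-length-cap))

(defn recall-over-whole-set
  "Recall computed against *all* gold relations, not just the in-range ones.
   Dropping out-of-range sentences from the denominator is what boosts the score.
   Sentences are assumed to look like {:context [...], :gold \"CPR:4\", :predicted \"CPR:4\"}."
  [params sentences]
  (let [gold     (remove #(= "NONE" (:gold %)) sentences)
        in-range (filter (partial in-range? params) gold)
        tp       (count (filter #(= (:gold %) (:predicted %)) in-range))]
    (if (seq gold) (double (/ tp (count gold))) 0.0)))
```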

Notes

tuh8888 commented 5 years ago

I need to find a way to keep track of what pattern combinations I use. The above comment has its own commit, but it would be better to output as a variable if possible.
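One possible approach (a sketch, not something the repo does yet): append each run's params map and metrics to an EDN log so every result stays tied to its configuration without needing a separate commit.

```clojure
(require '[clojure.pprint :as pp])

(defn log-run!
  "Append the params map and the resulting metrics to runs.edn,
   keyed by a timestamp, so each experiment is reproducible."
  [params results]
  (let [entry {:timestamp (str (java.time.Instant/now))
               :params    params
               :results   results}]
    (spit "runs.edn" (with-out-str (pp/pprint entry)) :append true)
    entry))
```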

tuh8888 commented 5 years ago

I just realized that I had been using the wrong table for the state-of-the-art performance on this task. The actual state-of-the-art metrics are precision: 0.7437, recall: 0.6784, f-score: 0.6410

Also, there are some mistakes in the gold_standard file: 35 entries have relations that differ from the relations_gs file. For example, this line in the gold_standard file:

`1978244 | CPR:6 | Arg1:T21 | Arg2:T35`

when the relation is actually DIRECT_REGULATOR (a CPR:2 relation).
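A quick way to surface these disagreements (a sketch; the tab-separated column layout assumed below may need adjusting to the actual files): index both files by [pmid arg1 arg2] and report rows whose relation labels differ.

```clojure
(require '[clojure.string :as str])

(defn index-relations
  "Map [pmid arg1 arg2] -> CPR group. Column positions are assumptions."
  [path]
  (->> (str/split-lines (slurp path))
       (map #(str/split % #"\t"))
       (map (fn [[pmid group arg1 arg2]] [[pmid arg1 arg2] group]))
       (into {})))

(defn disagreements
  "Rows present in both files whose relation labels disagree."
  [gold-path relations-path]
  (let [gold (index-relations gold-path)
        rels (index-relations relations-path)]
    (for [[k g] gold
          :let [r (get rels k)]
          :when (and r (not= g r))]
      {:key k :gold g :relations-gs r})))
```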

tuh8888 commented 5 years ago

The conference proceedings: Track 5 starts on p. 141.