I'm having a lot of trouble just reading in my data. After running the articles through the dependency parser, I need to read them back in; currently, I am using Knowtator for this. The dependency parser (SyntaxNet trained on CRAFT) generates output in CoNLL-U format, which needs to be aligned with the reference text since CoNLL-U does not record token locations. The added challenge is that the dependency parser also replaces some tokens with different characters. Examples are below; a sketch of one possible alignment approach follows the table.
Original | Replacement | Reason
---|---|---
`(` | `-LRB-` | Parenthesis
`)` | `-RRB-` | Parenthesis
`{` | `-LRB-` | Brace
`}` | `-RRB-` | Brace
`[` | `-LRB-` | Bracket
`]` | `-RRB-` | Bracket
`…` | `...` | Unicode ellipsis expanded to three characters
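Since several originals collapse to the same replacement (e.g. `(`, `{`, and `[` all become `-LRB-`), alignment has to try each candidate original while scanning the reference text. A minimal sketch of that idea, with hypothetical names (`replacement->candidates` and `align-tokens` are not existing project code):

```clojure
(require '[clojure.string :as str])

;; Each parser replacement expands to the set of originals it could
;; stand for; any token not in this map just matches itself.
(def replacement->candidates
  {"-LRB-" ["(" "{" "["]
   "-RRB-" [")" "}" "]"]
   "..."   ["…" "..."]})

(defn align-tokens
  "Scans `text` left to right, matching each CoNLL-U token (or one of
  its candidate originals) and recording its character offsets."
  [text tokens]
  (loop [idx 0, tokens tokens, aligned []]
    (if-let [token (first tokens)]
      (let [candidates (get replacement->candidates token [token])
            ;; earliest occurrence of any candidate at or after idx
            [start match] (->> candidates
                               (keep (fn [c]
                                       (when-let [i (str/index-of text c idx)]
                                         [i c])))
                               (sort-by first)
                               first)]
        (if start
          (recur (+ start (count match))
                 (rest tokens)
                 (conj aligned {:token token
                                :start start
                                :end   (+ start (count match))}))
          (throw (ex-info "Token not found in reference text"
                          {:token token :idx idx}))))
      aligned)))
```

For example, `(align-tokens "A (B) …" ["A" "-LRB-" "B" "-RRB-" "..."])` recovers the offsets of every token, including the one-character `…` that the parser had expanded to `...`.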
I believe that I have resolved the issues with reading in the format described in https://github.com/tuh8888/Dep2Rel/issues/3#issuecomment-490941987
`:precision 0.2680434319778582 :recall 0.3679135008766803 :f1 0.3101367163443774`
:property | :tp | :fp | :fn | :tn | :precision | :recall | :f1
---|---|---|---|---|---|---|---
CPR:4 | 757 | 1422 | 892 | 12693 | 0.34740707 | 0.4590661 | 0.3955068
CPR:9 | 210 | 700 | 433 | 14421 | 0.23076923 | 0.32659408 | 0.2704443
CPR:5 | 47 | 227 | 139 | 15351 | 0.17153284 | 0.25268817 | 0.20434782
NONE | 9368 | 1699 | 2974 | 1723 | 0.84648055 | 0.7590342 | 0.80037594
CPR:6 | 126 | 409 | 159 | 15070 | 0.23551401 | 0.44210526 | 0.30731708
CPR:3 | 119 | 680 | 540 | 14425 | 0.14893617 | 0.18057664 | 0.16323732
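The precision/recall/F1 columns follow from the count columns by the usual definitions, so a quick sanity check is easy (standalone snippet, not project code):

```clojure
(defn metrics
  "precision = tp/(tp+fp), recall = tp/(tp+fn), f1 = 2pr/(p+r)."
  [{:keys [tp fp] false-neg :fn}]
  (let [p (/ tp (+ tp fp))
        r (/ tp (+ tp false-neg))]
    {:precision (double p)
     :recall    (double r)
     :f1        (double (/ (* 2 p r) (+ p r)))}))

;; The CPR:4 row above:
(metrics {:tp 757 :fp 1422 :fn 892})
;; => {:precision 0.3474..., :recall 0.4590..., :f1 0.3955...}
```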
`{:context-path-length-cap 100 :context-thresh 0.9 :cluster-thresh 0.95 :min-match-support 0 :max-iterations 100 :max-matches 3000 :re-clustering? true}`
Note that the best baseline results from the competition (found here, on page 86) are: `:precision 0.4544 :recall 0.5387 :f1 0.3729`
@LEHunter
With only 100 samples:

`:precision 0.14155157012299868, :recall 0.6019871420222093, :f1 0.22920723226703754`

`{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :min-pattern-support 1, :max-iterations 100, :cluster-thresh 0.75, :max-matches 5000, :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 100 :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x77c68044 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@77c68044"]}`
EDIT: This actually doesn't count because the sentences with context paths out of range were excluded from evaluation, artificially boosting the score.
Beat the SotA!
`:precision 0.31506377111340916 :recall 0.4913978494623656 :f1 0.38395295106070154`

`{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :context-path-length-min 3, :min-pattern-support 0, :max-iterations 0, :cluster-thresh 0.7, :max-matches 5000, :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x254c8de8 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@254c8de8"], :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 6}`
I need to find a way to keep track of which parameter combinations I use. The comment above has its own commit, but it would be better to output them as data alongside the results if possible (see the sketch below).
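One low-effort option would be to append every run's parameter map and scores to an EDN log instead of committing them; a minimal sketch, assuming a hypothetical `log-run!` helper and file name:

```clojure
(require '[clojure.java.io :as io])

(defn log-run!
  "Appends one {:params ... :metrics ...} EDN form per line."
  [file params metrics]
  (with-open [w (io/writer file :append true)]
    (binding [*out* w]
      (prn {:params params :metrics metrics}))))

(log-run! "runs.edn"
          {:context-path-length-cap 100 :context-thresh 0.9
           :cluster-thresh 0.95 :max-iterations 100}
          {:precision 0.2680 :recall 0.3679 :f1 0.3101})
```

Each line can then be read back with `clojure.edn/read-string` and compared across runs. One caveat: `:match-fn` values print as `#object[...]`, which is not readable EDN, so function parameters would need to be logged as symbols or keywords instead.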
I just realized that I had been using the wrong table for the state-of-the-art performance on this task. The actual state-of-the-art metrics are precision: 0.7437, recall: 0.6784, f-score: 0.6410.
Also, there are some mistakes in the gold_standard file: 35 lines have relations that differ from the relations_gs file. For example, this line in the gold_standard file:

`1978244 | CPR:6 | Arg1:T21 | Arg2:T35`

lists CPR:6 when the relation is actually DIRECT_REGULATOR (a CPR:2 relation).
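To enumerate all 35, something like the following could diff the two files, keyed on (pmid, arg1, arg2). This is a rough sketch that assumes both files are tab-separated with the four columns shown above; the actual column layout of relations_gs may differ:

```clojure
(require '[clojure.string :as str])

(defn relations-by-key
  "Maps [pmid arg1 arg2] -> relation for a 4-column TSV file."
  [file]
  (into {}
        (for [line (str/split-lines (slurp file))
              :let [[pmid rel arg1 arg2] (str/split line #"\t")]]
          [[pmid arg1 arg2] rel])))

(defn disagreements
  "Entries whose relation differs between the two files."
  [gold-file gs-file]
  (let [gold (relations-by-key gold-file)
        gs   (relations-by-key gs-file)]
    (for [[k rel] gold
          :let [other (get gs k)]
          :when (and other (not= rel other))]
      {:key k :gold-standard rel :relations-gs other})))
```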
BioCreative 2017 had a subtask (Track 5) involving extracting chemical-protein interactions from text.
The conference proceedings (Track 5 starts on p. 141).
Needed Scripts: Write a script that does the following:

- Inputs:
- Action:
- Output: what is the desired output?