I'm having a lot of trouble just reading in my data. After running the articles through the dependency parser, I need to read them back in; currently, I am using Knowtator for this. The dependency parser (SyntaxNet trained on CRAFT) generates output in CoNLL-U format, which needs to be aligned with the reference text since CoNLL-U does not record token locations. The added challenge is that the dependency parser also replaces some tokens with different characters. Examples are below; a sketch of one possible alignment approach follows the table.
Original | Replacement | Reason
---|---|---
`(` | `-LRB-` | Parenthesis
`)` | `-RRB-` | Parenthesis
`{` | `-LRB-` | Brace
`}` | `-RRB-` | Brace
`[` | `-LRB-` | Bracket
`]` | `-RRB-` | Bracket
`…` | `...` | Unicode ellipsis expanded to three characters
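Since several originals collapse to the same replacement (e.g. `(`, `{`, and `[` all become `-LRB-`), alignment has to try each candidate original while scanning the reference text. A minimal sketch of that idea, with hypothetical names (`replacement->candidates` and `align-tokens` are not existing project code):

```clojure
(require '[clojure.string :as str])

;; Each parser replacement expands to the set of originals it could
;; stand for; any token not in this map just matches itself.
(def replacement->candidates
  {"-LRB-" ["(" "{" "["]
   "-RRB-" [")" "}" "]"]
   "..."   ["…" "..."]})

(defn align-tokens
  "Scans `text` left to right, matching each CoNLL-U token (or one of
  its candidate originals) and recording its character offsets."
  [text tokens]
  (loop [idx 0, tokens tokens, aligned []]
    (if-let [token (first tokens)]
      (let [candidates (get replacement->candidates token [token])
            ;; earliest occurrence of any candidate at or after idx
            [start match] (->> candidates
                               (keep (fn [c]
                                       (when-let [i (str/index-of text c idx)]
                                         [i c])))
                               (sort-by first)
                               first)]
        (if start
          (recur (+ start (count match))
                 (rest tokens)
                 (conj aligned {:token token
                                :start start
                                :end   (+ start (count match))}))
          (throw (ex-info "Token not found in reference text"
                          {:token token :idx idx}))))
      aligned)))
```

For example, `(align-tokens "A (B) …" ["A" "-LRB-" "B" "-RRB-" "..."])` recovers the offsets of every token, including the one-character `…` that the parser had expanded to `...`.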
I believe that I have resolved the issues with reading in the format described in https://github.com/tuh8888/Dep2Rel/issues/3#issuecomment-490941987
`:precision 0.2680434319778582 :recall 0.3679135008766803 :f1 0.3101367163443774`
:property | :tp | :fp | :fn | :tn | :precision | :recall | :f1
---|---|---|---|---|---|---|---
CPR:4 | 757 | 1422 | 892 | 12693 | 0.34740707 | 0.4590661 | 0.3955068
CPR:9 | 210 | 700 | 433 | 14421 | 0.23076923 | 0.32659408 | 0.2704443
CPR:5 | 47 | 227 | 139 | 15351 | 0.17153284 | 0.25268817 | 0.20434782
NONE | 9368 | 1699 | 2974 | 1723 | 0.84648055 | 0.7590342 | 0.80037594
CPR:6 | 126 | 409 | 159 | 15070 | 0.23551401 | 0.44210526 | 0.30731708
CPR:3 | 119 | 680 | 540 | 14425 | 0.14893617 | 0.18057664 | 0.16323732
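The precision/recall/F1 columns follow from the count columns by the usual definitions, so a quick sanity check is easy (standalone snippet, not project code):

```clojure
(defn metrics
  "precision = tp/(tp+fp), recall = tp/(tp+fn), f1 = 2pr/(p+r)."
  [{:keys [tp fp] false-neg :fn}]
  (let [p (/ tp (+ tp fp))
        r (/ tp (+ tp false-neg))]
    {:precision (double p)
     :recall    (double r)
     :f1        (double (/ (* 2 p r) (+ p r)))}))

;; The CPR:4 row above:
(metrics {:tp 757 :fp 1422 :fn 892})
;; => {:precision 0.3474..., :recall 0.4590..., :f1 0.3955...}
```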
`{:context-path-length-cap 100 :context-thresh 0.9 :cluster-thresh 0.95 :min-match-support 0 :max-iterations 100 :max-matches 3000 :re-clustering? true}`
Note that the best baseline results from the competition (found here, on page 86) are: `:precision 0.4544 :recall 0.5387 :f1 0.3729`
@LEHunter
With only 100 samples:

`:precision 0.14155157012299868, :recall 0.6019871420222093, :f1 0.22920723226703754`

`{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :min-pattern-support 1, :max-iterations 100, :cluster-thresh 0.75, :max-matches 5000, :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 100 :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x77c68044 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@77c68044"]}`
EDIT: This actually doesn't count because the sentences with context paths out of range were excluded from evaluation, artificially boosting the score.
Beat the SotA!
`:precision 0.31506377111340916 :recall 0.4913978494623656 :f1 0.38395295106070154`

`{:rng 0.022894, :seed-frac 1, :match-thresh 0.5, :context-path-length-min 3, :min-pattern-support 0, :max-iterations 0, :cluster-thresh 0.7, :max-matches 5000, :match-fn #object[edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match 0x254c8de8 "edu.ucdenver.ccp.nlp.relation_extraction$support_weighted_sim_distribution_context_match@254c8de8"], :confidence-thresh 0, :re-clustering? true, :context-path-length-cap 6}`
I need to find a way to keep track of which parameter combinations I use. The comment above has its own commit, but it would be better to output them as data alongside the results if possible (see the sketch below).
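One low-effort option would be to append every run's parameter map and scores to an EDN log instead of committing them; a minimal sketch, assuming a hypothetical `log-run!` helper and file name:

```clojure
(require '[clojure.java.io :as io])

(defn log-run!
  "Appends one {:params ... :metrics ...} EDN form per line."
  [file params metrics]
  (with-open [w (io/writer file :append true)]
    (binding [*out* w]
      (prn {:params params :metrics metrics}))))

(log-run! "runs.edn"
          {:context-path-length-cap 100 :context-thresh 0.9
           :cluster-thresh 0.95 :max-iterations 100}
          {:precision 0.2680 :recall 0.3679 :f1 0.3101})
```

Each line can then be read back with `clojure.edn/read-string` and compared across runs. One caveat: `:match-fn` values print as `#object[...]`, which is not readable EDN, so function parameters would need to be logged as symbols or keywords instead.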
I just realized that I had been using the wrong table for the state-of-the-art performance on this task. The actual state-of-the-art metrics are precision: 0.7437, recall: 0.6784, f-score: 0.6410.
Also, there are some mistakes in the gold_standard file: 35 lines have relations that differ from the relations_gs file. For example, this line in the gold_standard file:

`1978244 | CPR:6 | Arg1:T21 | Arg2:T35`

lists CPR:6 when the relation is actually DIRECT_REGULATOR (a CPR:2 relation).
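To enumerate all 35, something like the following could diff the two files, keyed on (pmid, arg1, arg2). This is a rough sketch that assumes both files are tab-separated with the four columns shown above; the actual column layout of relations_gs may differ:

```clojure
(require '[clojure.string :as str])

(defn relations-by-key
  "Maps [pmid arg1 arg2] -> relation for a 4-column TSV file."
  [file]
  (into {}
        (for [line (str/split-lines (slurp file))
              :let [[pmid rel arg1 arg2] (str/split line #"\t")]]
          [[pmid arg1 arg2] rel])))

(defn disagreements
  "Entries whose relation differs between the two files."
  [gold-file gs-file]
  (let [gold (relations-by-key gold-file)
        gs   (relations-by-key gs-file)]
    (for [[k rel] gold
          :let [other (get gs k)]
          :when (and other (not= rel other))]
      {:key k :gold-standard rel :relations-gs other})))
```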
BioCreative 2017 had a subtask (Track 5) involving extracting chemical-protein interactions from text.
The conference proceedings (Track 5 starts on p. 141).
Needed Scripts: Write a script that does the following:

- Inputs:
- Action:
- Output: what is the desired output?