nlplab / nersuite

http://nersuite.nlplab.org/

Duplicate IDs in standoff output #9

Closed spyysalo closed 12 years ago

spyysalo commented 12 years ago

When run with the -o standoff option, NERsuite output contains duplicate IDs (within a single input document). For example (for an AnEM model):

$ cut -f 2- featurized/multiclass-withmm/test/AnEM.test | nersuite tag -o standoff -m models/test.multiclass.withmm.model | head
32  48  entity_name id="entity-1" type="Pathological_formation"
68  84  entity_name id="entity-2" type="Pathological_formation"
226 231 entity_name id="entity-3" type="Pathological_formation"
378 389 entity_name id="entity-1" type="Cell"
408 413 entity_name id="entity-4" type="Pathological_formation"
429 445 entity_name id="entity-1" type="Multi-tissue_structure"
450 461 entity_name id="entity-2" type="Multi-tissue_structure"
[...]

Entity IDs should preferably be unique for each input document.

priancho commented 12 years ago

Currently, entity IDs are managed separately for each semantic type (just a C++ map container :-)

The output above shows that you used the "standoff" output option, not the "brat" option. Do you think unique IDs regardless of semantic type are necessary only for the "brat" option, or for all output formats?
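To illustrate why one counter per semantic type yields the duplicates reported above, here is a minimal Python sketch. It is not the NERsuite implementation (which is C++); the function names `per_type_ids` and `global_ids` are hypothetical, and the type sequence is copied from the seven output lines in the report.

```python
from collections import defaultdict


def per_type_ids(types):
    """Assign IDs with one counter per semantic type (the behaviour described above)."""
    counters = defaultdict(int)  # stands in for the per-type map; hypothetical
    ids = []
    for entity_type in types:
        counters[entity_type] += 1
        ids.append(f"entity-{counters[entity_type]}")
    return ids


def global_ids(types):
    """Assign IDs with a single counter shared by all semantic types."""
    return [f"entity-{i}" for i, _ in enumerate(types, start=1)]


# Entity types in the order they appear in the reported output.
types = [
    "Pathological_formation",
    "Pathological_formation",
    "Pathological_formation",
    "Cell",
    "Pathological_formation",
    "Multi-tissue_structure",
    "Multi-tissue_structure",
]

print(per_type_ids(types))  # entity-1 appears three times, entity-2 twice
print(global_ids(types))    # entity-1 through entity-7, no repeats
```

With per-type counters the ID sequence is 1, 2, 3, 1, 4, 1, 2, which matches the output in the report; a single shared counter removes the collisions.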

spyysalo commented 12 years ago

I think unique IDs would be a benefit for all output options. Miwa-san is currently planning to use NERsuite in an extraction pipeline using the "standoff" output format and would hope to be able to avoid duplicate IDs without running a separate script, if possible.
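For reference, the kind of separate post-processing script this request aims to avoid might look roughly like the Python sketch below. It assumes the input holds a single document and that each entity line carries an `id="entity-N"` attribute exactly as in the output above; `renumber_standoff` is a hypothetical helper, not part of NERsuite.

```python
import re

# Matches the id attribute as it appears in the reported standoff output.
ID_ATTR = re.compile(r'id="entity-\d+"')


def renumber_standoff(lines):
    """Rewrite entity IDs so that they are unique across the given lines."""
    renumbered = []
    next_id = 1
    for line in lines:
        # Replace at most one id attribute per line with the next free number.
        new_line, replaced = ID_ATTR.subn(f'id="entity-{next_id}"', line, count=1)
        if replaced:
            next_id += 1
        renumbered.append(new_line)
    return renumbered


sample = [
    '226 231 entity_name id="entity-3" type="Pathological_formation"',
    '378 389 entity_name id="entity-1" type="Cell"',
    '[...]',
]
for line in renumber_standoff(sample):
    print(line)
```

Lines without an ID attribute pass through unchanged, and offsets and types are left untouched.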

priancho commented 12 years ago

While I was trying to add this functionality today, I found that Sampo had already added it.

spyysalo commented 12 years ago

@priancho : are you sure? If you're referring to 2775f229429343edeee61f56bac1ab5cedcbf91f, that appears to apply to brat-flavored standoff only.

priancho commented 12 years ago

Oh, sorry about my mistake. I am working on this now. The brat output option will soon use unique entity IDs regardless of their semantic types :-)

priancho commented 12 years ago

Now the brat output option (-o brat) generates unique IDs for all entities regardless of their semantic types. It also counts IDs at the document level, whereas the other options (-o conll, -o standoff) still count IDs at the sentence level.
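The difference between document-level and sentence-level counting can be sketched as follows. This is an illustration only, assuming a single shared counter that is either reset at each sentence boundary or kept running; `number_entities` and the two example sentences are hypothetical and do not come from NERsuite.

```python
def number_entities(sentences, per_document=True):
    """Return entity IDs per sentence, counted per document or per sentence."""
    all_ids = []
    counter = 0
    for sentence in sentences:
        if not per_document:
            counter = 0  # sentence-level counting restarts at every sentence
        sentence_ids = []
        for _entity in sentence:
            counter += 1
            sentence_ids.append(f"entity-{counter}")
        all_ids.append(sentence_ids)
    return all_ids


# Two hypothetical sentences holding two and one tagged entities.
doc = [["Cell", "Pathological_formation"], ["Cell"]]

print(number_entities(doc, per_document=False))  # second sentence restarts at entity-1
print(number_entities(doc, per_document=True))   # second sentence continues at entity-3
```

Under sentence-level counting, IDs repeat across sentences of the same document even when a single counter is shared by all types, which is why the standoff output still showed duplicates at this point in the thread.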

spyysalo commented 12 years ago

@priancho : thanks, but I think this issue actually applies to the -o standoff option, not to the -o brat one. From the original:

When run with the -o standoff option, NERsuite output contains duplicate IDs (within a single input document).

priancho commented 12 years ago

Hi, sorry for my mistake. I have applied the same functionality to the standoff format output :-)

spyysalo commented 12 years ago

Great, thanks!

S

On Mon, Jun 25, 2012 at 7:29 PM, Han-Cheol Cho <reply@reply.github.com> wrote:

Hi, sorry for my mistake. I applied the same functionality for the standoff format output :-)

