mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Bug or dead code in SelectiveSGML2TokenSequence.java #187

Open thibolu opened 3 years ago

thibolu commented 3 years ago

Hello, I noticed a potential bug in src/cc/mallet/pipe/SelectiveSGML2TokenSequence.java

On lines 92 and 93 of the file, we have:

nextTag = m.group(0);     
nextTag = sgml.substring(1, sgml.length()-1);

I don't have domain knowledge about this algorithm, but this looks suspicious: the value assigned to nextTag on line 92 is overwritten on line 93 before it is ever read. Either one of the two nextTag assignments targets the wrong variable (maybe it should be nextStart?), or, if this is not a bug, line 92 is dead code and should be removed to avoid confusion.
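
To illustrate why the first assignment has no effect, here is a minimal standalone sketch. It is not the MALLET source: the class name DeadStoreDemo, the regex, and both method names are hypothetical, and I am assuming that sgml holds the matched tag text including its angle brackets. Under those assumptions, dropping the first assignment does not change the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeadStoreDemo {
    // Hypothetical pattern for illustration only; not the pattern MALLET uses.
    static final Pattern SGML_TAG = Pattern.compile("<[^>]+>");

    // Mirrors the two consecutive assignments quoted in this issue.
    static String withBothAssignments(String text) {
        Matcher m = SGML_TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        String sgml = m.group(0); // assumption: sgml is the full matched tag, e.g. "<PERSON>"
        String nextTag;
        nextTag = m.group(0);                            // dead store: never read
        nextTag = sgml.substring(1, sgml.length() - 1);  // overwrites the value above
        return nextTag;
    }

    // Same logic with the first assignment removed.
    static String withSecondAssignmentOnly(String text) {
        Matcher m = SGML_TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        String sgml = m.group(0);
        return sgml.substring(1, sgml.length() - 1);
    }

    public static void main(String[] args) {
        String text = "a <PERSON>Bob</PERSON> b";
        System.out.println(withBothAssignments(text));
        System.out.println(withSecondAssignmentOnly(text));
    }
}
```

Both methods return the same string for any input, which is what makes line 92 either redundant or a sign that a different variable was meant to receive m.group(0).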