knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

ParsCit false positive case in citations #13

Closed: lichili233 closed this issue 10 years ago

lichili233 commented 10 years ago

Hi,

I have noticed that the way ParsCit detects the end of the citation block sometimes causes false positives:

One of my papers has the word "Notes" (note the capital "N") at the beginning of a line. The word is part of a cited title, not the heading of a footnote section or anything similar. ParsCit treated it as the end of the citation block and ignored the many citations that follow it. I believe this error relates to the code in PreProcess.pm, line 313.

It is a minor, low-probability case, but I wanted to let you know.
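As a rough illustration of the failure described above, here is a minimal Python sketch of a rule-based end-of-citations check. It is not the code in PreProcess.pm (which is Perl). The marker pattern, the function name `find_citation_block_end`, and the sample reference lines are all assumptions made up for this example.

```python
import re

# Hypothetical marker pattern: a section-like keyword at the start of a line.
# This is an illustrative assumption, not the pattern used in PreProcess.pm.
END_MARKER = re.compile(r"^(Notes|Appendix|Acknowledgements?)\b")


def find_citation_block_end(lines):
    """Return the index of the first line that looks like a new section
    heading, or len(lines) if no line matches."""
    for i, line in enumerate(lines):
        if END_MARKER.match(line):
            return i
    return len(lines)


# Made-up reference lines: the title of entry [1] wraps so that the
# second line happens to begin with the capitalized word "Notes".
reference_lines = [
    "[1] A. Author. An example title that wraps here: Lecture",
    "Notes on an Example Topic, 2005.",
    "[2] B. Author. Another example paper. 2007.",
]

end = find_citation_block_end(reference_lines)
kept = reference_lines[:end]
print(f"kept {len(kept)} of {len(reference_lines)} lines")
```

Because the check is case-sensitive and anchored to the start of the line, the wrapped title line is mistaken for a heading and everything from that line onward is dropped, including entry [2].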

knmnyn commented 10 years ago

Hi Lichi, thanks for the report. With probabilistic models there will be many such cases. You can re-train ParsCit with positive training instances in which "Notes" is tagged correctly, and the model may learn the correct tagging from those examples.

However, you pointed out that the cause may be in PreProcess.pm, which lies outside the probabilistic model, in the rule-based heuristics that ParsCit applies before the lines enter the CRF model for tagging. We'll take a look and see whether this can be fixed.

junkmechanic commented 10 years ago

Hi Lichi, thanks for your report. We are looking into this issue now and will try to fix it.

junkmechanic commented 10 years ago

Hi Lichi, could you please confirm whether the input you tried was a txt file or an xml file? Thanks.

lichili233 commented 10 years ago

Hi Ankur,

Sorry for the late reply. I had the issue with a txt document. The text was originally extracted from a PDF of an academic paper, and the txt keeps the rough layout of the PDF to some extent. So "Notes" was at the beginning of a line in the original PDF and stayed that way in the txt I ran ParsCit on. If I change it to "notes" (not capitalized) on that line of the txt, ParsCit works without any issue.
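The manual workaround described above could be scripted roughly as follows. This is only a sketch: the helper name `lowercase_leading_word` is hypothetical, and it lowercases the word on every line that begins with it, so it would also alter a genuine "Notes" heading if the document has one.

```python
def lowercase_leading_word(text, word="Notes"):
    """Lowercase `word` on every line that begins with it.

    All other text is left unchanged. Lines are split with splitlines(),
    so a trailing newline in `text` is not preserved.
    """
    fixed = []
    for line in text.splitlines():
        if line.startswith(word):
            # Replace only the leading occurrence; keep the rest of the line.
            line = word.lower() + line[len(word):]
        fixed.append(line)
    return "\n".join(fixed)


# Made-up sample text for illustration only.
sample = "[1] A. Author. An example title: Lecture\nNotes on an Example Topic, 2005."
print(lowercase_leading_word(sample))
```

The result would then be saved as the txt input for ParsCit in place of the original extraction.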

P.S.: I might not have understood ParsCit thoroughly. I related this issue to PreProcess.pm only after skimming through the code very quickly; it is where I guessed the problem most likely came from.

Hope this clarifies things somewhat.

Thanks

junkmechanic commented 10 years ago

Hi Lichi, thanks for the info. That confirmed the underlying problem. We have created a fix and committed it to the 'dev' branch. Please clone this repo again or just pull the new changes, switch to the dev branch using 'git checkout', and try the input you used earlier. Please let us know if you find any more problems. Thanks.