knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

could not identify citations if text "References" not present #10

Closed renaud closed 11 years ago

renaud commented 11 years ago

I ran a quick evaluation of the citation extraction, and it seems that ParsCit will not extract citations if the word "References" is not present in the text. Unfortunately, I have several documents that contain citations but not this word. Any chance of relaxing this, so that citations are identified even when "References" is absent? Thanks, Renaud

knmnyn commented 11 years ago

Hi Renaud:

Great question. Actually a few people have asked about this, so it's good that you opened an issue for it. It's important to modify this, especially if you are using the system for processing other languages.

Currently we identify the beginning of the references section by using the regular expression on ParsCit/PreProcess.pm line 68:

if ($ln_content =~ m/\b(References?|REFERENCES?|Bibliography|BIBLIOGRAPHY|References?\s+and\s+Notes?|References?\s+Cited|REFERENCES?\s+CITED|REFERENCES?\s+AND\s+NOTES?|LITERATURE?\s+CITED?):?\s*$/)

We should probably factor this into a constant so that people can modify it.
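As a rough illustration of that refactoring, the pattern could be compiled once with `qr//` and tested through a small helper. This is only a sketch: the names `$REFERENCE_HEADER_RE` and `is_reference_header` are hypothetical and do not come from ParsCit; only the pattern itself is copied from the line quoted above.

```perl
use strict;
use warnings;

# Hypothetical constant (not a ParsCit name): the header pattern quoted above,
# compiled once so that it can be edited in a single place, for example to
# add section titles in other languages.
our $REFERENCE_HEADER_RE = qr/\b(References?|REFERENCES?|Bibliography|BIBLIOGRAPHY|References?\s+and\s+Notes?|References?\s+Cited|REFERENCES?\s+CITED|REFERENCES?\s+AND\s+NOTES?|LITERATURE?\s+CITED?):?\s*$/;

# Hypothetical helper: returns 1 if a line looks like a reference-section
# header, 0 otherwise.
sub is_reference_header {
    my ($ln_content) = @_;
    return $ln_content =~ $REFERENCE_HEADER_RE ? 1 : 0;
}

print is_reference_header("7. References"), "\n";                          # prints 1
print is_reference_header("References are listed in the appendix"), "\n";  # prints 0
```

Because the pattern is anchored at the end of the line, a heading such as "7. References" matches, while a sentence that merely starts with the word does not.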

knmnyn commented 11 years ago

Oops, sorry, I forgot to mention that we do need to limit where references can appear, which is why ParsCit looks for the marker (the detection is specified in the line above). If you want to modify it to look for reference strings anywhere in the text, you're welcome to do that, but that is not the general functionality most users want (so we wouldn't be incorporating it).

Hope that helps. Closing this issue.
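For anyone who does want to experiment with a relaxed version locally, one possible shape is sketched below. Everything here is an assumption for illustration: the helper names, the shortened stand-in pattern, the choice of the last matching line, and the fallback to a fixed fraction of the document are all hypothetical and are not how ParsCit itself behaves.

```perl
use strict;
use warnings;

# Shortened stand-in for the header pattern quoted earlier in this thread;
# it is NOT ParsCit's full regex.
my $HEADER_RE = qr/\b(References?|REFERENCES?|Bibliography|BIBLIOGRAPHY):?\s*$/;

# Hypothetical helper: index of the last line that looks like a
# reference-section header, or undef when no marker is present.
sub find_reference_start {
    my (@lines) = @_;
    for my $i (reverse 0 .. $#lines) {
        return $i if $lines[$i] =~ $HEADER_RE;
    }
    return;
}

# Hypothetical relaxed variant: when no marker is found, fall back to a
# caller-supplied fraction of the document instead of giving up.
sub find_reference_start_relaxed {
    my ($lines_ref, $fallback_fraction) = @_;
    my $start = find_reference_start(@$lines_ref);
    return $start if defined $start;
    return int(scalar(@$lines_ref) * $fallback_fraction);
}

my @with_marker    = ("Intro", "Body", "References", "[1] A. Author");
my @without_marker = ("Intro", "Body", "[1] A. Author", "[2] B. Author");

print find_reference_start_relaxed(\@with_marker, 0.5), "\n";     # prints 2
print find_reference_start_relaxed(\@without_marker, 0.5), "\n";  # prints 2
```

A positional fallback like this will mislabel body text as references in many documents, which is consistent with the reason given above for not making it the default.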

renaud commented 11 years ago

Thank you Min-Yen, that helps a lot!