knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

France gets coded as month #26

Closed: cmkumar87 closed this issue 5 years ago

cmkumar87 commented 7 years ago

Bug report as received via email: "I am at the stage where I am looking at the dictionaries in SectLabel. I am still writing to you directly since I don't have a mailing list yet ... I was hoping to open one with your collaboration.

My aim is to restructure ParsCit so that it can work with a wider range of data, which can be supplied without changing the Perl code. The dictionaries are of particular interest.

I have rewritten the dictionary code. I have assumed that the probabilities in the dictionary files are not used; at least SectLabel does not appear to use them. My alternative approach is to split the monolithic dictionary into one file per type. I merged the Chinese names into the last names. Example:

    cec@evstu:~/var/dict/token$ head -4 male.txt
    ## Male First Names from lots of languages
    ## source ftp://ftp.funet.fi/pub/doc/dictionaries/DanKlein/
    ##
    aaron

I can then say that if a token is in a dictionary, it gets a feature with the same name as the dictionary; otherwise it does not.

I suspect that your approach aims at the same thing, even though you have one file. But here is the problem. When I implement my approach with an unchanged dictionary (just split into files), I get a different result than your code does. I tracked the issue down to what I suspect is a bug in ParsCit. As illustrated in the attached code, because of the way the dictionary is currently structured, France is a month. Note that I believe the attached code only makes whitespace changes to your original.

I don't see an easy way to fix that without actually breaking the number coding inside the dictionary. Since I don't understand why this numbering system was used, I am reluctant to do that." france_month.txt
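The per-type lookup described in the email can be sketched roughly as follows. This is a minimal illustration in Python, not ParsCit's Perl code: the function names `load_dictionary` and `token_features` are hypothetical, the `place` and `month` entries are made up for the example, and it assumes that lines starting with `##` are headers, as in the `male.txt` excerpt.

```python
def load_dictionary(lines):
    """Return the set of lowercased entries, skipping blank lines and '##' headers."""
    tokens = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("##"):
            continue
        tokens.add(entry.lower())
    return tokens


def token_features(token, dictionaries):
    """Return the names of all dictionaries that contain the token."""
    key = token.lower()
    return sorted(name for name, entries in dictionaries.items() if key in entries)


# Hypothetical per-type dictionaries; in practice each would be read from its own file.
dictionaries = {
    "male": load_dictionary(["## Male First Names from lots of languages", "##", "aaron"]),
    "place": load_dictionary(["## hypothetical place list", "france", "boston"]),
    "month": load_dictionary(["## hypothetical month list", "january", "march"]),
}

print(token_features("France", dictionaries))  # ['place']
print(token_features("Aaron", dictionaries))   # ['male']
```

With one set per type, a token such as "France" can only receive the `month` feature if it is actually listed in the month file, which makes this kind of miscoding easier to spot.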

cmkumar87 commented 7 years ago

Reply from @knmnyn: "You are right, ParsCit as a whole just uses token identity; we have not yet attempted to use any probabilities. The numbering system was designed to allow a token to participate in more than one identity (e.g., "Boston" as a location and as part of a publisher's name). I'm not sure why France was coded as a month, but I didn't see the specific code in your excerpt that causes this."
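One common way to let a single number carry several identities is to give each type its own bit and combine them with bitwise OR. The sketch below illustrates that general idea in Python; the `TokenType` values and the helpers `add_entry` and `has_type` are assumptions made for the example, and the thread does not confirm that ParsCit's dictionary uses these exact codes or this mechanism.

```python
from enum import IntFlag


class TokenType(IntFlag):
    # Hypothetical codes: one bit per dictionary type.
    MALE = 1
    FEMALE = 2
    LAST = 4
    MONTH = 8
    PLACE = 16
    PUBLISHER = 32


def add_entry(codes, token, token_type):
    """Record that the token belongs to token_type, keeping any earlier types."""
    codes[token] = codes.get(token, TokenType(0)) | token_type


def has_type(codes, token, token_type):
    """Check whether the token carries the given type."""
    return bool(codes.get(token, TokenType(0)) & token_type)


codes = {}
add_entry(codes, "boston", TokenType.PLACE)
add_entry(codes, "boston", TokenType.PUBLISHER)
add_entry(codes, "boston", TokenType.PLACE)  # a duplicate entry changes nothing under OR

print(int(codes["boston"]))                          # 48
print(has_type(codes, "boston", TokenType.PLACE))    # True
print(has_type(codes, "boston", TokenType.MONTH))    # False
```

Combining with OR rather than addition means a token that is listed twice under the same type cannot accidentally pick up the bit of a different type.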

cmkumar87 commented 5 years ago

Closing. We are no longer developing ParsCit. Please use the neural version of the parser here: https://github.com/WING-NUS/Neural-ParsCit. This is a Theano-based version. We are actively developing a PyTorch-based version, which we will release soon.