knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

Heuristics for journal training / ParsHed does not work #4

Closed: adibaba closed this issue 12 years ago

adibaba commented 12 years ago

Hello Min,

We have trained ParsCit to extract references from several journals. One result is a set of heuristics for training: http://pgknowaan.wordpress.com/2011/09/18/how-to-train-parscit-for-scientific-journals/

The training for ParsHed, however, does not work :-\ We adapted the code from http://aye.comp.nus.edu.sg/parsCit/#faq and used it like this:

#!/bin/bash
cd /opt/ParsCit/crfpp/traindata
../../bin/tr2crfpp.pl tagged_headers.txt > parsHed.train.test.data
../crf_learn parsHed.template parsHed.train.test.data parsHed.test.model
mv parsHed.test.model ../../resources/parsHed/parsHed.test.model

There are problems in the CRF++ process:

# Copyright 2005 © by Min-Yen Kan

CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2008 Taku Kudo, All rights reserved.

reading training data: tagger.cpp(162) [feature_index_->buildFeatures(this)] feature.cpp(154) [apply_rule(&os, *it, cur, *tagger)]  format error: U01:%x[0,23]
0.01 s

How can we fix this?

Best regards, Adrian

adibaba commented 12 years ago

The code above was adapted from the FAQ entry 'How about retraining ParsCit for another language/domain?' at http://aye.comp.nus.edu.sg/parsCit/#t:

    $ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
    $ ../crf_learn parsCit.template parsCit.train.data model
    $ mv model ../../resources/parsCit.model

adibaba commented 12 years ago

To train ParsHed, use the header-specific conversion script instead of tr2crfpp.pl:

bin/parsHed/tr2crfpp_parsHed.pl -in tagged_headers.txt -out parsHed.train.test.data
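
Combined with the script from the first comment, the corrected training run might look like the sketch below. It assumes the same /opt/ParsCit layout as before and that the converter sits at bin/parsHed/tr2crfpp_parsHed.pl under that root; the PARSCIT_HOME variable and the train_parshed function name are only illustrative and not part of ParsCit.

```shell
#!/bin/bash
# Sketch only: retrain ParsHed with the header-specific converter.
# PARSCIT_HOME and train_parshed are hypothetical names; the default path
# is the install location used in the first comment of this issue.
train_parshed() {
    local home="${PARSCIT_HOME:-/opt/ParsCit}"
    cd "$home/crfpp/traindata" || return 1

    # Convert the tagged headers into CRF++ input with the ParsHed converter
    # (not tr2crfpp.pl, which produces the reference-string feature set).
    ../../bin/parsHed/tr2crfpp_parsHed.pl -in tagged_headers.txt \
        -out parsHed.train.test.data || return 1

    # Train the model and move it next to the other ParsHed resources.
    ../crf_learn parsHed.template parsHed.train.test.data parsHed.test.model || return 1
    mv parsHed.test.model ../../resources/parsHed/parsHed.test.model
}
```

After sourcing the file, run `train_parshed`, or `PARSCIT_HOME=/some/other/path train_parshed` for a different install location.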

knmnyn commented 12 years ago

Hi Adrian,

Thanks for your bug report. I've asked Huy, our current RA, to look into these problems. We'll hopefully be able to get back to you soon.

Cheers,

Min

Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive Singapore 117417 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) :: kanmy@comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)

adibaba commented 12 years ago

Hello Min,

Please do not bother him; it was my fault. I used the wrong script (tr2crfpp.pl instead of bin/parsHed/tr2crfpp_parsHed.pl) :-\ See https://github.com/knmnyn/ParsCit/issues/4#issuecomment-2320088

Best regards, Adrian
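
For anyone who hits the same `format error: U01:%x[0,23]`: as far as I understand CRF++, a `%x[row,col]` macro in the template refers to a zero-based column of the training file, and the last column is reserved for the label. The error would then mean that the template asks for column 23 while the file produced by the wrong converter has fewer feature columns. A rough check, assuming whitespace-separated columns (the function names are made up for this sketch):

```shell
#!/bin/bash
# Highest column index used by %x[row,col] macros in a CRF++ template.
max_template_col() {
    grep -oE '%x\[-?[0-9]+,[0-9]+]' "$1" \
        | sed 's/.*,\([0-9]*\)]$/\1/' \
        | sort -n | tail -n 1
}

# Smallest number of whitespace-separated columns on any non-empty data line.
min_data_cols() {
    awk 'NF > 0 && (min == "" || NF < min) { min = NF } END { print min + 0 }' "$1"
}

# Compare both; the last data column is the label, so the usable
# feature columns are 0 .. cols-2 (assumption about CRF++ stated above).
check_template_fits() {
    local template="$1" data="$2" max_col cols
    max_col=$(max_template_col "$template")
    cols=$(min_data_cols "$data")
    if [ "${max_col:-0}" -ge $((cols - 1)) ]; then
        echo "template uses column $max_col, data has only $((cols - 1)) feature columns"
        return 1
    fi
    echo "ok: template column $max_col fits $((cols - 1)) feature columns"
}
```

For example, `check_template_fits parsHed.template parsHed.train.test.data` run in crfpp/traindata before calling crf_learn.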
