Closed by davidrapoport 10 years ago
It'd be good to have some source files after iconv to test with. David, can you provide these?
Hi David,
"Before running ParsCit I run 'iconv -f utf-8 -t utf-8 -c' because otherwise I would get a UTF error."
I notice that your iconv command specifies utf-8 as both the source and the target encoding. If your input is already UTF-8, why convert it to UTF-8? Could you please check what your input encoding actually is, and whether iconv is changing anything at all?
Thanks!
Muthu
Hi Muthu, before sending a paper to the webservice I run pdftotext -raw.
pdftotext version 0.24.5 Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org Copyright 1996-2011 Glyph & Cog, LLC
I run iconv this way because, if I do not, some papers give me this error:
Malformed UTF-8 character (unexpected continuation byte 0xad, with no preceding start byte) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.
Malformed UTF-8 character (unexpected non-continuation byte 0x61, immediately after start byte 0xe9) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.
However, I have disabled any preprocessing on the papers and I still receive the original error when run with certain papers. I will email a list of papers which have caused the error.
Attached are 3 papers (pdf and result after running pdftotext -raw).
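One way to answer the question of whether iconv is changing anything is to check the pdftotext output for bytes that are not valid UTF-8. The following is only a minimal Python sketch, not part of ParsCit: the helper names `find_invalid_utf8` and `strip_invalid_utf8` are made up for illustration, and dropping undecodable bytes with `errors="ignore"` is assumed to be roughly comparable to what `iconv -f utf-8 -t utf-8 -c` does, not guaranteed to be byte-for-byte identical.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte) pairs for bytes that cannot be decoded as UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the input is valid UTF-8
        except UnicodeDecodeError as err:
            start = pos + err.start  # err.start is relative to the slice
            bad.append((start, data[start]))
            pos = start + 1  # skip the offending byte and keep scanning
    return bad


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop undecodable bytes (assumed to approximate `iconv -c`)."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical samples built from the byte values in the error messages:
    # a stray continuation byte 0xad, and a start byte 0xe9 followed by 0x61.
    for sample in (b"soft\xadhyphen", b"caf\xe9a"):
        for offset, byte in find_invalid_utf8(sample):
            print(f"invalid byte 0x{byte:02x} at offset {offset}")
        print(strip_invalid_utf8(sample))
```

If `find_invalid_utf8` returns an empty list for a file, the iconv step should be a no-op for that file, and the error would have to come from somewhere else.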
David's files work on our webservice at http://aye.comp.nus.edu.sg/parsCit/. The download we provide on the same page is a replica of the codebase that runs our webservice, so we aren't sure what's causing the reported error on David's end. Please get in touch with us if you have any more specific error logs.
Thanks!
I run "iconv -f utf-8 -t utf-8 -c" before each paper is extracted. Email me at drapoport847 at gmail dot com for a list of papers that cause this error.
GNU nano 2.0.6 File: log.txt
184.175.2.245 - - [19/Aug/2014 11:58:51] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in'
Die: SectLabel::Controller::getGenericHeaders different in number of headers 38 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:00:49] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:02:16] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:03:18] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 13 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:04:20] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:05:27] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 23 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:06:39] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:08:09] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:10:21] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:12:57] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 15 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:14:18] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:16:18] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:17:58] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:19:15] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 9 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:20:37] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:21:43] "POST /pc/upload HTTP/1.1" 200 -
split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
split': invalid byte sequence in US-ASCII (ArgumentError) from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in
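The `invalid byte sequence in US-ASCII` lines suggest that the Ruby script treated the text as US-ASCII instead of UTF-8. One possibility worth checking, which is an assumption and not something confirmed in this thread, is that the local machine's locale or default encoding differs from the webservice's. The Python sketch below only illustrates the underlying distinction: the same bytes can be valid UTF-8 and still be rejected when read as ASCII. The helper name `decodes_as` and the sample text are made up for illustration.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly under `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Hypothetical sample: valid UTF-8 text containing non-ASCII characters.
    sample = "naïve résumé".encode("utf-8")
    print(decodes_as(sample, "utf-8"))  # True
    print(decodes_as(sample, "ascii"))  # False
```

If a paper passes the UTF-8 check but fails the ASCII one, cleaning the file with iconv would not be expected to help, because the bytes are already valid UTF-8.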