knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

different in number of headers vs. the number of generic headers #15

Closed davidrapoport closed 9 years ago

davidrapoport commented 10 years ago

"iconv -f utf-8 -t utf-8 -c " run before each paper is extracted. Email me at drapoport847 at gmail dot com for a list of papers which cause this error

GNU nano 2.0.6 File: log.txt

184.175.2.245 - - [19/Aug/2014 11:58:51] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 38 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:00:49] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:02:16] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:03:18] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 13 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:04:20] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:05:27] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 23 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:06:39] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:08:09] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:10:21] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:12:57] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 15 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:14:18] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:16:18] "POST /pc/upload HTTP/1.1" 200 -
Citation text longer than article body: ignoring
184.175.2.245 - - [19/Aug/2014 12:17:58] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:19:15] "POST /pc/upload HTTP/1.1" 200 -
/Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in split': invalid byte sequence in US-ASCII (ArgumentError)
from /Users/logp/ParsCit/bin/sectLabel/genericSect/extractFeature.rb:35:in '
Die: SectLabel::Controller::getGenericHeaders different in number of headers 9 vs. the number of generic headers 0
184.175.2.245 - - [19/Aug/2014 12:20:37] "POST /pc/upload HTTP/1.1" 200 -
184.175.2.245 - - [19/Aug/2014 12:21:43] "POST /pc/upload HTTP/1.1" 200 -

knmnyn commented 10 years ago

It'd be good to have some source files after iconv to test with. David, can you provide these?

cmkumar87 commented 10 years ago

Hi David

"Before running ParsCit I run "iconv -f utf-8 -t utf-8 -c " because otherwise I would get a UTF error."

I notice that your iconv command specifies utf-8 as both the source and the target encoding. If your input is already in utf-8, why would you convert it to utf-8? Could you please check what your input encoding actually is, and whether iconv is converting anything at all?
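One way to answer that question is to compare a file's bytes before and after a lossy UTF-8 round trip. The sketch below is Python, not part of ParsCit, and the function name `utf8_report` is hypothetical. Decoding with `errors="ignore"` silently drops invalid sequences, which is assumed here to approximate what `iconv -f utf-8 -t utf-8 -c` does; it has not been checked against iconv byte for byte.

```python
def utf8_report(raw: bytes) -> dict:
    """Report whether raw is valid UTF-8 and how many bytes a lossy cleanup drops.

    Hypothetical helper for illustration; not part of ParsCit.
    """
    # errors="ignore" discards every byte sequence that is not valid UTF-8.
    cleaned = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return {
        "valid": cleaned == raw,                   # unchanged means already valid
        "dropped_bytes": len(raw) - len(cleaned),  # bytes the cleanup removed
    }


if __name__ == "__main__":
    import sys

    # Usage (assumed): python utf8_report.py paper.txt
    with open(sys.argv[1], "rb") as handle:
        print(utf8_report(handle.read()))
```

A nonzero `dropped_bytes` would mean the extracted text was not valid UTF-8 and that the cleanup is discarding data rather than converting it.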

Thanks!

Muthu

davidrapoport commented 10 years ago

Hi Muthu, before sending each paper to the webservice I run pdftotext -raw.

pdftotext version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

I run iconv this way because otherwise some papers give me this error:

Malformed UTF-8 character (unexpected continuation byte 0xad, with no preceding start byte) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.
Malformed UTF-8 character (unexpected non-continuation byte 0x61, immediately after start byte 0xe9) in pattern match (m//) at /Users/logp/ParsCit/bin/../lib/SectLabel/Tr2crfpp.pm line 216.

However, I have disabled all preprocessing of the papers and I still receive the original error with certain papers. I will email a list of the papers that have caused the error.
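To see which bytes trigger warnings like the Malformed UTF-8 ones quoted in this comment, the invalid sequences in a file can be listed with their offsets. This is a minimal Python sketch and not part of ParsCit; `invalid_utf8_spans` is a hypothetical helper, and the sample bytes are made up to mimic the 0xad and 0xe9 cases from the warnings.

```python
def invalid_utf8_spans(raw: bytes) -> list:
    """Return (offset, offending_bytes) for each invalid UTF-8 sequence in raw.

    Hypothetical helper for illustration; not part of ParsCit.
    """
    spans = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            start, end = pos + err.start, pos + err.end
            spans.append((start, raw[start:end]))
            pos = end  # resume scanning after the bad sequence
    return spans


if __name__ == "__main__":
    # Made-up sample: a stray continuation byte, then a start byte followed by "a".
    sample = b"ab\xadcd\xe9a"
    for offset, bad in invalid_utf8_spans(sample):
        print(offset, bad.hex())
```

Running this over the output of pdftotext -raw for a failing paper would show whether the problem bytes are already present before any preprocessing.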

davidrapoport commented 10 years ago

Attached are 3 papers (pdf and result after running pdftotext -raw).

cmkumar87 commented 9 years ago

David's files work on our webservice at http://aye.comp.nus.edu.sg/parsCit/. The download we provide on the same page is a replica of the codebase that runs our webservice, so we aren't sure what is causing the reported error on David's end. Please get in touch with us if you have any more specific error logs.

Thanks!