soprasteria / cybersecurity-dfm

Data Feed Manager (news watch orchestrator to predict topic with deepdetect and store cleaned text in elasticsearch)
GNU General Public License v3.0

Wrong Content-Type #6

Open acabrol opened 6 years ago

acabrol commented 6 years ago

DFM doesn't get the correct content type for some documents.

Here is an example:

curl request:

curl -I -XGET https://arxiv.org/pdf/1801.01681v1.pdf
HTTP/1.1 200 OK
Date: Thu, 11 Jan 2018 08:38:34 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=86.250.248.55.1515659914652413; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Mon, 08 Jan 2018 01:42:36 GMT
ETag: "16b79425-213180-56239e9adddf8"
Accept-Ranges: bytes
Content-Length: 2175360
Access-Control-Allow-Origin: *
Content-Type: application/pdf

DFM Log:

DEBUG in feed [cybersecurity-dfm/dfm/feed.py:572]:
Content-Type:text/html; charset=utf-8 url:https://arxiv.org/pdf/1801.01681v1.pdf

acabrol commented 6 years ago

As a workaround, the PDF MIME type is forced when the link contains ".pdf".
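
The workaround could look roughly like the sketch below. This is not the code in dfm/feed.py: the function name `effective_content_type` and the constant `PDF_MIME` are hypothetical, and checking only the URL path (rather than the whole link) is an assumption.

```python
from urllib.parse import urlparse

PDF_MIME = "application/pdf"  # hypothetical constant, not taken from DFM


def effective_content_type(url, header_value):
    """Return the media type to use for `url` (hypothetical helper).

    Forces the PDF MIME type when the URL path contains ".pdf";
    otherwise falls back to the Content-Type header reported by the server.
    """
    # Workaround: trust the link over the header when it contains ".pdf".
    if ".pdf" in urlparse(url).path.lower():
        return PDF_MIME
    # Keep only the media type, dropping parameters such as "charset=utf-8".
    return (header_value or "").split(";", 1)[0].strip().lower()


print(effective_content_type(
    "https://arxiv.org/pdf/1801.01681v1.pdf", "text/html; charset=utf-8"))
```

Note that this only changes the label: if the server actually returned an HTML body, the downstream PDF conversion will still receive HTML.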

However, for arXiv the PDF files seem to be in a non-standard format:

ShellError: The command pdftotext /tmp/tmp87ul1x - failed with exit code 1
------------- stdout -------------
------------- stderr -------------
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (2): Illegal character <21> in hex string
Syntax Error (4): Illegal character <4f> in hex string
Syntax Error (6): Illegal character <54> in hex string
Syntax Error (7): Illegal character <59> in hex string
Syntax Error (8): Illegal character <50> in hex string
Syntax Error (11): Illegal character <48> in hex string
Syntax Error (12): Illegal character <54> in hex string
Syntax Error (13): Illegal character <4d> in hex string
Syntax Error (14): Illegal character <4c> in hex string
Syntax Error (16): Illegal character <50> in hex string
Syntax Error (17): Illegal character <55> in hex string
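
Decoding the hex codes in that stderr output hints at what the temporary file contains. The sketch below converts each reported code to its ASCII character; it also shows a simple magic-byte check that could be run before calling pdftotext. The helper name `looks_like_pdf` is hypothetical and not part of DFM.

```python
# Hex codes copied from the pdftotext stderr output, in order of appearance.
illegal_hex = ["21", "4f", "54", "59", "50", "48", "54", "4d", "4c", "50", "55"]

# Convert each code to the ASCII character it represents.
decoded = "".join(chr(int(code, 16)) for code in illegal_hex)
print(decoded)  # !OTYPHTMLPU


def looks_like_pdf(data: bytes) -> bool:
    """Hypothetical pre-check: does the payload start with the PDF magic bytes?"""
    return data.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.5\n"))              # True
print(looks_like_pdf(b"<!DOCTYPE HTML PUBLIC"))   # False
```

The decoded characters are "!OTYPHTMLPU". The letters D, C and E are valid hex digits, so pdftotext would not report them, and the offsets 2 to 17 line up with a file beginning `<!DOCTYPE HTML PU`. That suggests, without proving it, that the downloaded file is an HTML page rather than a malformed PDF, which would be consistent with the `text/html` Content-Type in the DFM log.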