robbi5 / kleineanfragen

Collecting kleine Anfragen from Parlamentsdokumentationssystemen for easy search- and linkability
https://kleineanfragen.de
MIT License

Rotate pages before extraction #98

Open robbi5 opened 8 years ago

robbi5 commented 8 years ago

Some papers have broken, unsearchable extracted text because some of their pages are rotated and would need to be rotated back before text extraction.

Example: https://kleineanfragen.de/schleswig-holstein/18/406
Extracted text: https://kleineanfragen.de/schleswig-holstein/18/406-gremienmitgliedschaften-der-regierungsmitglieder-und-staatssekretaere.txt

Fr
ag

e
n

 1
,3

 u
n

d
 4

:  
G

re
m

ie
n

 im
 S

in
n

e 
d
robbi5 commented 8 years ago

Apache Tika bug: https://issues.apache.org/jira/browse/TIKA-723 ("Rotated text isn't extracted correctly from PDFs")
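
Not a fix for the extraction itself, but a rough sketch of how affected papers might be found from the symptom in the excerpt above, where nearly every line of extracted text is only one to three characters long. This is a standalone Python example, not code from this project. The function names, the `max_len` cutoff, and the `threshold` value are assumptions chosen for illustration and would need tuning against real extracted text.

```python
def short_line_ratio(text: str, max_len: int = 3) -> float:
    """Return the fraction of non-empty lines that are at most max_len characters long."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    short = sum(1 for line in lines if len(line) <= max_len)
    return short / len(lines)


def looks_rotated(text: str, max_len: int = 3, threshold: float = 0.6) -> bool:
    """Heuristic: flag text whose lines are mostly tiny fragments.

    Both max_len and threshold are illustrative guesses, not measured values.
    """
    return short_line_ratio(text, max_len) >= threshold


# The first lines of the broken excerpt from this issue.
garbled = "Fr\nag\n\ne\nn\n\n 1\n,3\n\n u\nn\n\nd\n 4\n"

# A made-up sample of ordinary extracted text, for comparison only.
normal = "This is an ordinary line of extracted text.\nAnd here is another one.\n"

if __name__ == "__main__":
    print(looks_rotated(garbled))
    print(looks_rotated(normal))
```

A check like this could only flag candidates for re-extraction. The pages would still have to be rotated before extracting, or the Tika bug worked around.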