jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

pdfplumber extracting wrong text from pdf #815

Closed: siddhantajain closed this issue 1 year ago

siddhantajain commented 1 year ago

I'm trying to extract text from a PDF that contains text only on the second-to-last page. I'm using the extract_text() function, but the text it returns for that page is unreadable.

RH_Q4 2022_Prepared Remarks_jm.pdf

The text is on the second-to-last page of the PDF.
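
For reference, a minimal sketch of the kind of call described above. The original code was not posted, so this is an assumption about how the page was selected; the helper names `second_to_last_page_text` and `extract_from_file` are hypothetical, and the file is assumed to sit in the working directory.

```python
def second_to_last_page_text(pdf):
    """Return extract_text() output for the second-to-last page.

    `pdf` is any object with a `pages` sequence whose items expose
    `extract_text()`, such as the object returned by pdfplumber.open().
    """
    if len(pdf.pages) < 2:
        raise ValueError("PDF has fewer than two pages")
    # Index -2 selects the second-to-last page.
    return pdf.pages[-2].extract_text()


def extract_from_file(path):
    """Open `path` with pdfplumber and extract the second-to-last page."""
    import pdfplumber  # imported here so the helper above has no hard dependency

    with pdfplumber.open(path) as pdf:
        return second_to_last_page_text(pdf)


# Example call (file name taken from the attachment above):
# print(extract_from_file("RH_Q4 2022_Prepared Remarks_jm.pdf"))
```
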

The output I got:

[i.i] RobeIfr t Ha\nProtciovnittiitn ohu aevsaev ersyt ropnigp ealcirnoaesn is n creadsiivnegorlffsyee r oifn g\nsolutBioottnhhsr e.e gulraitasonkrcd yo mpliancea nptdre acchtniocleo,g yp rcaocntsiuclet ing\nshopwa rtiscturleanIrgn2 t 0h2P.2r otaicvhiiterive ecdo rdr-ehviegnohufn e esa r$l2by i l-lion\nevewnh iolvee rcotmhiweni gn d-doofavw enr lya rfgien asnecrivapilrc oejsae ncadts hiifntt h e\ntreonfpd u bsleiccet onrg agemteopn rtosj meocrtaesp plitcota abllseeon ltu tDieomnasnf.do r\nProtisveirtviri'ecsme asri onbsua snitd os n lmyi lidmlpya cbtyce udr reecnotn ocmoincd itions.\nWhitlhee rreem avionlsa itnti hlmeia tcyr oecoennovmiirco nwmeae rnoetp ,t imaibsotouiutcr\noutlfoor2o 0k2 W3e.h avseu ccesnsafvuilglmayat neeydc onocmyiccle easct,hi maec hieving\nhighpeera kTsh.iw sa sd emonstbryoa utareb di tloai cthyi tehvfeea stest irnoe ucro very\ncompanhyi'sstf oorlyl otwhCieOn VgI Dd-o1w9n tWuer na.cl osnot itnobu een effrioPtmr otiviti's\nresilwiheinsccthye ,fm rso imtd si verssoilfuiteoidffo enrsit nhgaasrtm e u clhe stsit eotd h e\neconocmyiccl e.\nLongteerrw me,a reen courbaytg hegedr owatnhmd a rgpirno spfercootmus or n gofioncgu s\nons ervriecleastt ote adl weinththi ghleervs ekli Tlhlessi.en clMuadnea gemReensto urces,\nFull-Etnigmaeg ePmreonfte ssMiaonnaagSleosdl, u tRioobneHsra,tTl efc hnoalnodPg ryo tiviti.\nIna ddittihsoetn r,u csthuitrforat el m owtoer pka,r ticwuilthahir glhsyek ric lrlesa,nt eews\ncompetaidtviavneat sai ghtei ghloiugnrhu tmse rsoturse nigntchlsuo,du girln ogbb raaln odff,i ce\nnetwocrakn,d iddaattaeb aansaded vanAcle-dd rtievcehnn olAolgsoioue,vrs e .rs yu ccessful\ninvestimnie nnntosv aanttdie ocnh no-lwohgiycc ohn ti-npuoes iutsit oomn e aningfully\nimprboovteth hd ei gaintrdae lc reuxipteerri feoonruc crel ieanntcdsa ndidaanttdeh isen ternal\nproducotfoi uvsrit tayff .\nWer emacionm mittoot uetrdi me-tceosrtpeopdru artpeot soce o,n npeecotp tlome e aningful\nanedx ciwtoirnakgn pdr ovcildieew nitttshht ea laennsdtu 
bjecte-xmpaetrtttheiernsy ee e tdo\nconfidceonmtpleaytn egd r ow.\nIc ounlodbt e m orper ooufad l olug rl otbeaalm isn,c lutdailsneognl tu tPiroontsai,nv di ti\ncorposreartvepi rcoefse sswihoohn aavlpesu s,to m ucehn eragnydd e diciantotiuoorr ne sults\nthyiesa Trh.e eiffro rmtasdp eo ssair belceon rudm boefar w aradnsad c colian2d 0e2s2 .\nFourth-rqeucaorgtnieinrtc iloubnde eindna gm eadso noef t hBee sWto rkplfaocPrea sr eTnM ts\nanhdo nobryeF do rbaesos n oef t hWeo rlTdo'Fpse male-FCroimepnadnlWiyeea sr.e\nparticpurlooaufrtd lh ryee cognwiect oinotnit noru eec efoirov uecr o mmittmode invte rsity,\nequaintidyn clusion.\nNowM,i kaen Idw oubledh apptoya nswyeoruq ru estPiloenasas.sej k u osntqe u estainoadn\nsinfogllleo wa-snu epe,d Ietfdh .e rtei'mwsee ,'c lolmb ea ctkoy ofuo ard ditqiuoensatli ons.\nQ&AS ession\nM.KEITWHA DDELPLR,E SIDAENNDTC EOR,O BERHTA LF:\nThawta so ulra qsute stTihoanny.ko f uo jro inuistn ogd ay.\nOPERATOR:\nThicso ncltuoddeasty e'lse confIyefor muei nscseae.nd py a ortft hcea liwlti, bl ela rchiivne d\naudfioor miantt hI en veCsetnotroe fRr o beHratl fs waetrb osbietret h.Ya oluaf l.sccooam dni al\nthceo nferceanrlcelep lDaiya.ld -eitnaa inltdsh c eo nfircmoadtaeir coeon n taiintn heed\nCompanpyr'erss esl eiassseue eadr ltioedra y.\nRobeHraQtl4 f2 02F2i nanRceisaull ts CCoanlPflre,er peanrceed Rem2a6r2,k0 s2,3 January 6\n©2 02R3o beHratIl nft ernIantAcin.Eo qnuaOalpl p ortEumnpiltoMyy/ eFr/ DisabilityNeterans."

jsvine commented 1 year ago

Hi @siddhantajain, and thanks for your interest in this library. A few observations:

Screen Shot
[...]
Protiviti continues to have a very strong pipeline across an increasingly diverse offering of
solutions. Both the regulatory risk and compliance and practice, technology practice consulting
show particular strength. In 2022 Protiviti achieved record-high revenues of nearly $2 billion -
even while overcoming the wind-down of a very large financial services project and a shift in the
trend of public sector engagements to projects more applicable to talent solutions. Demand for
Protiviti's services remains robust and is only mildly impacted by current economic conditions.
[...]

If it helps, see https://github.com/jsvine/pdfplumber/issues/764#issuecomment-1376936368 for a similar discussion.