gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

Retrieved information from PDF file is not readable. #8

Closed: sangee2004 closed this issue 3 months ago

sangee2004 commented 4 months ago

Steps to reproduce the problem:

  1. Ingest this PDF file: Insurance_Handbook_20103.pdf. This succeeds.
  2. Try to retrieve information from this file. The retrieved content is garbled and not readable.

%knowledge retrieve -k 10 -d testnewpdf "What are the different types of insurance that needs to be purchased by business owners?" Retrieved the following 10 sources for the query "What are the different types of insurance that needs to be purchased by business owners?" from dataset "testnewpdf": [{"content":"hofp6R1SroAn in€ho\u0005e usiness \u0001oic\t \u0001ro\fides \u0005ore co\u0005\u0001rehensi\fe co\ferage \u0006or usiness e˜ui\u0001\u0005ent and iaiit\t than a \nho\u0005eowners \u0001oic\t endorse\u0005ent\u0003 „an\t insurance co\u0005\u0001anies o\u0006\u0006er \ninsurance \u0001oicies s\u0001eci\u0006ica\t taiored to s\u0005a usiness\u0003 bokThRBihhpzBilhofp6R1So9knf0r he ho\u0005e usiness \u0005ight e eigie \u0006or he ‡usinessowners Œoic\t \u0004‡ƒŒ\u0002\b see ao\fe\u0003 he ke\t to whether a usiness \nowner is eigie \u0006or a ‡ƒŒ is the si…e o\u0006 the \u0001re\u0005ises\b the i\u0005its o\u0006 iaiit\t \nre˜uired\b the t\t\u0001e o\u0006 co\u0005\u0005ercia o\u0001eration it is and the eŸtent o\u0006 its \no\u0006\u0006€\u0001re\u0005ises ser\ficing and \u0001rocessing acti\fities\u0003 A ‡ƒŒ\b ike an in€ho\u0005e \nusiness \u0001oic\t\b co\fers usiness \u0001ro\u0001ert\t and e˜ui\u0001\u0005ent\b oss o\u0006 inco\u0005e\b \neŸtra eŸ\u0001ense and iaiit\t¥ howe\fer\b the ‡ƒŒ \u0001ro\fides these co\ferages on a \n\u0005uch roader scae\u0003Business InsuranceInsurance Basics","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"22","totalPages":"205"},"similarity_score":0.82956576},{"content":"ides these co\ferages on a \n\u0005uch roader scae\u0003Business InsuranceInsurance Basics","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"22","totalPages":"205"},"similarity_score":0.8260809},{"content":"d lnes axc ibix DaPPnxC anltlaDsk S,ir ali xes tebilic -Del caGaCi –r unxc el cnpiapikHomeowners InsuranceInsurance 
Basics","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"12","totalPages":"205"},"similarity_score":0.81536484},{"content":"Insusrace HdcborackTHhd.lTcrace.l. .oirtypfsftmnmpyfvyftInsurance Handbook A guide to insurance: what it does and how it works","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"2","totalPages":"205"},"similarity_score":0.81512237},{"content":"Insurance Handbook A guide to insurance: what it does and how it works","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"1","totalPages":"205"},"similarity_score":0.81444156},{"content":"ises or esewhere\b or in auto \naccidents whie on usiness\u0003 \u0007t aso co\fers work€reated inesses\u0003 †orkers co\u0005€Insurance BasicsBusiness Insurance","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"19","totalPages":"205"},"similarity_score":0.8139653},{"content":"Insurance Topics \u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000buraA g g gA uidetuonAsturacc:AAAwwwghhhgcek\rhuidetuon\fturacc:AAkb\u0001u\u0002o Insurance\u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000bura Insurance Topics“orkers …o„pensa\u0002ionracblRCb–DhlCdbvaoJhW 5 bBcbdnk yAcHdoer Hnd2decydnobcnoayrHrdecbcanS2neRcnrHar.rauoxndeoecd7nscHcboxx2nS2nk yyrddr Hdn bnS obadnhR dcnbcdA HdrSrxre2nrenrdne ncHdubcnk yAxroHkcnhreRneRcnxohd7nrH.cdersoecnoHanackracnardAuecankodcd7npoHank xxckenaoeoDnGHny dendeoecdncyAx 2cbdnobcnbc(urbcane nBccAnbck badn lnokkr0pacHedDnIkkracHednyudenScnbcA becane neRcnh bBcbdnk yAcHdoer HnS obanoHane npeRcnk yAoH2OdnrHdubcbnhreRrHnondAckrlrcanHuyScbn lnao2dD5 bBcbdnk yAcHdoer Hnk .cbdnoHnrH9ubcanh bBcbOdnycarkoxnkobcnoHanoeecyAedne nk .cbnRrdn bnRcbnck H yrknx ddDnfRrdnrHkxuacdnx ddn lncobHrHsdnoHanpeRcncXeboncXAcHdcdnodd kroecanhreRneRcnrH9ub2DnGH9ubcanh bBcbdnbckcr.cnoxxnycar0pkoxx2nHckcddob2noHanoAAb 
Abroecneb","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"82","totalPages":"205"},"similarity_score":0.8112802},{"content":"Insurance Topics \u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000buraA g g gA uidetuonAsturacc:AAAwwwghhhgcek\rhuidetuon\fturacc:AAkH\u0001u\u0002o Insurance\u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000bura Insurance Topics“orkers …o„pensa\u0002ionvaoJCohbPadGCthplnat5 bBcbdnk yAcHdoer HnrHduboHkcnk .cbdneRcnk den lnycarkoxnkobcnoHanbcRoSrxreo0er Hnl bnh bBcbdnrH9ubcan HneRcn9 SDnGenoxd nk yAcHdoecdneRcynl bnx denhoscdnoHanpAb .racdnacoeRnScHclrednl bneRcrbnacAcHacHednrlneRc2nobcnBrxxcanrHnh bB0bcxoecanpokkracHed7nrHkxuarHsnecbb brdenoeeokBdDnfRcnh bBcbdnk yAcHdoer Hnd2decynrdneRcnpJcXkxudr.cnbcyca2/nl bn H0eRc09 SnrH9ubrcdndullcbcanS2ncyAx 2ccdDnIdnAoben lneRcnpd kroxnk HebokencyScaacanrHncokRndeoecOdnxoh7neRcncyAx 2ccnsr.cdnuAneRcnbrsRene npducneRcncyAx 2cbnl bnrH9ubrcdnkoudcanS2neRcncyAx 2cbOdnHcsxrscHkcnoHanrHnbceubHnpbckcr.cdnh bBcbdnk yAcHdoer HnScHclrednbcsobaxcddn lnhR n bnhRoenkoudcaneRcnpokkracHe7nodnx HsnodnrenRoAAcHcanrHneRcnh bBAxokcn","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"80","totalPages":"205"},"similarity_score":0.8104987},{"content":"Insurance Topics \u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000buraA g g gA uidetuonAsturacc:AAAwwwghhhgcek\rhuidetuon\fturacc:AAo\u0010\u0001u\u0002o Insurance\u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000bura Insurance TopicsTerroris„ .isk and InsuranceyCooaonhdbinhJbptSbgtheoptuCb:br bne nwcAecyScbn337nPpp37nrHdubcbdnAb .racanecbb brdynk .cboscne neRcrbnk y0ycbkroxnrHduboHkcnkude ycbdncddcHeroxx2nlbccn lnkRobscnSckoudcneRcnkRoHkcn lnpAb Acbe2naoyoscnlb ynecbb brdenokednhodnk Hdracbcanbcy ecDnIlecbnwcAecyScbn337nphRrkRnk dednrHdubcbdnoS uenZW3DUnSrxxr 
H7nrHdubcbdnScsoHne nbcoddcddneRcnbrdBDni bnponhRrxcnecbb brdynk .cboscnhodndkobkcDnFcrHdubcbdnhcbcnuHhrxxrHsne nbcrHdubcnA x0prkrcdnrHnubSoHnobcodnAcbkcr.cane nScn.uxHcboSxcne noeeokBDn:bryob2nrHdubcbdnlrxcanpbc(ucdednhreRneRcrbndeoecnrHduboHkcnacAobeycHednl bnAcbyrddr Hne ncXkxuacnecbb b0prdynk .cboscnlb yneRcrbnk yycbkroxnA xrkrcdDq HkcbHcanoS ueneRcnxryrecano.orxoSrxre2n lnecbb brdynk .cboscnrHnRrsR0brdBnobcodnoHanrednryAoken Hne","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"76","totalPages":"205"},"similarity_score":0.8073565},{"content":"Insurance Topics \u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000buraA g g gA uidetuonAsturacc:AAAwwwghhhgcek\rhuidetuon\fturacc:AAok\u0001u\u0002o Insurance\u000e\r\f\u000bura \u000bu \n\n\n\t\b\b\b\tn\u0007\u0006\u0005\baa\u0004ra\u0003\u0004\r\f\u000bura Insurance Topics.esiduaw ƒarke\u0002siChnSepxbNpoJClbkxptbNCosCohW GHnPppPnix braoOdneh nbcdrauoxnyobBcen bsoHrCoer Hd7neRcn‘NInoHaneRcnix braon5rHade bynNHacbhbrerHsnIdd kroer H7nycbscane nSck ycneRcnqrerCcHdOn:b Acbe2nGHduboHkcnq bA boer Hn8q:GqtDnfRcnpix braonq:Gq7nBH hHnodnqrerCcHd7nRodnoneoX0cXcyAendeoeudDnfRrdnlcoeubcncHoSxcdnrenpe nlrHoHkcnx ddnAo2ycHednrHneRcnc.cHen lnonyo9 bnardodecbnS2nrddurHsneoX0cXcyAenpS HadneRoenkobb2nx hnrHecbcdenboecd7neRudnbcaukrHsnlrHoHkrHsnk dedn .cbneRcn2cobdnpS2nRuHabcadn lnyrxxr Hdn lna xxobdDnGHng urdroHo7nl xx hrHsnix braoOdny acx7neRcnpiIGFn:xoHnoHaneRcnq odeoxn:xoHnSckoycneRcng urdroHonqrerCcHdn:b Acbe2nGHdub0poHkcnq bA boer HnrHnPpp4DUBS1ksL1Gcaohes.hkw1BsmdBcBc1G32fhbIaob•lRCobwntChbaIbgthe","metadata":{"filename":"Insurance_Handbook_20103.pdf","page":"74","totalPages":"205"},"similarity_score":0.80728185}]

iwilltry42 commented 4 months ago

This seems to be related to the PDF encoding not being detected (or converted) correctly. According to the MuPDF tools (mutool info), the test file uses WinAnsiEncoding. I'll do some more research on this.
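For context, WinAnsiEncoding is the PDF name for an encoding that largely corresponds to Windows code page 1252. Several of the odd characters in the output above (€, „, …, ‡, Œ, ˜) sit in the 0x80-0x9F range of that code page. The sketch below only illustrates how that byte range decodes under two encodings. It is an observation about the symptom, not a confirmed diagnosis, and the byte values are chosen for illustration rather than taken from the PDF itself.

```python
# Illustration only: these byte values are picked by hand, not read from the PDF.
SAMPLE_CODES = [0x80, 0x84, 0x85, 0x87, 0x8C, 0x98]


def decode_codes(codes, encoding):
    """Decode a list of single-byte codes using the given Python codec."""
    return bytes(codes).decode(encoding)


# Under Windows-1252 (close to PDF WinAnsiEncoding) the bytes become
# visible typographic symbols like the ones seen in the garbled chunks.
winansi_like = decode_codes(SAMPLE_CODES, "cp1252")

# Under Latin-1 the same bytes become invisible C1 control characters.
latin1 = decode_codes(SAMPLE_CODES, "latin-1")

print(winansi_like)
print([hex(ord(ch)) for ch in latin1])
```

If text extraction treats font-specific glyph codes as WinAnsi characters without applying the font's own mapping, symbols like these could plausibly end up mixed into otherwise ordinary words.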

StrongMonkey commented 3 months ago

Btw, this seems to pass with the latest PDF parser. It can be re-tested once the new release is out.

sangee2004 commented 3 months ago

Tested with knowledge v0.1.6. The issue no longer occurs; I was able to retrieve information from this file successfully.
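A re-test like this could also be checked mechanically rather than by eye. The sketch below is a hypothetical helper, not part of the knowledge tool. It assumes the JSON array of sources printed by the retrieve command has been captured into a string, and it flags chunks whose content has a high share of control characters. The function names and the 2% threshold are assumptions. The sample reuses one short garbled chunk and one clean chunk from the original report, with the metadata trimmed to the page number.

```python
import json
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of characters that are control characters, ignoring \\n, \\r and \\t."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
    )
    return bad / len(text)


def flag_unreadable(sources_json: str, threshold: float = 0.02):
    """Return (page, ratio) pairs for sources whose content looks garbled."""
    flagged = []
    for source in json.loads(sources_json):
        ratio = garbled_ratio(source["content"])
        if ratio > threshold:
            flagged.append((source["metadata"]["page"], ratio))
    return flagged


# Raw string so the JSON escapes (\f, \n, \u0005) are parsed by json.loads.
sample = r'''[
  {"content": "ides these co\ferages on a \n\u0005uch roader scae\u0003Business InsuranceInsurance Basics",
   "metadata": {"page": "22"}},
  {"content": "Insurance Handbook A guide to insurance: what it does and how it works",
   "metadata": {"page": "1"}}
]'''

flagged = flag_unreadable(sample)
for page, ratio in flagged:
    print(f"page {page}: {ratio:.1%} control characters")
```

This heuristic only catches control-character debris. Chunks where letters were substituted for other letters, as on pages 74 to 82 of the original output, would need a different check, such as a dictionary-word ratio.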