kodejuice / localGoogoo

:mag: A search engine for your offline websites.

MIT License

40 stars 10 forks

pdf #1

Open xuze1993 opened 5 years ago

xuze1993 commented 5 years ago

I've pulled a website with WebHTTrack that is a mix of PDF and HTML files. It seems that localGoogoo can only index HTML files. Is there any way to solve this?

kodejuice commented 5 years ago

The code could be modified to read PDF documents (with a PDF library) while crawling and index them, but a copy of each file would need to be kept so the user can open it from the search results page. That's not good, I think.
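As a rough illustration of the idea above, here is a minimal Python sketch of a crawler that picks a text extractor by file extension and feeds both HTML and PDF text into one inverted index. This is not localGoogoo's actual code: `build_index`, `html_to_text`, `pdf_to_text`, and `DEFAULT_EXTRACTORS` are hypothetical names, and the PDF branch assumes the third-party `pypdf` package is installed.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict, Set


class _TextOnly(HTMLParser):
    """Collects text nodes (script/style contents are not filtered in this sketch)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def html_to_text(path: Path) -> str:
    parser = _TextOnly()
    parser.feed(path.read_text(encoding="utf-8", errors="ignore"))
    parser.close()
    return " ".join(parser.chunks)


def pdf_to_text(path: Path) -> str:
    # Assumption: pypdf is available; imported lazily so HTML-only crawls work without it.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return " ".join((page.extract_text() or "") for page in reader.pages)


# Hypothetical default mapping from file extension to extractor.
DEFAULT_EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".html": html_to_text,
    ".htm": html_to_text,
    ".pdf": pdf_to_text,
}


def build_index(root, extractors=DEFAULT_EXTRACTORS) -> Dict[str, Set[str]]:
    """Walk `root` and map each lowercase word to the set of files containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path in sorted(Path(root).rglob("*")):
        extract = extractors.get(path.suffix.lower())
        if extract is None or not path.is_file():
            continue  # unsupported file type or a directory
        for word in re.findall(r"\w+", extract(path).lower()):
            index[word].add(str(path))
    return index
```

Passing the extractors as a dictionary keeps the crawler itself unaware of file formats, so another format could be added by registering one more function.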

kodejuice commented 5 years ago

Alternatively, we could just use the original PDF link in the search results, but if the original file is no longer available, you won't be able to open it.
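The trade-off between the two options (keep a copy, or link to the original) could be sketched like this in Python. All names here (`Result`, `cache_copy`, `resolve_link`) are hypothetical and only illustrate the behavior discussed, not how localGoogoo stores its results.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Result:
    title: str
    original: Path                      # where the PDF was found while crawling
    cached_copy: Optional[Path] = None  # set only if a copy was kept


def cache_copy(result: Result, cache_dir: Path) -> Path:
    """Option 1: keep a copy so the result stays openable (costs disk space)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name collisions between files with the same name are ignored in this sketch.
    target = cache_dir / result.original.name
    shutil.copy2(result.original, target)
    result.cached_copy = target
    return target


def resolve_link(result: Result) -> Optional[Path]:
    """Option 2 first: link to the original, fall back to a copy if one exists."""
    if result.original.is_file():
        return result.original
    if result.cached_copy is not None and result.cached_copy.is_file():
        return result.cached_copy
    return None  # original is gone and no copy was kept: the result cannot be opened
```

With this shape, the search results page could show the link when `resolve_link` returns a path and mark the entry as unavailable when it returns `None`.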

xuze1993 commented 5 years ago

Gotcha, nice work anyway. Sadly, fewer static sites are left on the web.