infinilabs / crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

PDFs not showing up in the search results #39

Closed: Nationwidechildrens closed this issue 5 years ago

Nationwidechildrens commented 5 years ago

I can see some data about PDFs in the Elasticsearch gopa-index and gopa-snapshot.

GET /gopa-index/_search
{
  "query": {
    "multi_match": {
      "query": ".pdf",
      "fields": ["snapshot.ext", "snapshot.url"]
    }
  }
}

This returns about 2391 hits. But if I search for any of the metadata in a PDF, I don't get any PDFs in the results. I have even tried using words from the filename, and the results still do not show any PDFs.
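
For anyone who wants to repeat this check from a script, here is a minimal Python sketch that builds the same multi_match request body. The helper name `build_multi_match` is hypothetical; only the query string and the two field names come from the report. Sending the body to a cluster is left out, since the host and client setup are not given in this thread.

```python
import json


def build_multi_match(query, fields):
    """Return an Elasticsearch multi_match request body as a plain dict.

    Hypothetical helper for illustration; it is not part of the crawler.
    """
    return {"query": {"multi_match": {"query": query, "fields": list(fields)}}}


# Field names taken from the query in the report.
PDF_FIELDS = ["snapshot.ext", "snapshot.url"]

body = build_multi_match(".pdf", PDF_FIELDS)

# Print the JSON that would be sent to /gopa-index/_search.
print(json.dumps(body, indent=2))
```

Swapping ".pdf" for a word from a PDF's filename or metadata gives the second kind of search described in the report, which is the one that returned no PDFs.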

medcl commented 5 years ago

Hi @Nationwidechildrens, PDFs are currently not processed, but it is doable for sure.

medcl commented 5 years ago

@Nationwidechildrens the parse_pdf joint has been pushed to master, feel free to try it out.