It would be useful to exclude non-HTML pages from being indexed into Elasticsearch, or to add a MIME type detector that recognizes images and PDF documents and hands them to dedicated sub-processing for binary types.
For now, I just added:
// An empty title is treated as a sign that the page is not HTML.
if title == "" {
	spider.Logger.Error(errors.New("not an html page"))
	return
}
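As a rough idea of what the MIME type check could look like, here is a minimal sketch using only the Go standard library. The helper name `classifyContent` and the category strings are hypothetical and not part of the project; the sketch assumes the spider has access to the response's `Content-Type` header and the first bytes of the body.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// classifyContent is a hypothetical helper (not part of the project).
// It returns "html", "image", "pdf", or "other" for a fetched resource.
func classifyContent(header string, body []byte) string {
	// Prefer the Content-Type header sent by the server.
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil || mediaType == "" {
		// Header missing or malformed: sniff the body instead.
		// http.DetectContentType looks at no more than the first 512 bytes.
		mediaType, _, _ = mime.ParseMediaType(http.DetectContentType(body))
	}

	switch {
	case mediaType == "text/html" || mediaType == "application/xhtml+xml":
		return "html"
	case strings.HasPrefix(mediaType, "image/"):
		return "image"
	case mediaType == "application/pdf":
		return "pdf"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classifyContent("text/html; charset=utf-8", nil))
	fmt.Println(classifyContent("", []byte("%PDF-1.7\n")))
}
```

With something like this, only the "html" case would go to the Elasticsearch indexer, and the "image" and "pdf" cases could be routed to their own handlers instead of relying on the empty title.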
Hi,
Hope you are all well!
Cheers, X