[Suggestion] Filter non-HTML pages from the collector

Hi,

Hope you are all well!

It would be useful to exclude non-HTML pages from being indexed into Elasticsearch, or to add a MIME type detector that identifies images and PDF documents and hands them to dedicated sub-processing for binary types.

For now, I just added:

        if title == "" {
            spider.Logger.Error(errors.New("not an html page"))
            return
        }
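
The empty-title check is only a heuristic (an HTML page can legitimately have no title), so here is a rough sketch of what the MIME type detection could look like. It uses only the Go standard library (`mime.ParseMediaType` and `http.DetectContentType`). The `classify` function and its return values are hypothetical names for illustration, not part of tor-spider, and I am assuming the collector can hand over the `Content-Type` header and the first bytes of the body.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// classify is a hypothetical helper (not part of tor-spider).
// It returns a coarse kind for a response: "html", "image", "pdf" or "other".
// The Content-Type header is preferred; if it is missing or malformed,
// the body is sniffed with http.DetectContentType (which looks at up to 512 bytes).
func classify(contentType string, body []byte) string {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		// DetectContentType always returns a valid MIME type,
		// so the error from this second parse can be ignored.
		mediaType, _, _ = mime.ParseMediaType(http.DetectContentType(body))
	}

	switch {
	case mediaType == "text/html", mediaType == "application/xhtml+xml":
		return "html"
	case strings.HasPrefix(mediaType, "image/"):
		return "image"
	case mediaType == "application/pdf":
		return "pdf"
	default:
		return "other"
	}
}

func main() {
	// Header present: trusted as-is.
	fmt.Println(classify("text/html; charset=utf-8", nil))
	// Header missing: fall back to sniffing the body.
	fmt.Println(classify("", []byte("%PDF-1.7\n")))
}
```

With something like this, the collector could index only the "html" kind into Elasticsearch and either skip the rest or route "image" and "pdf" to their own handlers.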

Cheers, X
