samirettali / tor-spider

A spider for Hidden Services

[Suggestion] Filter non-HTML pages from collector #7

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi,

Hope you are all well!

It would be useful to exclude non-HTML pages from the content indexed into Elasticsearch, or to add a MIME type detector that identifies images and PDF documents and routes them to dedicated sub-processing for binary types.

For now, I just added:

        // Workaround: treat a page with an empty title as non-HTML and skip it.
        if title == "" {
            spider.Logger.Error(errors.New("not an html page"))
            return
        }
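
The empty-title check also skips real HTML pages that simply lack a `<title>`, so a MIME-based check may be more reliable. Below is a minimal sketch of what such a detector could look like, using only the Go standard library. The helper name `isHTML` and its parameters are hypothetical and not part of tor-spider; it assumes the caller can supply the response's `Content-Type` header value and the first bytes of the body.

```go
package main

import (
	"mime"
	"net/http"
)

// isHTML is a hypothetical helper (not part of tor-spider). It reports
// whether a response looks like HTML. It trusts a parseable Content-Type
// header first and falls back to sniffing the body when the header is
// missing or malformed.
func isHTML(contentType string, body []byte) bool {
	if contentType != "" {
		// ParseMediaType drops parameters such as "; charset=utf-8"
		// and returns the media type lowercased.
		if mediaType, _, err := mime.ParseMediaType(contentType); err == nil {
			return mediaType == "text/html" || mediaType == "application/xhtml+xml"
		}
	}
	// DetectContentType inspects at most the first 512 bytes of body.
	sniffed := http.DetectContentType(body)
	mediaType, _, err := mime.ParseMediaType(sniffed)
	return err == nil && mediaType == "text/html"
}
```

With something like this in place, the collector could skip indexing whenever `isHTML` returns false, and the sniffed type could later be used to dispatch images or PDFs to their own processing. How the header and body are obtained depends on the crawling library in use.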

Cheers, X