Links are not following while deploying Portia Spider

scrapinghub / portia

Visual scraping for Scrapy

BSD 3-Clause "New" or "Revised" License

Links are not following while deploying Portia Spider #208

Closed drprabhakar closed 9 years ago

drprabhakar commented 9 years ago

I am deploying my Portia spider to scrapyd. I have set a link pattern to follow in the Crawling section of Portia, but when the deployed spider runs, it does not follow links matching that pattern. How can I fix this?

ruairif commented 9 years ago

Do the links appear in green in the UI when you tick the overlay blocked links checkbox?

drprabhakar commented 9 years ago

No, the links appear in red even though I have given the pattern in ''. I forgot to mention that the links point to PDFs. Are any changes required at the Portia source level to follow PDF links?

ruairif commented 9 years ago

First, check whether those links have a nofollow attribute. Second, you need to create a middleware to download the PDF links and convert them to HTML.
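The nofollow check can be done by hand on the page source. The following is a minimal sketch using only the Python standard library, not Portia's own code; the names `NofollowFinder` and `find_nofollow_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class NofollowFinder(HTMLParser):
    """Collect href values of <a> tags whose rel attribute contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "noopener nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and attr_map.get("href"):
            self.nofollow_links.append(attr_map["href"])


def find_nofollow_links(html):
    """Return the hrefs of all nofollow links found in an HTML string."""
    finder = NofollowFinder()
    finder.feed(html)
    return finder.nofollow_links


# Hypothetical sample markup: only the first link carries rel="nofollow".
sample = '<a href="/a.pdf" rel="nofollow">A</a><a href="/b.pdf">B</a>'
print(find_nofollow_links(sample))
```

If the PDF links show up in the returned list, the nofollow attribute is a likely reason they are skipped.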

drprabhakar commented 9 years ago

The [no follow] attribute is not present on those links. I have a Python script to convert PDF to HTML, which I use while parsing PDF links in Portia, and it works fine. Is there any option to follow PDF links?

ruairif commented 9 years ago

I don't see any reason for them not to be followed. What is the URL you are crawling, so that I can investigate?

drprabhakar commented 9 years ago

Currently I am working on this link. I just want Portia to follow the PDF links given under the [Important Downloads] section of the webpage.

ruairif commented 9 years ago

The link isn't followed because it is not on the same domain that the spider is crawling.
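A quick way to see whether a link falls outside the crawled domain is to compare hostnames. This is a rough sketch with the standard library only; `is_same_domain` and the example.com URLs are illustrative, and Portia's real in-domain check may treat subdomains differently from this exact-host comparison.

```python
from urllib.parse import urljoin, urlparse


def is_same_domain(page_url, link_href):
    """Return True if link_href, resolved against page_url, has the same hostname."""
    absolute = urljoin(page_url, link_href)  # resolves relative links such as /files/a.pdf
    return urlparse(absolute).hostname == urlparse(page_url).hostname


# Hypothetical URLs: a relative link stays on the host, an absolute one may not.
print(is_same_domain("http://example.com/page", "/files/a.pdf"))
print(is_same_domain("http://example.com/page", "http://cdn.example.org/a.pdf"))
```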

drprabhakar commented 9 years ago

Please also check this link, where the crawled page and the PDF links are on the same domain. The PDF links are under the [Download now:] section.

ruairif commented 9 years ago

The second one should work, assuming that your filters are correct and your middleware is correctly placed.

drprabhakar commented 9 years ago

Sorry to say that the second link I mentioned is not working. When I load the link in Portia, the PDF links are highlighted in red, so it seems that those links will not be followed. I have also tried the [Follow all in-domain links] option, and they are still highlighted in red.

As a first step, I just want Portia to follow the PDF links, confirmed by the PDF links being highlighted in green. As a second step I will convert the PDFs to HTML (this is not necessary right now).

ruairif commented 9 years ago

In your pdf to html middleware you need to remove pdf from IGNORED_EXTENSIONS.

To do this add the following to your middleware:

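# Newer Scrapy releases expose IGNORED_EXTENSIONS from scrapy.linkextractors.
from scrapy.linkextractor import IGNORED_EXTENSIONS
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# Assumed import path for slybot's link extractor; the original snippet did not show it.
from slybot.linkextractor import HtmlLinkExtractor

# Drop 'pdf' from Scrapy's shared list (this mutates it in place),
# then build the dotted extension set that the link extractor should skip.
IGNORED_EXTENSIONS.remove('pdf')
_ignored_exts = frozenset(['.' + e for e in IGNORED_EXTENSIONS])

class PdfDownloaderMiddleware:
    def __init__(self, ...):
        ...
        # Call spider_opened when the spider starts.
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)

    def spider_opened(self, spider):
        # plugins is an attribute of the spider, so use attribute access, not spider['plugins'].
        if hasattr(spider, 'plugins') and spider.plugins.get('Annotations'):
            annotations = spider.plugins.get('Annotations')
            # Swap in a link extractor that no longer ignores .pdf links.
            annotations.html_link_extractor = HtmlLinkExtractor(ignored_extensions=_ignored_exts)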

I haven't tested this code, but the idea is to replace the link extractor with one that allows PDF links to be followed.
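To illustrate why removing pdf from the ignored list changes which links get followed, here is a small standalone sketch of extension-based filtering. It does not import Scrapy: the `IGNORED_EXTENSIONS` list below is a shortened, hypothetical stand-in for Scrapy's real list, and `has_ignored_extension` is a made-up helper, not a Scrapy or slybot function.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for Scrapy's IGNORED_EXTENSIONS list.
IGNORED_EXTENSIONS = ["pdf", "zip", "jpg", "png"]


def has_ignored_extension(url, ignored_extensions):
    """Return True if the URL path ends in one of the ignored extensions."""
    dotted = {"." + ext for ext in ignored_extensions}
    # Only the path is inspected, so query strings do not affect the result.
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in dotted


url = "http://example.com/docs/report.pdf"

# With 'pdf' in the list, the link is filtered out.
before = has_ignored_extension(url, IGNORED_EXTENSIONS)

# Build a copy without 'pdf' instead of mutating the shared list.
allowed = [ext for ext in IGNORED_EXTENSIONS if ext != "pdf"]
after = has_ignored_extension(url, allowed)

print(before, after)
```

Building a filtered copy, as in the sketch, avoids changing the list for every other link extractor in the same process; the in-place `remove('pdf')` shown earlier in this comment affects all of them.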

drprabhakar commented 9 years ago

Yes, it's working now after removing "pdf" from "IGNORED_EXTENSIONS". Thanks a lot.