Closed: drprabhakar closed this issue 9 years ago
Do the links appear in green in the UI when you tick the 'Overlay blocked links' checkbox?
No, the links appear in red even though I have given a pattern in '
First, check whether those links have a nofollow attribute (i.e. rel="nofollow" on the anchor tag).
Secondly, you need to create a middleware to download the PDF links and convert them to HTML.
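For illustration, such a middleware could look like the sketch below. This is an untested outline rather than code from this thread: PdfToHtmlMiddleware is an assumed name, and pdf_to_html() stands in for whatever converter you use (pdfminer, an external tool, or your own script).

from scrapy.http import HtmlResponse

def pdf_to_html(pdf_bytes):
    # Hypothetical converter: plug your own PDF-to-HTML script in here.
    raise NotImplementedError

class PdfToHtmlMiddleware(object):
    def process_response(self, request, response, spider):
        content_type = response.headers.get('Content-Type', b'')
        if b'application/pdf' in content_type:
            # Rewrite the PDF body as HTML so the spider can extract
            # data and links from it like any other page.
            return HtmlResponse(url=response.url,
                                body=pdf_to_html(response.body),
                                encoding='utf-8',
                                request=request)
        return response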
The nofollow attribute is not present on those links. I have a Python script to convert PDF to HTML, which I have used while parsing PDF links in Portia and it works fine. Is there any option to follow PDF links?
I don't see any reason for them not to be followed. What is the URL you are crawling, so that I can investigate?
Currently I am working on this link. I just want Portia to follow the PDF links given under the [Important Downloads] section of the webpage.
The link isn't followed because it is not on the same domain that the spider is crawling.
Please also check this link, in which the crawled pages and the PDF links are on the same domain. The PDF links are under the [Download now:] section.
The second one should work, assuming that your filters are correct and your middleware is correctly placed.
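(For reference, "correctly placed" means registered in the project settings. A minimal sketch, where the module path, class name, and priority value are assumptions to adjust for your own project:)

# settings.py: enable the PDF middleware; the path and priority are
# assumptions that depend on your project layout.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PdfToHtmlMiddleware': 543,
}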
Sorry to say that the second link I have mentioned is not working. While loading the link in Portia, the PDF links are highlighted in red. It seems that those links would not be followed. I have also checked with the option [Follow all in-domain links]; they are still highlighted in red.
As for the initial part, I just want Portia to follow PDF links, confirmed by the PDF links being highlighted in green. In the second part I will convert the PDF to HTML (this is not necessary right now).
In your PDF-to-HTML middleware you need to remove 'pdf' from IGNORED_EXTENSIONS. To do this, add the following to your middleware:
from scrapy.linkextractor import IGNORED_EXTENSIONS
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# Assumption: HtmlLinkExtractor comes from slybot, the library behind
# Portia spiders; adjust the import if yours lives elsewhere.
from slybot.linkextractor import HtmlLinkExtractor

# Remove 'pdf' from the module-level defaults, then rebuild the
# extension set used by the replacement link extractor.
IGNORED_EXTENSIONS.remove('pdf')
_ignored_exts = frozenset(['.' + e for e in IGNORED_EXTENSIONS])

class PdfDownloaderMiddleware(object):
    def __init__(self):
        # ... your existing initialisation here ...
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)

    def spider_opened(self, spider):
        # Swap the Annotations plugin's link extractor for one that no
        # longer ignores the .pdf extension.
        if hasattr(spider, 'plugins') and spider.plugins.get('Annotations'):
            annotations = spider.plugins['Annotations']
            annotations.html_link_extractor = HtmlLinkExtractor(
                ignored_extensions=_ignored_exts)
I haven't tested this code, but the idea is to replace the link extractor with one that will allow PDF links to be followed.
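One thing worth noting about this approach: IGNORED_EXTENSIONS is a module-level list shared by the whole process, so removing 'pdf' affects every link extractor that falls back to the defaults, not only the one swapped in above. A quick sanity check in a Python shell (using the same scrapy.linkextractor module path as the snippet above):

>>> from scrapy.linkextractor import IGNORED_EXTENSIONS
>>> 'pdf' in IGNORED_EXTENSIONS
True
>>> IGNORED_EXTENSIONS.remove('pdf')
>>> 'pdf' in IGNORED_EXTENSIONS
False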
Yes, it's working now after removing 'pdf' from IGNORED_EXTENSIONS. Thanks a lot.
I am deploying my Portia spider in scrapyd. I have given a pattern to be followed in the Crawling section in Portia. After deploying the spider, the links are not following the pattern I have given. How can I fix this issue?