Closed rlskoeser closed 2 months ago
I have a crawler written in scrapy that I would love to port to spider-py, but I can't figure out how to duplicate some of the current functionality. If I can get it ported, it will be much faster and should require less custom code.
I see how to get the url, status code, and title for the crawled page; I have a basic crawler that uses a subscription to create a CSV report of crawled pages with those fields. Is there a way to access response headers? My current crawler reports include content type, last modified, and content length when they are set in the response.
I'm also interested in customizing how links are selected: my current report includes assets like images and iframes, and can report when they return error codes along with the referring link / page so they can be fixed.
Are these things feasible currently or in future with spider-py?
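For context, the CSV report described above can be factored into a small helper that is independent of the crawler. This is only a sketch of the reporting side; the field names url, status, and title are the ones mentioned in this question, not a spider-py API:

```python
import csv
import io

def write_crawl_report(rows, fileobj):
    """Write crawled-page records (url, status, title) as CSV rows."""
    writer = csv.DictWriter(fileobj, fieldnames=["url", "status", "title"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Example usage with stand-in data:
buf = io.StringIO()
write_crawl_report(
    [{"url": "https://example.com/", "status": 200, "title": "Example"}],
    buf,
)
print(buf.getvalue())
```

A crawler callback would append one dict per crawled page and hand the accumulated rows to this helper at the end of the run.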
Hi there! Yes, these features are feasible. The response headers are behind a feature flag called "headers" (we can enable this).
We also have a config option called full_resources that allows you to collect everything. Exposing the query selector for the links is not available at the moment.
Cool, let me know how/when I can try these features.
Published in v0.0.42. You can get the response headers from the page and collect full resources. Thank you!
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_full_resources(True)
    await website.crawl()  # crawl the site, collecting all resources

asyncio.run(main())
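With full_resources enabled, asset responses (images, iframes, and the like) surface alongside HTML pages, so the failing ones can be filtered out of the same records. A minimal sketch over plain (url, status) tuples such as a subscription callback might accumulate; this is not a spider-py API:

```python
def failed_resources(records):
    """Return (url, status) pairs whose HTTP status indicates an error."""
    return [(url, status) for url, status in records if status >= 400]

# Stand-in records for illustration:
records = [
    ("https://example.com/", 200),
    ("https://example.com/logo.png", 404),
    ("https://example.com/frame.html", 500),
]
print(failed_resources(records))  # the broken assets to fix
```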
with_full_resources(True) is fantastic!
How do I access the response headers on the page object?
Glad it works well! The header data can be retrieved at page.headers on the Page object.
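Once page.headers is populated, the report fields mentioned earlier (content type, last modified, content length) can be pulled out with a case-insensitive lookup. A sketch assuming the headers arrive as a plain dict; the exact shape spider-py returns may differ:

```python
def pick_headers(headers, wanted=("content-type", "last-modified", "content-length")):
    """Case-insensitive extraction of selected response headers.

    Missing headers come back as empty strings so CSV columns stay aligned.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name, "") for name in wanted}

# Example usage with a stand-in header dict:
print(pick_headers({"Content-Type": "text/html", "Content-Length": "1024"}))
```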
I don't get any content when I try to access page.headers, and I think it's crashing my subscription class; I don't get any output in my CSV report when I try to access the headers.
You are correct, I forgot to port headers to the Page class. Will do this shortly.
Available in v0.0.43 under headers.
Thank you so much!
Thanks for the answers and quick releases!
Feel free to ask any more questions. Have fun with the port!