Closed rlskoeser closed 2 months ago
I have a crawler written in scrapy that I would love to port to spider-py, but I can't figure out how to duplicate some of the current functionality. If I can get it ported, it will be much faster and should require less custom code.
I see how to get the url, status code, and title for the crawled page; I have a basic crawler that uses a subscription to create a CSV report of crawled pages with those fields. Is there a way to access response headers? My current crawler reports include content type, last modified, and content length when they are set in the response.
I'm also interested in customizing how links are selected: my current report includes assets like images and iframes, and can report when they return error codes along with the referring link / page so they can be fixed.
Are these things feasible currently or in future with spider-py?
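For context, the CSV report described above can be factored into a small helper that is independent of the crawler. This is only a sketch of the reporting side; the field names url, status, and title are the ones mentioned in this question, not a spider-py API:

```python
import csv
import io

def write_crawl_report(rows, fileobj):
    """Write crawled-page records (url, status, title) as CSV rows."""
    writer = csv.DictWriter(fileobj, fieldnames=["url", "status", "title"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Example usage with stand-in data:
buf = io.StringIO()
write_crawl_report(
    [{"url": "https://example.com/", "status": 200, "title": "Example"}],
    buf,
)
print(buf.getvalue())
```

A crawler callback would append one dict per crawled page and hand the accumulated rows to this helper at the end of the run.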
Hi there! Yes, these features are feasible. The response headers are behind a feature flag called "headers" (we can enable this).
We also have a config option called full_resources that allows you to collect everything. Exposing the query selector for the links is not available at the moment.
Cool, let me know how/when I can try these features.
Published in v0.0.42. You can get the response headers from the page and collect full resources. Thank you!
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_full_resources(True)
    await website.crawl()  # crawl the site, collecting all resources

asyncio.run(main())
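With full_resources enabled, asset responses (images, iframes, and the like) surface alongside HTML pages, so the failing ones can be filtered out of the same records. A minimal sketch over plain (url, status) tuples such as a subscription callback might accumulate; this is not a spider-py API:

```python
def failed_resources(records):
    """Return (url, status) pairs whose HTTP status indicates an error."""
    return [(url, status) for url, status in records if status >= 400]

# Stand-in records for illustration:
records = [
    ("https://example.com/", 200),
    ("https://example.com/logo.png", 404),
    ("https://example.com/frame.html", 500),
]
print(failed_resources(records))  # the broken assets to fix
```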
with_full_resources(True) is fantastic!
How do I access the response headers on the page object?
Glad it works well! The header data can be retrieved at page.headers on the Page object.
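Once page.headers is populated, the report fields mentioned earlier (content type, last modified, content length) can be pulled out with a case-insensitive lookup. A sketch assuming the headers arrive as a plain dict; the exact shape spider-py returns may differ:

```python
def pick_headers(headers, wanted=("content-type", "last-modified", "content-length")):
    """Case-insensitive extraction of selected response headers.

    Missing headers come back as empty strings so CSV columns stay aligned.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name, "") for name in wanted}

# Example usage with a stand-in header dict:
print(pick_headers({"Content-Type": "text/html", "Content-Length": "1024"}))
```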
I don't get any content when I try to access page.headers, and I think it's crashing my subscription class; I don't get any output in my CSV report when I try to access the headers.
You are correct, I forgot to port headers to the Page class. Will do this shortly.
Available in v0.0.43 under headers.
Thank you so much!
Thanks for the answers and quick releases!
Feel free to ask any more questions. Have fun with the port!