howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0

Would be nice to be able to pass in "start_urls" #134

Closed · JacobJustice closed this issue 3 years ago

JacobJustice commented 3 years ago

Ruia seems like a brilliant way to write simple and elegant web scrapers, but I can't figure out how to use a different "start_urls" value. I want a web scraper that can check all links on any GIVEN web page, not just the pages the hardcoded start_urls lead to, while keeping the simplicity and asynchronous power that Ruia provides. Maybe this is already a feature, but I can't tell from the documentation or the code.
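
For reference, every Ruia example I've found hardcodes start_urls as a class attribute, roughly like this (a minimal sketch based on the quickstart; the URL is just a placeholder):

```python
from ruia import Spider


class MySpider(Spider):
    # start_urls is fixed at class-definition time; this is the
    # pattern in question.
    start_urls = ['https://docs.python-ruia.org/']

    async def parse(self, response):
        html = await response.text()  # the page body as a string
        print(response.url, len(html))


if __name__ == '__main__':
    MySpider.start()
```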

howie6879 commented 3 years ago

You mean you just enter a URL, such as https://docs.python-ruia.org/, and Ruia automatically returns all the URLs under it? Such as:

- https://docs.python-ruia.org/1
- https://docs.python-ruia.org/2
- https://docs.python-ruia.org/3

I'm not sure I understand what you mean.

JacobJustice commented 3 years ago

Yeah I realize my issue was unclear. Sorry about that.

I can't tell from the docs whether this is already possible, but it'd be nice if start_urls could be passed in as a parameter, like to a constructor.
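
Something like this hypothetical constructor-style API is what I had in mind (not something Ruia offers today, as far as I can tell; it just illustrates the request):

```python
# Hypothetical API illustrating the feature request; this is NOT
# something Ruia currently supports.
spider = MySpider(start_urls=['https://example.com/page-a',
                              'https://example.com/page-b'])
spider.start()
```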

howie6879 commented 3 years ago

Do you just need all the links under a website? That sounds like a feature request, right?

We don't know the specific format of the links under an arbitrary website, so there is no way to extract them automatically, unless all you need is the URLs themselves.

JacobJustice commented 3 years ago

My specific problem is that I am searching for CVs or resumes belonging to a list of names.

I have a dataframe with a name column and five URL columns (the results of googling those names), and I would like to perform a shallow crawl (depth of 2 or 3) on those five URLs, looking for links that are either PDF files or lead to specific domains.

I was hoping to write a generic spider with start_urls set to those five URLs for each name; if any of them hits a CV/resume, the spider would return the link to that CV/resume and move on to the next name. I thought Ruia would be a good match, since it's a relatively simple problem and the async features would really speed up the runtime, but I couldn't figure out how to configure start_urls so that it could be a different set of URLs each time.
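
Roughly the spider I have in mind (a sketch only, not working code: the regex link matching is a placeholder, and I'm assuming the self.request / response.metadata pattern from Ruia's examples):

```python
import re
from urllib.parse import urljoin

from ruia import Spider

HREF = re.compile(r'href="([^"#]+)"')


class CVSpider(Spider):
    # This is the sticking point: I want this list to be different for
    # every name in my dataframe, not fixed at class-definition time.
    start_urls = ['https://example.com/search-result-1',
                  'https://example.com/search-result-2']
    concurrency = 10
    max_depth = 2  # "shallow crawl"

    async def parse(self, response):
        depth = (response.metadata or {}).get('depth', 0)
        html = await response.text()
        for href in HREF.findall(html):
            url = urljoin(response.url, href)
            if url.lower().endswith('.pdf'):
                # Record the candidate CV/resume link.
                print('candidate CV:', url)
            elif url.startswith('http') and depth < self.max_depth:
                # Follow ordinary links one level deeper.
                yield self.request(url=url,
                                   metadata={'depth': depth + 1},
                                   callback=self.parse)


if __name__ == '__main__':
    CVSpider.start()
```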

howie6879 commented 3 years ago

Ruia is a crawler framework: you can use Ruia to build the functionality you describe, but Ruia itself shouldn't ship features that are specific to your use case.

This is your own custom logic; you can use Ruia to implement a general crawler that does what you want.

That's my opinion.

JacobJustice commented 3 years ago

I agree with you that the task itself shouldn't be a single built-in function call or anything like that, as that would defeat the purpose of Ruia remaining a framework.

I was just expressing that I can't use Ruia as it exists for this purpose at all, because start_urls must always be hardcoded.

howie6879 commented 3 years ago

Why can't you use Ruia? Can you give me a code example? I'd like to help you get this working with Ruia.
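
For example, since start_urls is a plain class attribute, I would expect something like this to work for your dataframe (an untested sketch; the CSV file and column names are made up):

```python
import pandas as pd

# Untested sketch: start_urls is a plain class attribute on the Spider
# subclass, so it can be assigned at runtime before calling start().
df = pd.read_csv('names.csv')                                # placeholder input
url_columns = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5']  # made-up names

CVSpider.start_urls = df[url_columns].stack().dropna().unique().tolist()
CVSpider.start()
```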