howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0
1.75k stars 181 forks

Scrape multiple websites & save results in a database [Question] #59

Closed hubitor closed 5 years ago

hubitor commented 5 years ago

What are the possible options for scraping multiple websites e.g. through a list or a file and saving the results in a database?

howie6879 commented 5 years ago

Can you post a more detailed question? Maybe this script will help you: https://github.com/howie6879/ruia/blob/master/examples/topics_examples/hacker_news_spider.py

howie6879 commented 5 years ago

If you want to save the results in MongoDB, there is an example:

click here

hubitor commented 5 years ago

I've never used asyncio or any of the asynchronous libraries in Python before, and from what I've read, the normal synchronous libraries cannot be used with asyncio; that's why I'm asking. I don't have any code yet that I could post. I'm just investigating the options.
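Since asyncio is new to you, here is a minimal stdlib-only sketch of the pattern Ruia builds on: one coroutine per URL, all driven by a single event loop. The `fetch` coroutine here is a hypothetical stand-in that only simulates network latency; a real spider would call an async HTTP client instead.

```python
import asyncio

# Hypothetical fetch coroutine standing in for a real async HTTP
# client; it just simulates I/O latency with a sleep.
async def fetch(url: str) -> str:
    await asyncio.sleep(0.01)  # simulated network round-trip
    return f"<html>content of {url}</html>"

async def main(urls):
    # gather() schedules all fetches concurrently on one event loop,
    # so total time is roughly one round-trip, not one per URL.
    return await asyncio.gather(*(fetch(u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(main(["https://a.example", "https://b.example"]))
    print(len(pages))  # prints 2
```

The key difference from threading is that the coroutines cooperate on one thread: while one awaits the network, the loop runs the others.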

Can you post a more detailed question? Maybe this script will help you

What do you mean by this? Would it be possible to put a few thousand URLs in the start_urls list? And would it be possible, or make sense, to combine Celery with ruia?

So there is currently support only for MongoDB?

howie6879 commented 5 years ago

If there are too many links to crawl, I suggest generating them inside the parse function instead of listing them all in start_urls.
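To illustrate the "too many links" concern with a stdlib-only sketch (this is the underlying asyncio idea, not Ruia's actual API): rather than firing thousands of requests at once, a semaphore caps how many run concurrently. `fetch` is again a hypothetical stand-in for a real request.

```python
import asyncio

CONCURRENCY = 100  # cap on simultaneous in-flight requests

async def fetch(url: str) -> str:
    # Placeholder for a real async HTTP request.
    await asyncio.sleep(0.001)
    return url.upper()

async def bounded_fetch(sem: asyncio.Semaphore, url: str) -> str:
    async with sem:  # at most CONCURRENCY fetches hold the semaphore
        return await fetch(url)

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    return await asyncio.gather(*(bounded_fetch(sem, u) for u in urls))

if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(2000)]
    results = asyncio.run(crawl(urls))
    print(len(results))  # prints 2000
```

With this shape, a few thousand URLs in one list is workable; the semaphore keeps you from overwhelming the target site or your own sockets, which is the same effect as yielding requests gradually from a parse function.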

Motor_ruia is a plugin that I wrote. For other databases, just use a supported third-party async library; for example, for MySQL you can consider aiomysql. For more libraries, click here.
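If a database has no async driver at all, one common workaround is to push the blocking call into a worker thread so the event loop stays responsive. A minimal sketch using stdlib `sqlite3` and `asyncio.to_thread` (Python 3.9+); the table schema here is invented purely for illustration.

```python
import asyncio
import sqlite3

def save_items(db_path: str, items) -> int:
    # Blocking sqlite3 calls; this function runs in a worker thread.
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (url TEXT, title TEXT)"
        )
        conn.executemany("INSERT INTO results VALUES (?, ?)", items)
        count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
    return count

async def store(db_path: str, items) -> int:
    # asyncio.to_thread (Python 3.9+) runs the blocking insert in a
    # thread pool, so other coroutines keep running meanwhile.
    return await asyncio.to_thread(save_items, db_path, items)

if __name__ == "__main__":
    scraped = [("https://example.com/1", "Title 1"),
               ("https://example.com/2", "Title 2")]
    print(asyncio.run(store(":memory:", scraped)))  # prints 2
```

A native async driver (motor, aiomysql, asyncpg) is still preferable under heavy load, since threads add overhead; the thread offload is just the fallback when none exists.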

Shouldn't it be up to the user whether it makes sense to combine Celery with Ruia? Do you think it makes sense to combine Celery with asyncio?

howie6879 commented 5 years ago

If you are interested in asynchronous programming, I think asyncio is a good choice.

hubitor commented 5 years ago

OK, thanks. I'll look into it.