okfn-brasil / querido-diario

📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
https://queridodiario.ok.org.br/
MIT License

Reduce number of requests made in partial scrapes #247

Closed · rennerocha closed this issue 2 years ago

rennerocha commented 4 years ago

In order to check the execution time of the spiders, I deployed the project on Scrapy Cloud and manually selected a few spiders to run.

During my tests I ran a few spiders with different date restrictions (summarized in the table below).

The results were not good. Even when we restrict the date range of the gazettes we want to download, the number of unnecessary requests makes the spider take longer than acceptable.

Considering that each city site is different, I couldn't think of a general solution that could be applied to all spiders to reduce the number of requests, other than changing every spider and adding some internal logic that filters the requests based on how the data is structured on each site.

| spider | elapsed (min) | date filter |
| --- | --- | --- |
| sp_jundiai | 49 | all |
| sp_jundiai | 13 | last day |
| ba_salvador | 33 | last day |
| to_palmas | 42 | current month |

Is there any other option to improve this?

vitorbaptista commented 4 years ago

Do we know where this time is spent? Are there any unnecessary requests being made? Maybe a quick and dirty way to find potential spiders to optimize is to look at the ratio between the number of requests and the number of items scraped. To scrape a single gazette, how many requests are needed? If a scraper needs a high number of requests per gazette, it might mean that it's making unnecessary requests.

rennerocha commented 4 years ago

> Do we know where this time is spent? Are there any unnecessary requests being made? Maybe a quick and dirty way to find potential spiders to optimize is to look at the ratio between the number of requests and the number of items scraped. To scrape a single gazette, how many requests are needed? If a scraper needs a high number of requests per gazette, it might mean that it's making unnecessary requests.

We definitely send unnecessary requests. I would like to fix it without changing each spider, but I don't think that will be possible given that each city organizes its gazettes in a different way.

The pattern we use to develop all spiders crawls the entire site for all items and yields them. We avoid downloading the gazettes by using an Item Pipeline that drops items older than a specified date (the start_date argument). This is fine when doing a full scrape but is costly for partial scrapes.
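For reference, the pipeline pattern described above looks roughly like this (a minimal sketch; the `date` item field and the exact drop behaviour are assumptions based on this thread):

```python
from scrapy.exceptions import DropItem


class GazetteDateFilteringPipeline:
    """Drops gazettes published before the spider's start_date argument."""

    def process_item(self, item, spider):
        start_date = getattr(spider, "start_date", None)
        if start_date and item["date"] < start_date:
            # The request that produced this item was already made at this point,
            # which is exactly the waste discussed in this issue.
            raise DropItem(f"Gazette from {item['date']} is older than {start_date}")
        return item
```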

I liked the idea of creating an items_returned / requests_made ratio. It can be a monitoring tool that helps us identify problems, but it won't solve the main problem by itself.
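As a sketch of that ratio as a monitoring tool, a small Scrapy extension could compute it from the crawl stats when a spider closes (the extension name and the threshold are placeholders):

```python
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class RequestsPerItemMonitor:
    """Logs a warning when a spider makes too many requests per scraped item."""

    def __init__(self, crawler, threshold=10.0):
        self.crawler = crawler
        self.threshold = threshold
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider):
        stats = self.crawler.stats
        requests = stats.get_value("downloader/request_count", 0)
        items = stats.get_value("item_scraped_count", 0)
        ratio = requests / items if items else float("inf")
        stats.set_value("requests_per_item", ratio)
        if ratio > self.threshold:
            logger.warning("%s: %.1f requests per scraped item", spider.name, ratio)
```

It would be enabled through the EXTENSIONS setting, and the resulting `requests_per_item` stat could then be tracked across scheduled runs.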

ejulio commented 4 years ago

I see two alternatives here:

  1. Using the same codebase, just change the place where the date is checked. Instead of using a Pipeline, use a SpiderMiddleware and process_spider_output. If the output is an Item, perform the date check and return/drop it. Once the first Item has been dropped, stop returning requests. Note that this is only valid if the spider's requests have some sort of order; otherwise, you may already have scheduled some requests that won't be required. If unnecessary requests did get scheduled, you can write a DownloaderMiddleware and raise IgnoreRequest in process_request (which runs when the request leaves the Scheduler and is going to the Downloader) after the first Item was dropped because of an old date (see the sketch at the end of this comment).

  2. Use some sort of external storage, such as HCF, and filter requests by their fingerprints. This way we can always handle incremental crawls and do not need to handle dates. It does, however, require this external storage feature. HCF is an example; Frontera could also be used, but that would require setting up the whole infrastructure.

Hope the ideas are clear and make sense.
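A rough sketch of what alternative 1 could look like, assuming the spider yields gazettes in descending date order and that items carry a `date` field (all class names below are illustrative):

```python
from scrapy import Request
from scrapy.exceptions import IgnoreRequest


class StopAfterOldGazetteSpiderMiddleware:
    """Drops items older than start_date and stops forwarding new requests
    once the first old item is seen."""

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, Request):
                # Stop scheduling new requests after crossing the date boundary.
                if not getattr(spider, "_past_start_date", False):
                    yield entry
            elif entry["date"] < spider.start_date:
                # First item that is too old: drop it and remember the boundary.
                spider._past_start_date = True
            else:
                yield entry


class DropStaleRequestsDownloaderMiddleware:
    """Discards requests that were already scheduled before the boundary was detected."""

    def process_request(self, request, spider):
        if getattr(spider, "_past_start_date", False):
            raise IgnoreRequest("start_date boundary already reached")
```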

victor-torres commented 4 years ago

Can't you use some kind of cache [middleware] or memoization in RAM? Another option is using the Collections API to store and fetch previously seen items.

jvanz commented 4 years ago
> 2. Use some sort of external storage, such as [HCF](https://blog.scrapinghub.com/scrapy-cloud-secrets-hub-crawl-frontier-and-how-to-use-it), and filter requests by their fingerprints. This way we can always handle incremental crawls and do not need to handle dates. It does, however, require this external storage feature. HCF is an example; Frontera could also be used, but that would require setting up the whole infrastructure.

I think that's what I was trying to explain in my previous comments on issues #219 and #172: find a way to generate a unique ID for each request and somehow drop the request if it has already been made in the past. That way, we would not need to change each spider individually.
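A rough sketch of that idea, using a local JSON file as a stand-in for the external storage (the middleware name, the file-based storage, and the hand-rolled fingerprint are all illustrative; a real setup would use HCF or Scrapy's own request fingerprinting utilities):

```python
import hashlib
import json
from pathlib import Path

from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class SeenRequestsDownloaderMiddleware:
    """Skips requests whose fingerprint was already seen in a previous run."""

    def __init__(self, path="seen_requests.json"):
        self.path = Path(path)
        self.seen = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        fingerprint = hashlib.sha1(
            f"{request.method} {request.url} {request.body!r}".encode()
        ).hexdigest()
        if fingerprint in self.seen:
            raise IgnoreRequest("request already made in a previous run")
        self.seen.add(fingerprint)

    def spider_closed(self, spider):
        # Persist the fingerprints so the next run can skip them.
        self.path.write_text(json.dumps(sorted(self.seen)))
```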

rennerocha commented 4 years ago

Any solution that relies on storing information about the request will fail when we are working with pages like this (basically any site with pagination):

https://imprensaoficial.jundiai.sp.gov.br/ (Jundiaí)

The request fingerprint will be the same (same URL with no parameters), but its content changes every day, so we need to request it in order to decide whether we want the items available there or not (so we can drop them using the already existing pipeline).

In this case, we should be able to send this first request and, based on its content, decide whether or not to continue following the pagination. This requires knowledge of how the response content is structured, which varies between spiders.
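For the Jundiaí-style case, the decision could look roughly like this inside a spider's parse method (the selectors, the placeholder item dict, and the dateparser usage are hypothetical; the point is only the "keep paginating while dates are in range" check, and start_date is assumed to already be a date object passed as a spider argument):

```python
import dateparser
import scrapy


class PaginatedGazetteSpider(scrapy.Spider):
    """Illustrative skeleton: follows pagination only while gazettes are in range."""

    name = "example_paginated"
    start_urls = ["https://imprensaoficial.jundiai.sp.gov.br/"]

    def parse(self, response):
        dates_on_page = []
        for entry in response.css(".gazette-entry"):  # hypothetical selector
            date = dateparser.parse(entry.css(".date::text").get()).date()
            dates_on_page.append(date)
            if date >= self.start_date:
                # Placeholder item; the real spider would build a Gazette item here.
                yield {"date": date, "url": entry.css("a::attr(href)").get()}

        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        # Only keep paginating if this page still contains gazettes newer than start_date.
        if next_page and dates_on_page and min(dates_on_page) >= self.start_date:
            yield response.follow(next_page, callback=self.parse)
```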

Another kind of spider is rn_natal. There we are able to send requests for specific months, so if we have a start_date we could update start_requests and send only the requests needed for the dates we are interested in, avoiding extra requests.
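For the rn_natal-style case, start_requests could generate only the months from start_date onward (the URL pattern and spider name below are made up):

```python
from datetime import date

import scrapy


class MonthlyGazetteSpider(scrapy.Spider):
    """Illustrative skeleton: only requests the months covered by start_date."""

    name = "example_monthly"

    def start_requests(self):
        current = date(self.start_date.year, self.start_date.month, 1)
        today = date.today()
        while current <= today:
            yield scrapy.Request(
                f"https://example.gov.br/gazettes?month={current.month}&year={current.year}",
                callback=self.parse,
            )
            # Advance to the first day of the next month.
            current = date(current.year + (current.month == 12), current.month % 12 + 1, 1)

    def parse(self, response):
        # Parsing of the monthly listing would go here.
        ...
```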

Unfortunately I can't see any other solution than working case by case and updating the spiders to use the features available on each site to filter the requests.

ejulio commented 4 years ago

So, solution 1 that I shared doesn't rely on any fingerprint and should handle these cases, given there is some sort of order to the requests. If you spawn requests for all months in start_requests, then it will only work partially. Still, it's just a matter of updating the spiders that fail to follow this API.

Solution 2 should also work, but you'd need to update the fingerprint method to account for some other factor. Or you can set dont_filter, in which case the request will be processed again even if its fingerprint has already been seen.

rennerocha commented 4 years ago

Solution 1 requires that we have some "order" in the requests, which is not desirable (we want to process as many requests in parallel as possible to make things faster). Changing the API of the spiders is acceptable, but we will certainly need to work case by case, so a big change across all spiders will be required.

Solution 2 may work, but it will require a specific fingerprint for each spider (we would need to generate fingerprints specific to the site we are working with), so we will need to change the spiders case by case as well.

In the end, I don't see how to solve this problem without working on each spider and trying to improve them case by case :stuck_out_tongue: And since we will be reviewing the code of each spider anyway, we will be able to introduce smarter logic inside them instead of doing hacks in middlewares to make them work.

ogecece commented 4 years ago

I really liked Solution 1 proposed by @ejulio. I sketched something that could be done about it, making a minimal number of unnecessary requests without sacrificing performance. But that would require the spiders to create requests in ascending or descending date order, at least. This is important because it would let us do a kind of binary search over the created requests, messing with the scheduler, and discard the unnecessary ones mid-crawl. I didn't give it any more thought than that because it would require refactoring anyway, and I don't know how that would play out with pagination or some unconventional gazette display order.

One other thing that came to mind, since we are talking about refactoring anyway: inserting a tag like gazette_date in the request meta whenever we have this kind of information, and then filtering requests out with a middleware (possibly solving #172 with an end_date attribute). Maybe it could be easier to refactor, and I think it is beginner-friendly. That would require telling contributors to always insert this tag whenever the site displays this kind of information, though. What do you think?
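A rough sketch of how such a middleware could look (the gazette_date meta key and the middleware name are part of this proposal, not existing project code):

```python
from scrapy.exceptions import IgnoreRequest


class GazetteDateRangeDownloaderMiddleware:
    """Drops requests tagged with a gazette_date outside [start_date, end_date]."""

    def process_request(self, request, spider):
        gazette_date = request.meta.get("gazette_date")
        if gazette_date is None:
            return None  # no date information available, let the request through
        start_date = getattr(spider, "start_date", None)
        end_date = getattr(spider, "end_date", None)
        if (start_date and gazette_date < start_date) or (
            end_date and gazette_date > end_date
        ):
            raise IgnoreRequest(f"{gazette_date} is outside the requested date range")
```

Spiders would then tag requests whenever the listing shows the date, e.g. `yield response.follow(url, meta={"gazette_date": parsed_date})`.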

rennerocha commented 4 years ago

> I really liked Solution 1 proposed by @ejulio. I sketched something that could be done about it, making a minimal number of unnecessary requests without sacrificing performance. But that would require the spiders to create requests in ascending or descending date order, at least. This is important because it would let us do a kind of binary search over the created requests, messing with the scheduler, and discard the unnecessary ones mid-crawl. I didn't give it any more thought than that because it would require refactoring anyway, and I don't know how that would play out with pagination or some unconventional gazette display order.

The solution you are proposing may be suitable for some sites and not for others. Some sites are more easily filtered when we have some information about the dates that a request will return (for example, the ones that take a POST request with month/year parameters). Pagination is a bit more complicated, and we will probably have to decide whether to request the next page only after checking the items that were returned. So it will depend on how the site displays the gazettes, and we can then choose the best solution for each case. There is no one-size-fits-all solution.

> One other thing that came to mind, since we are talking about refactoring anyway: inserting a tag like gazette_date in the request meta whenever we have this kind of information, and then filtering requests out with a middleware (possibly solving #172 with an end_date attribute). Maybe it could be easier to refactor, and I think it is beginner-friendly. That would require telling contributors to always insert this tag whenever the site displays this kind of information, though. What do you think?

We already have a date filter in place. The start_date argument can be passed to the spider and is used by GazetteDateFilteringPipeline to avoid downloading gazettes from dates we don't want. We can use that information to make the spiders smarter and let them decide whether to send a new request or not.

In my opinion, the way to go is to revisit all spiders and improve their code to use start_date for filtering inside the spider itself. I know the spiders will become a bit more complicated, with more logic needed before sending new requests, but we need to focus on making the spiders useful and ready for daily execution rather than on being beginner-friendly (which will not affect how we help beginners start contributing to the project).

We can start by reviewing all the existing capital spiders, so we can send them to production faster.

ogecece commented 4 years ago

> We already have a date filter in place. The start_date argument can be passed to the spider and is used by GazetteDateFilteringPipeline to avoid downloading gazettes from dates we don't want. We can use that information to make the spiders smarter and let them decide whether to send a new request or not.

But that only avoids downloading the documents, based on the item. I'm suggesting that we could also add the date (when it is available) to the meta of the requests themselves.

If we use Jundiaí as an example, the spider produces three types of output:

  1. pagination requests
  2. page detail requests
  3. items

The items are already covered by the pipeline. Since the website displays the date information, we could cover the page detail requests with this method and maybe the pagination too, depending on the implementation.

Another example would be Natal, where all requests are made at the start; we could filter them too by inserting a date into the meta and checking it against start_date.

> In my opinion, the way to go is to revisit all spiders and improve their code to use start_date for filtering inside the spider itself. I know the spiders will become a bit more complicated, with more logic needed before sending new requests, but we need to focus on making the spiders useful and ready for daily execution rather than on being beginner-friendly (which will not affect how we help beginners start contributing to the project).

Sure. Since you are already doing this kind of task and looking at the spiders right now, I think you have a better perspective on whether this proposal would be applicable, and whether it would be faster or slower to implement.

Thinking about it, dates in the requests could also open the door to other things. Maybe that's the kind of information we could keep in an external storage to later implement a cache, as discussed in #219.

rennerocha commented 2 years ago

The use of start_date and end_date, together with monitors that check the ratio of requests made to items returned and proper code review, should be enough to keep the number of requests low in the majority of spiders.