open-contracting / scrapy-log-analyzer

Provides methods to analyze the quality of a Scrapy crawl
https://scrapy-log-analyzer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add new features #11

Open · yolile opened this issue 4 months ago

yolile commented 4 months ago

The library currently has:

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, https://github.com/open-contracting/kingfisher-collect/issues/531, https://github.com/open-contracting/kingfisher-collect/issues/1058 and https://github.com/open-contracting/kingfisher-collect/issues/1055, I think we need to add the following (a rough sketch of the rate calculations follows the list):

  1. drop_rate: item_dropped_count vs item_scraped_count
  2. duplicate_request_rate: dupefilter/filtered vs downloader/request_count https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
  3. invalid_json_rate: invalid_json_count vs item_scraped_count
  4. finish_reason: cancelled, finished, etc., as documented at https://kingfisher-collect.readthedocs.io/en/latest/logs.html#check-the-reason-for-closing-the-spider
  5. errors_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  6. errors_list: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  7. response_status_counts: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-response-status-codes
  8. exception_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-downloader-exceptions
  9. retry_times_reached: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-requests-for-which-the-maximum-number-of-retries-was-reached
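
For (1)-(3), here is a rough sketch of how the rates could be calculated from the crawl stats. The class and attribute names are placeholders, not the package's actual API, and the exact denominators are open to discussion:

```python
# Rough sketch only: assumes the parsed log exposes the Scrapy crawl stats as
# a dict (here `crawler_stats`); the real attribute name in this package may
# differ, and the denominators below are open to discussion.
class LogMetricsSketch:
    def __init__(self, crawler_stats):
        self.crawler_stats = crawler_stats

    @property
    def drop_rate(self):
        # (1) item_dropped_count vs item_scraped_count
        dropped = self.crawler_stats.get("item_dropped_count", 0)
        scraped = self.crawler_stats.get("item_scraped_count", 0)
        total = dropped + scraped
        return dropped / total if total else 0.0

    @property
    def duplicate_request_rate(self):
        # (2) dupefilter/filtered vs downloader/request_count
        filtered = self.crawler_stats.get("dupefilter/filtered", 0)
        requests = self.crawler_stats.get("downloader/request_count", 0)
        return filtered / requests if requests else 0.0

    @property
    def invalid_json_rate(self):
        # (3) invalid_json_count vs item_scraped_count
        invalid = self.crawler_stats.get("invalid_json_count", 0)
        scraped = self.crawler_stats.get("item_scraped_count", 0)
        return invalid / scraped if scraped else 0.0


# Example usage with made-up numbers:
metrics = LogMetricsSketch({
    "item_dropped_count": 5,
    "item_scraped_count": 95,
    "dupefilter/filtered": 10,
    "downloader/request_count": 200,
    "invalid_json_count": 2,
})
print(metrics.drop_rate, metrics.duplicate_request_rate, metrics.invalid_json_rate)
# -> 0.05, 0.05, ~0.021
```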

@jpmckinney any others? Do you agree with implementing all of them?

jpmckinney commented 4 months ago

In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see logic at:

This might also help decide which of the new methods is most worthwhile to implement.

For example, we might decide to add new methods related to "insufficiently clean".


> error_rate (but using File vs FileError)

Do you mean it should be using something else?

(1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context, and then evaluate. In principle, true duplicates are not necessarily a problem.

(2) Where is the duplicate request rate from?

(3) Maybe we should change the middleware to yield a FileError? That way, invalid JSON gets counted in the error rate. (Invalid JSON does seem like an error.)

(4) I think is_finished is all that matters. The reason itself can be accessed via logparser["crawler_stats"]; we don't need a new method for a simple access (see the sketch after this list).

(5-6) I'm not sure whether we can do anything automatically with this information, but it would be useful to collect and report the messages and counts.

(7) In principle, these should lead to FileError items. I think these might just be for reporting (e.g. logreport prints them out from logparser["crawler_stats"], without needing a new method in this package).

(8) I think this can be covered with (5-6).

(9) This is more to debug retries. In terms of collection quality, URLs that reach the max retries end up being FileErrors. Maybe we just report retry/max_reached and let the user decide whether to investigate.
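
To illustrate (4) and (9), a minimal sketch, assuming the parsed log data looks like logparser's output dict (the exact keys shown are assumptions for illustration):

```python
# Minimal sketch for (4) and (9): read the values straight from the parsed
# crawl stats instead of adding new methods. `logparser_data` stands in for
# the dict this package already builds from the log; the keys shown are
# assumptions for illustration.
logparser_data = {
    "crawler_stats": {
        "finish_reason": "finished",
        "retry/max_reached": 3,
    },
}

crawler_stats = logparser_data["crawler_stats"]
finish_reason = crawler_stats.get("finish_reason", "unknown")  # (4) "finished", "cancelled", etc.
retries_maxed = crawler_stats.get("retry/max_reached", 0)      # (9) report it and let the user decide
print(finish_reason, retries_maxed)
```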

What do you think?

It looks like (5-6) is probably the heaviest. Other than that, maybe we only need to do (1) in this package? Once the code is ready, we can parse the existing logs and see if we want to add any rules to https://github.com/open-contracting/data-registry/issues/29.

yolile commented 4 months ago

(1) Ok.

(2) I've edited the issue, but it's from https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests.

(4) Ok.

(3) and (5-6):

> I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.

We should use this information to report the issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner.
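
As a purely hypothetical illustration of that differentiation (the field names below are illustrative, not kingfisher-collect's actual FileError schema), each error could carry a category so the report to the partner separates invalid JSON from failed requests:

```python
from collections import Counter

# Hypothetical sketch: tag each error with a category when it is recorded, so
# a report to the partner can separate invalid JSON from failed requests.
# Field names and values are illustrative only.
errors = [
    {"url": "http://example.com/1.json", "category": "invalid_json"},
    {"url": "http://example.com/2.json", "category": "http_error", "status": 500},
    {"url": "http://example.com/3.json", "category": "http_error", "status": 404},
]

by_category = Counter(error["category"] for error in errors)
print(by_category)  # Counter({'http_error': 2, 'invalid_json': 1})
```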

(7)(8)(9) ok

jpmckinney commented 4 months ago

Aha, (2) is maybe more of a programming error, whereas (1) is a publisher error. It can be reported for information.

Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating error_rate, we should include invalid JSON.
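
A minimal sketch of what that could look like, assuming invalid JSON items remain File items (per (3) as-is) and using illustrative parameter names rather than this package's actual attributes:

```python
# Sketch only: an error_rate that also counts invalid JSON as errors. Invalid
# JSON items are still yielded as File items (per (3) as-is), so they are
# already included in the total. Parameter names are illustrative.
def error_rate(file_count, file_error_count, invalid_json_count):
    errors = file_error_count + invalid_json_count
    total = file_count + file_error_count
    return errors / total if total else 0.0

print(error_rate(file_count=98, file_error_count=2, invalid_json_count=1))  # 0.03
```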