open-contracting / scrapy-log-analyzer

Provides methods to analyze the quality of a Scrapy crawl
https://scrapy-log-analyzer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add new features #11

Open · yolile opened this issue 4 months ago

yolile commented 4 months ago

The library currently has:

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, https://github.com/open-contracting/kingfisher-collect/issues/531, https://github.com/open-contracting/kingfisher-collect/issues/1058 and https://github.com/open-contracting/kingfisher-collect/issues/1055, I think we need to add the following (a rough sketch of the rate calculations follows the list):

  1. drop_rate: item_dropped_count vs item_scraped_count
  2. duplicate_request_rate: dupefilter/filtered vs downloader/request_count https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
  3. invalid_json_rate: invalid_json_count vs item_scraped_count
  4. finish_reason: cancelled, finished, etc., as documented at https://kingfisher-collect.readthedocs.io/en/latest/logs.html#check-the-reason-for-closing-the-spider
  5. errors_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  6. errors_list: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-messages
  7. response_status_counts: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-error-response-status-codes
  8. exception_count: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-numbers-of-downloader-exceptions
  9. retry_times_reached: https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-requests-for-which-the-maximum-number-of-retries-was-reached
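
For (1)-(3), here is a rough sketch of how the rates could be calculated from the crawl stats. The class and attribute names are placeholders, not the package's actual API, and the exact denominators are open to discussion:

```python
# Rough sketch only: assumes the parsed log exposes the Scrapy crawl stats as
# a dict (here `crawler_stats`); the real attribute name in this package may
# differ, and the denominators below are open to discussion.
class LogMetricsSketch:
    def __init__(self, crawler_stats):
        self.crawler_stats = crawler_stats

    @property
    def drop_rate(self):
        # (1) item_dropped_count vs item_scraped_count
        dropped = self.crawler_stats.get("item_dropped_count", 0)
        scraped = self.crawler_stats.get("item_scraped_count", 0)
        total = dropped + scraped
        return dropped / total if total else 0.0

    @property
    def duplicate_request_rate(self):
        # (2) dupefilter/filtered vs downloader/request_count
        filtered = self.crawler_stats.get("dupefilter/filtered", 0)
        requests = self.crawler_stats.get("downloader/request_count", 0)
        return filtered / requests if requests else 0.0

    @property
    def invalid_json_rate(self):
        # (3) invalid_json_count vs item_scraped_count
        invalid = self.crawler_stats.get("invalid_json_count", 0)
        scraped = self.crawler_stats.get("item_scraped_count", 0)
        return invalid / scraped if scraped else 0.0


# Example usage with made-up numbers:
metrics = LogMetricsSketch({
    "item_dropped_count": 5,
    "item_scraped_count": 95,
    "dupefilter/filtered": 10,
    "downloader/request_count": 200,
    "invalid_json_count": 2,
})
print(metrics.drop_rate, metrics.duplicate_request_rate, metrics.invalid_json_rate)
# -> 0.05, 0.05, ~0.021
```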

@jpmckinney any others? Do you agree with implementing all of them?

jpmckinney commented 4 months ago

In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see logic at:

This might also help decide which of the new methods is most worthwhile to implement.

For example, we might decide to add new methods related to "insufficiently clean".


> error_rate (but using File vs FileError)

Do you mean it should be using something else?

(1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context, and then evaluate. In principle, true duplicates are not necessarily a problem.

(2) Where is the duplicate request rate from?

(3) Maybe we should change the middleware to yield a FileError? That way, invalid JSON gets counted in the error rate. (Invalid JSON does seem like an error.)

(4) I think is_finished is all that matters. The reason itself can be accessed via logparser["crawler_stats"]; we don't need a new method for a simple access (see the sketch after this list).

(5-6) I'm not sure whether we can do anything automatically with this information, but it would be useful to collect and report the messages and counts.

(7) In principle, these should lead to FileError items. I think these might just be for reporting (e.g. logreport prints them out from logparser["crawler_stats"], without needing a new method in this package).

(8) I think this can be covered with (5-6).

(9) This is more to debug retries. In terms of collection quality, URLs that reach the max retries end up being FileErrors. Maybe we just report retry/max_reached and let the user decide whether to investigate.
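
To illustrate (4) and (9), a minimal sketch, assuming the parsed log data looks like logparser's output dict (the exact keys shown are assumptions for illustration):

```python
# Minimal sketch for (4) and (9): read the values straight from the parsed
# crawl stats instead of adding new methods. `logparser_data` stands in for
# the dict this package already builds from the log; the keys shown are
# assumptions for illustration.
logparser_data = {
    "crawler_stats": {
        "finish_reason": "finished",
        "retry/max_reached": 3,
    },
}

crawler_stats = logparser_data["crawler_stats"]
finish_reason = crawler_stats.get("finish_reason", "unknown")  # (4) "finished", "cancelled", etc.
retries_maxed = crawler_stats.get("retry/max_reached", 0)      # (9) report it and let the user decide
print(finish_reason, retries_maxed)
```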

What do you think?

It looks like (5-6) is probably the heaviest. Other than that, maybe we only need to do (1) in this package? Once the code is ready, we can parse the existing logs and see if we want to add any rules to https://github.com/open-contracting/data-registry/issues/29.

yolile commented 4 months ago

(1) Ok.

(2) I've edited the issue, but it's from https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests.

(4) Ok.

(3) and (5-6):

> I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.

We should use this information to report the issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner.
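
As a purely hypothetical illustration of that differentiation (the field names below are illustrative, not kingfisher-collect's actual FileError schema), each error could carry a category so the report to the partner separates invalid JSON from failed requests:

```python
from collections import Counter

# Hypothetical sketch: tag each error with a category when it is recorded, so
# a report to the partner can separate invalid JSON from failed requests.
# Field names and values are illustrative only.
errors = [
    {"url": "http://example.com/1.json", "category": "invalid_json"},
    {"url": "http://example.com/2.json", "category": "http_error", "status": 500},
    {"url": "http://example.com/3.json", "category": "http_error", "status": 404},
]

by_category = Counter(error["category"] for error in errors)
print(by_category)  # Counter({'http_error': 2, 'invalid_json': 1})
```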

(7)(8)(9) ok

jpmckinney commented 4 months ago

Aha, (2) is maybe more of a programming error, whereas (1) is a publisher error. It can be reported for information.

Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating error_rate, we should include invalid JSON.
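
A minimal sketch of what that could look like, assuming invalid JSON items remain File items (per (3) as-is) and using illustrative parameter names rather than this package's actual attributes:

```python
# Sketch only: an error_rate that also counts invalid JSON as errors. Invalid
# JSON items are still yielded as File items (per (3) as-is), so they are
# already included in the total. Parameter names are illustrative.
def error_rate(file_count, file_error_count, invalid_json_count):
    errors = file_error_count + invalid_json_count
    total = file_count + file_error_count
    return errors / total if total else 0.0

print(error_rate(file_count=98, file_error_count=2, invalid_json_count=1))  # 0.03
```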