yolile opened this issue 7 months ago
In terms of how ScrapyLogFile is used (for acceptance criteria, etc.), you can see the logic at:
This might also help decide which of the new methods is most worthwhile to implement.
For example, we might decide to add new methods related to "insufficiently clean".
Regarding "error_rate (but using File vs FileError)": Do you mean it should be using something else?
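(For reference, the existing calculation is roughly a ratio of FileError items to all yielded items. A minimal sketch, not the package's actual implementation:)

```python
# Sketch only: an error rate over yielded items, where FileError items count as
# errors and File items as successes. The real ScrapyLogFile code may differ.
def error_rate(file_count: int, file_error_count: int) -> float:
    total = file_count + file_error_count
    return file_error_count / total if total else 0.0
```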
(1) This can be a new reason for not accepting a collection; we just need to decide a threshold. We can maybe start by just storing the value in the job context, and then evaluate. In principle, true duplicates are not necessarily a problem.
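A hedged sketch of the "store the value now, decide the threshold later" idea; the job-context key, function name and threshold value are all assumptions, not existing code:

```python
# Hypothetical: record the rate for (1) in the job context during the crawl,
# and only turn it into an acceptance rule once a threshold has been agreed.
DROPPED_ITEM_RATE_THRESHOLD = 0.05  # placeholder; the actual threshold is still to be decided

def check_collection(job_context: dict) -> tuple[bool, str]:
    rate = job_context.get("dropped_item_rate")  # assumed key, stored for later evaluation
    if rate is None:
        return True, "no rate recorded"
    if rate > DROPPED_ITEM_RATE_THRESHOLD:
        return False, f"rate {rate:.1%} exceeds the agreed threshold"
    return True, f"rate {rate:.1%} is within the threshold"
```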
(2) Where is the duplicate request rate from?
(3) Maybe we should change the middleware to yield a FileError? That way, invalid JSON gets counted in the error rate. (Invalid JSON does seem like an error.)
(4) I think is_finished is all that matters. The reason itself can be accessed as logparser["crawler_stats"]; we don't need a new method for a simple access.
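To illustrate (4), assuming logparser-style output where crawler_stats includes Scrapy's finish_reason (a sketch, not this package's code):

```python
# Sketch: Scrapy records finish_reason == "finished" when a crawl ends normally.
def is_finished(parsed_log: dict) -> bool:
    return parsed_log.get("crawler_stats", {}).get("finish_reason") == "finished"

# Example with hand-made logparser-style output (structure assumed):
parsed_log = {"crawler_stats": {"finish_reason": "closespider_timeout"}}
if not is_finished(parsed_log):
    print("Crawl did not finish cleanly:", parsed_log["crawler_stats"]["finish_reason"])
```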
(5-6) I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts.
(7) In principle, these should lead to FileError items. I think these might just be for reporting (e.g. logreport prints those out from logparser["crawler_stats"], without needing a new method in this package).
(8) I think this can be covered with 5-6.
(9) This is more to debug retries. In terms of collection quality, URLs that reach the max retries end up being FileErrors. Maybe we just report the retry/max_reached and let the user decide whether to investigate.
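For (9), the reporting could be as simple as surfacing the standard Scrapy RetryMiddleware stats; a sketch:

```python
# Sketch: report Scrapy's retry stats so a user can decide whether to investigate.
# "retry/count" and "retry/max_reached" are standard RetryMiddleware stat keys.
def report_retries(crawler_stats: dict) -> str:
    retries = crawler_stats.get("retry/count", 0)
    gave_up = crawler_stats.get("retry/max_reached", 0)
    return f"{retries} retried requests, {gave_up} requests abandoned after reaching max retries"
```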
What do you think?
It looks like (5-6) is probably the heaviest. Other than that, maybe we only need to do (1) in this package? Once the code is ready, we can parse the existing logs and see if we want to add any rules to https://github.com/open-contracting/data-registry/issues/29
(1) OK.
(2) I've edited the issue; it comes from https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests
(4) OK.
(3) and (5-6) "I'm not sure whether we can do anything automatically with this information. But, it would be useful to collect and report the messages and counts."
We should use this information to report the issues to the partners. If we use FileError for everything, we would need a way to differentiate invalid JSON errors from failed requests, etc., so that we can report the issue properly to the partner.
(7), (8), (9) OK.
Aha - (2) is maybe more of a programming error, whereas (1) is a publisher error. It can be reported for information.
Okay, let's not change (3) to a FileError (as-is, it still gets logged in the crawl log for regular Kingfisher Collect users), but in terms of calculating error_rate, we should include invalid JSON.
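In other words (a sketch, assuming invalid JSON responses still yield File items and that an invalid JSON count is available from the log; the argument names are illustrative):

```python
# Sketch: fold invalid JSON into the error rate without changing the middleware.
# Invalid-JSON responses are assumed to still be counted among the File items,
# so they stay in the denominator and are added to the numerator.
def error_rate(file_count: int, file_error_count: int, invalid_json_count: int) -> float:
    total = file_count + file_error_count
    errors = file_error_count + invalid_json_count
    return errors / total if total else 0.0
```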
The library currently has:

- error_rate (File vs FileError)

As per https://kingfisher-collect.readthedocs.io/en/latest/logs.html, https://github.com/open-contracting/kingfisher-collect/issues/531, https://github.com/open-contracting/kingfisher-collect/issues/1058 and https://github.com/open-contracting/kingfisher-collect/issues/1055, I think we need to add:

- item_dropped_count vs item_scraped_count
- dupefilter/filtered vs downloader/request_count (https://kingfisher-collect.readthedocs.io/en/latest/logs.html#read-the-number-of-duplicate-requests)
- invalid_json_count vs item_scraped_count

@jpmckinney any others? Do you agree with implementing all of them?
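For what it's worth, a hedged sketch of what these could look like as small helpers over the parsed crawler stats; item_dropped_count, item_scraped_count, dupefilter/filtered and downloader/request_count are standard Scrapy stats, while invalid_json_count is only the name proposed here, and the method names are illustrative:

```python
def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

def item_drop_rate(stats: dict) -> float:
    # Items dropped (e.g. as duplicates) relative to items scraped.
    return _ratio(stats.get("item_dropped_count", 0), stats.get("item_scraped_count", 0))

def duplicate_request_rate(stats: dict) -> float:
    # Requests filtered by the dupefilter relative to requests actually downloaded.
    return _ratio(stats.get("dupefilter/filtered", 0), stats.get("downloader/request_count", 0))

def invalid_json_rate(stats: dict) -> float:
    # "invalid_json_count" is proposed in this issue; it is not a built-in Scrapy stat.
    return _ratio(stats.get("invalid_json_count", 0), stats.get("item_scraped_count", 0))
```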