Open hrubyjan opened 3 years ago
scrapy_log_file.py needs to be extracted from https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/scrapy_log_file.py to a small library.
Then, we can use it to apply a policy. Here's a sample policy: https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/crawl.py#L136-L169
Related: https://github.com/open-contracting-archive/kingfisher-archive/issues/44
We want this as a library, so that it can also be used by Kingfisher Collect. https://github.com/open-contracting/kingfisher-collect/issues/531
We can test on
These all went wrong in the scrape phase. Therefore, the scrape task should fail and the process task should not start.
If you update Collect, Mexico quien es quien will work again :)
Also, Mexico INAI portal no longer exists in Collect (if you update it).
@hrubyjan Where are the scrapyd log files?
Assigning only for the last question for now.
The job context contains a reference to the log. For example, you can run the following command to get the log for scraping Kyrgyzstan data:
curl http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log
{
"job_id": "cadd2904064011ec95d5a8a159689b50",
"spider": "kyrgyzstan",
"pelican_id": 1104,
"process_id": "477",
"scrapy_log": "http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log",
"process_id_pelican": 478,
"pelican_dataset_name": "kyrgyzstan_2021-08-26T07:39:50_212",
"process_data_version": "2021-08-26T07:39:50"
}
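As a rough sketch of how a task could retrieve the log programmatically instead of using curl, the snippet below reads the `scrapy_log` URL from a job context like the one above. The function name `fetch_scrapy_log` and the `opener` parameter are hypothetical, introduced only for illustration. It assumes the Scrapyd log URL is reachable from wherever the code runs.

```python
from urllib.request import urlopen


def fetch_scrapy_log(context, opener=urlopen):
    """Return the text of the Scrapy log referenced by a job context.

    `context` is assumed to be a dict shaped like the job context above,
    with a "scrapy_log" key holding the Scrapyd log URL.
    `opener` is a hypothetical hook so the HTTP call can be replaced in tests.
    """
    url = context["scrapy_log"]
    with opener(url) as response:
        # Log files may contain stray bytes; replace them rather than fail.
        return response.read().decode("utf-8", errors="replace")
```

Passing the opener as a parameter is only a convenience for testing without a running Scrapyd instance.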
I'll add this information to the Admin guide.
Container files are also in the overlay2 directory.
Can also check the dropped items statistic (following idea from https://github.com/open-contracting/kingfisher-collect/issues/1055)
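For illustration, here is a minimal sketch of reading the dropped items statistic from a log. It relies on Scrapy writing a "Dumping Scrapy stats:" block at the end of a crawl, with keys such as `item_dropped_count`. This is not the logic in scrapy_log_file.py. The function name `parse_scrapy_stats` is hypothetical, and the regex only captures integer and quoted-string values, so datetime and float entries are skipped.

```python
import re

# Matches lines of the pretty-printed stats dict such as:
#   {'downloader/request_count': 12,
#    'finish_reason': 'finished',
STAT_LINE = re.compile(r"^[ {]'(?P<key>[^']+)': (?P<value>\d+|'[^']*')[,}]?\s*$")


def parse_scrapy_stats(log_text):
    """Extract integer and string stats from the last stats dump in a log."""
    _, marker, tail = log_text.rpartition("Dumping Scrapy stats:")
    if not marker:
        # The crawl may have been killed before the stats were written.
        return {}
    stats = {}
    for line in tail.splitlines():
        match = STAT_LINE.match(line)
        if match:
            value = match["value"]
            stats[match["key"]] = int(value) if value.isdigit() else value.strip("'")
    return stats
```

A caller could then read `stats.get("item_dropped_count", 0)`. An empty result means no stats dump was found, which is itself a sign that the crawl did not finish cleanly.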
At the end of each phase of data processing, we should evaluate whether it ended well, whether there is something suspicious, or whether the phase failed. For the collect phase, define criteria that will a) prevent a dataset from being published in the data registry, or b) raise a warning but not prevent the dataset from being published.
We should not insist on having criteria if we cannot identify meaningful rules.
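To make the two kinds of criteria concrete, here is a sketch of how a collect-phase check could return either a blocking failure or a non-blocking warning. The rules and the 10% threshold are assumptions for discussion only, not an agreed policy. The function name `evaluate_collect_phase` is hypothetical, and `stats` is assumed to be a dict of Scrapy stats (`finish_reason`, `item_scraped_count`, `item_dropped_count`, `log_count/ERROR`).

```python
def evaluate_collect_phase(stats, max_dropped_ratio=0.1):
    """Return (status, reasons), where status is "ok", "warning" or "failed".

    "failed" is meant to block publication; "warning" is not.
    All rules and the default threshold are illustrative assumptions.
    """
    scraped = stats.get("item_scraped_count", 0)
    dropped = stats.get("item_dropped_count", 0)
    errors = stats.get("log_count/ERROR", 0)

    # a) Criteria that prevent the dataset from being published.
    failures = []
    if stats.get("finish_reason") != "finished":
        failures.append(f"finish_reason is {stats.get('finish_reason')!r}")
    if scraped == 0:
        failures.append("no items were scraped")
    if failures:
        return "failed", failures

    # b) Criteria that raise a warning but do not block publication.
    warnings = []
    if errors:
        warnings.append(f"{errors} error message(s) in the log")
    if dropped / (scraped + dropped) > max_dropped_ratio:
        warnings.append(f"{dropped} of {scraped + dropped} items were dropped")
    if warnings:
        return "warning", warnings

    return "ok", []
```

For example, a crawl with `finish_reason` set to `finished`, 100 scraped items and no errors or drops returns `("ok", [])`, whereas a missing `finish_reason` returns `"failed"`, so the process task would not be started.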