open-contracting / data-registry

BSD 3-Clause "New" or "Revised" License

Acceptance criteria - Kingfisher Collect #29

Open hrubyjan opened 3 years ago

hrubyjan commented 3 years ago

At the end of each phase of data processing, we should evaluate whether it ended well, whether something looks suspicious, or whether the phase failed. For the collect phase, define criteria that will:

a) prevent a dataset from being published in the data registry

b) raise a warning but not prevent the dataset from being published

We should not insist on having criteria if we cannot identify meaningful rules.
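
To make the two kinds of criteria concrete, here is a minimal sketch of how a collect-phase evaluation could separate blocking failures from warnings. The `Outcome`, `Evaluation` and `evaluate_collect` names, the specific rules and the `max_error_logs` threshold are all assumptions for illustration, not agreed criteria. The stat keys (`finish_reason`, `item_scraped_count`, `log_count/ERROR`) are ones Scrapy reports in its crawl stats.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    WARNING = "warning"  # publish, but flag for review
    FAILED = "failed"    # do not publish


@dataclass
class Evaluation:
    outcome: Outcome = Outcome.OK
    messages: list = field(default_factory=list)

    def warn(self, message):
        self.messages.append(message)
        # A warning never downgrades an earlier failure.
        if self.outcome is Outcome.OK:
            self.outcome = Outcome.WARNING

    def fail(self, message):
        self.messages.append(message)
        self.outcome = Outcome.FAILED


def evaluate_collect(stats, max_error_logs=0):
    """Apply hypothetical acceptance criteria to a crawl's Scrapy stats."""
    result = Evaluation()

    # a) Blocking criteria (assumed): the crawl must finish cleanly and yield items.
    if stats.get("finish_reason") != "finished":
        result.fail(f"crawl did not finish cleanly: {stats.get('finish_reason')!r}")
    if stats.get("item_scraped_count", 0) == 0:
        result.fail("crawl scraped no items")

    # b) Non-blocking criteria (assumed): logged errors raise a warning only.
    errors = stats.get("log_count/ERROR", 0)
    if errors > max_error_logs:
        result.warn(f"{errors} error messages logged")

    return result
```

A caller could start the process task only when the outcome is not `Outcome.FAILED`, and surface `messages` to an administrator otherwise.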

jpmckinney commented 3 years ago

scrapy_log_file.py needs to be extracted from https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/scrapy_log_file.py to a small library.

Then, we can use it to apply a policy. Here's a sample policy: https://github.com/open-contracting-archive/kingfisher-archive/blob/main/ocdskingfisherarchive/crawl.py#L136-L169

Related: https://github.com/open-contracting-archive/kingfisher-archive/issues/44

We want this as a library, so that it can also be used by Kingfisher Collect. https://github.com/open-contracting/kingfisher-collect/issues/531
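
For orientation, a rough sketch of what reading crawl stats out of a Scrapy log could look like. This is not the code in scrapy_log_file.py. It assumes the log ends with Scrapy's usual "Dumping Scrapy stats:" message followed by a pretty-printed dict with one entry per line, and it only keeps integer and quoted-string values (entries such as `datetime.datetime(...)` or floats are skipped). The `parse_stats` name is hypothetical.

```python
import re

# Matches pretty-printed entries such as " 'item_scraped_count': 42,"
# or "{'downloader/request_count': 7,". Only integers and quoted strings.
STAT_LINE = re.compile(r"^[ {]'(?P<key>[^']+)': (?P<value>\d+|'[^']*')[,}]?\s*$")


def parse_stats(log_text):
    """Return the integer and string stats from the last stats dump in a log."""
    stats = {}
    in_stats = False
    for line in log_text.splitlines():
        if line.endswith("Dumping Scrapy stats:"):
            in_stats = True
            stats = {}  # keep only the most recent dump
            continue
        if not in_stats:
            continue
        match = STAT_LINE.match(line)
        if match:
            value = match.group("value")
            stats[match.group("key")] = int(value) if value.isdigit() else value.strip("'")
        # The dict's closing brace ends the dump.
        if line.rstrip().endswith("}"):
            in_stats = False
    return stats
```

The resulting dict is the kind of input a policy function would need, regardless of whether it runs in the registry or in Kingfisher Collect.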

hrubyjan commented 3 years ago

We can test on

These all went wrong in the scrape phase, so the task should fail and should not start the process task.

jpmckinney commented 3 years ago

If you update Collect, Mexico quien es quien will work again :)

jpmckinney commented 3 years ago

Also, Mexico INAI portal no longer exists in Collect (if you update it).

jpmckinney commented 3 years ago

@hrubyjan Where are the scrapyd log files?

jpmckinney commented 3 years ago

Assigning only for last question for now.

hrubyjan commented 3 years ago

The job context contains a reference to the log. For example, to get the log for scraping the Kyrgyzstan data, run:

curl http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log

The job context:

{
    "job_id": "cadd2904064011ec95d5a8a159689b50",
    "spider": "kyrgyzstan",
    "pelican_id": 1104,
    "process_id": "477",
    "scrapy_log": "http://localhost:6800/logs/kingfisher/kyrgyzstan/cadd2904064011ec95d5a8a159689b50.log",
    "process_id_pelican": 478,
    "pelican_dataset_name": "kyrgyzstan_2021-08-26T07:39:50_212",
    "process_data_version": "2021-08-26T07:39:50"
}

hrubyjan commented 3 years ago

I'll add this information to the Admin guide.
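
As a sketch of how the log could be retrieved programmatically from a job context like the one above: the `scrapy_log` key comes from that example, while the function names and the timeout default are assumptions for illustration.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def scrapy_log_url(context):
    """Return the Scrapy log URL stored in a job context dict."""
    url = context.get("scrapy_log")
    if not url:
        raise KeyError("job context has no 'scrapy_log' entry")
    # Only fetch over HTTP(S); reject anything else stored in the context.
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"unexpected URL scheme in {url!r}")
    return url


def fetch_scrapy_log(context, timeout=30):
    """Download the log text (equivalent to the curl command)."""
    with urlopen(scrapy_log_url(context), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```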

jpmckinney commented 3 years ago

Container files are also in the overlay2 directory.

jpmckinney commented 1 year ago

Related: https://github.com/open-contracting/kingfisher-collect/issues/531

jpmckinney commented 6 months ago

We can also check the dropped items statistic (following the idea from https://github.com/open-contracting/kingfisher-collect/issues/1055).
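
One possible shape for such a check, as a sketch only: it compares Scrapy's `item_dropped_count` and `item_scraped_count` stats and returns a warning message when the dropped share exceeds a threshold. The function name and the 10% default are assumptions, not a proposed value.

```python
def check_dropped_items(stats, max_dropped_ratio=0.1):
    """Return a warning message if too many items were dropped, else None."""
    dropped = stats.get("item_dropped_count", 0)
    scraped = stats.get("item_scraped_count", 0)
    total = dropped + scraped
    if total == 0:
        return None  # nothing was yielded, so there is no ratio to judge
    ratio = dropped / total
    if ratio > max_dropped_ratio:
        return f"{dropped} of {total} items dropped ({ratio:.0%})"
    return None
```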