opensanctions / site

This repository is no longer maintained. The website now contains billing mechanisms and other features that make it hard to maintain as an open source project.
https://www.opensanctions.org

Crawler contribution docs clarification #376

Open jbothma opened 1 year ago

jbothma commented 1 year ago

Include verbose logging in your crawler. Make sure that new fields or enum values introduced upstream (e.g. a new country code or sanction program) will cause a warning to be emitted.

One way to read this is that, to log new countries or sanction programs, a crawler should query for existing countries or programs and log when new ones are being added. Is that right? If so, couldn't the Context be doing that for you?
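
For concreteness, here is roughly what I imagine that crawler-side check looking like (a sketch only; KNOWN_PROGRAMS and check_program are made up for illustration, not existing helpers):

KNOWN_PROGRAMS = {"Programme A", "Programme B"}

def check_program(context, program: str) -> None:
    # Emit a warning when the source introduces a program we haven't seen before.
    if program not in KNOWN_PROGRAMS:
        context.log.warning(f"Unknown sanction program: {program}")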

Also, should the reader take the following to apply generally, too?

Include verbose logging in your crawler.

I'm guessing you don't mean you want log statements like this:

context.log.info(f"Scraping { first_name } { last_name }")

But I do see things that are probably interesting for a given scraper, like which pages are being fetched, and perhaps data that can't be parsed correctly. Is that more the intent?
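
To be concrete, a sketch of the kind of logging I'd guess you do mean (the URL and value here are made up):

url = "https://example.org/sanctions?page=2"
context.log.info(f"Fetching {url}")

raw_date = "31-02-2023"
# Data the crawler can't parse correctly seems worth surfacing:
context.log.warning(f"Could not parse date: {raw_date}")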

pudo commented 1 year ago

I think the first paragraph refers to the general idea of making crawlers deliberately brittle: if something unexpected happens, it is much better for the crawler to complain and crash than to gloss over the issue. In particular, any log message with a level >= WARN will be stored to the database and we can review it later. So having checkpoints like this one is really useful:

https://github.com/opensanctions/opensanctions/blob/main/opensanctions/crawlers/gb_hmt_sanctions.py#L80-L83
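
A rough sketch of what such a checkpoint looks like (illustrative; the field name and values are my guess, not verbatim from the linked lines):

entity_type = row.get("GroupTypeDescription")
if entity_type not in ("Individual", "Entity", "Ship"):
    # WARN-level messages are stored to the database for later review.
    context.log.warning(f"Unexpected group type: {entity_type}")
    return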

Regarding the "verbose" logging: any error message below level info is hidden by default (in practice: log.debug), but you can make them visible by calling opensanctions with the -v flag. That gets super super verbose, though, and to be very honest I do a lot of print() debugging once I know there's an issue....
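
As a minimal illustration of that debug level (assuming the same context.log API as in the snippet above):

# Hidden by default; shown only when opensanctions is run with -v:
context.log.debug(f"Raw row: {row}")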