Adding `include_dataset` as the opposite of existing `exclude_dataset`

everplays commented 1 year ago

Hi there, just running this by you before making a PR for it.

We are only interested in a few of the datasets. Right now, the "correct" way is to query the catalog to see what is there and exclude everything that we are not interested in. However, it would be ideal if we could avoid the call to the catalog API altogether.
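To make the workaround concrete, here is a rough sketch of what it amounts to. It assumes the catalog response has already been fetched and reduced to a plain list of dataset names; the helper name `build_exclude_query` and the sample catalog contents are made up for illustration, and only the `exclude_dataset` parameter comes from the existing API.

```python
from urllib.parse import urlencode


def build_exclude_query(catalog_names, wanted):
    """Return a query string that excludes every catalog dataset not in `wanted`.

    `catalog_names` is assumed to be the list of dataset names taken from the
    catalog response; fetching it is the extra call we would like to avoid.
    """
    wanted = set(wanted)
    excluded = sorted(name for name in catalog_names if name not in wanted)
    # Repeat the parameter once per dataset:
    # exclude_dataset=a&exclude_dataset=b
    return urlencode([("exclude_dataset", name) for name in excluded])


# Hypothetical catalog contents, for illustration only.
catalog = ["us_ofac_sdn", "us_ofac_cons", "eu_fsf", "gb_hmt_sanctions"]
query = build_exclude_query(catalog, ["us_ofac_sdn", "us_ofac_cons"])
print(query)
```

The exclusion list grows with the catalog, so every newly published dataset has to be discovered and excluded, which is why an inclusion list would be simpler for this use case.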

As a cherry on top, it would be nice if we could add a new env var to change the behavior of the indexer as well. This way, we could import only the documents that belong to the datasets we are interested in. It probably wouldn't be a huge performance boost, but it wouldn't hurt either.

I am up for making a PR for the include_dataset parameter (unless you have a better name for it). The indexer stuff: please let me know what you think.

pudo commented 1 year ago

Heya! Appreciate the discussion here a lot. I was also thinking about include_dataset but it does sort of intersect weirdly with the existing scoping of queries in some cases. For context: have you looked at the mechanism for specifying your own manifest YML file (https://www.opensanctions.org/docs/yente/datasets/)? What we've recommended to some customers is to make a custom manifest file like this:

catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    scope: sanctions
    resource_name: entities.ftm.json
datasets:
  - name: usa
    title: USA-related datasets
    datasets:
      - us_ofac_sdn
      - us_ofac_cons
      - us_trade_csl
      - us_bis_denied

What you can see here is that a) the catalog is instructed to import the sanctions dataset, which is smaller than default, and b) you can now run queries like /match/usa on only the listed datasets. (nb: If you'd listed out the datasets individually under scopes in the catalogs section, you'd run into some trouble because the individual datasets use the deduplicated IDs, so you'd have the Putin entity from both list A and list B in the index, and queries would randomly return one of them.)
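As a rough sketch of what querying the custom `usa` collection could look like from a client: the snippet below only builds the request and does not send it. The base URL `http://localhost:8000`, the helper name `build_match_request`, and the query key `q1` are assumptions for the example; the payload shape follows the usual matching request format (a `queries` object keyed by an arbitrary ID), so check it against the API docs for your version.

```python
import json
from urllib.request import Request


def build_match_request(base_url, scope, name):
    """Build (but do not send) a POST request to /match/<scope> for a person name."""
    payload = {
        "queries": {
            # "q1" is an arbitrary client-chosen key for this query.
            "q1": {"schema": "Person", "properties": {"name": [name]}}
        }
    }
    return Request(
        f"{base_url.rstrip('/')}/match/{scope}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local instance; adjust the base URL to your deployment.
req = build_match_request("http://localhost:8000", "usa", "Vladimir Putin")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running instance.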

I'm a bit loath to rebuild this via env vars: it saves you one YML file, but then you need some sort of massive nested YML syntax that is less transparent.

All of this, of course, doesn't mean that we cannot have /match/default?include_dataset=us_ofac_sdn&include_dataset=us_ofac_cons to do more ad-hoc scoping.
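For reference, a client could assemble such an ad-hoc scoped URL along these lines. This is only a sketch: `include_dataset` is the parameter being proposed in this thread rather than something confirmed to exist, and the helper name `match_url` and the base URL are assumptions for the example.

```python
from urllib.parse import urlencode


def match_url(base_url, scope, include=()):
    """Build a /match URL with one include_dataset parameter per dataset."""
    url = f"{base_url.rstrip('/')}/match/{scope}"
    if include:
        # doseq=True expands the list into repeated key=value pairs.
        url += "?" + urlencode({"include_dataset": list(include)}, doseq=True)
    return url


# Assumed local instance; include_dataset is the proposed parameter.
url = match_url("http://localhost:8000", "default", ["us_ofac_sdn", "us_ofac_cons"])
print(url)
```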

everplays commented 1 year ago

Hey @pudo. Thanks for the explanation.

Even though we like to run our own instances of yente, we definitely do not want to run a modified version of it. Hence, as long as there is an official way to do this, we are content (with a file or with an env var, it doesn't matter). It sounds like the manifest does what we want, so I am closing the issue.

P.S. I can see that with this approach, only the data that we are interested in is going to be stored in OpenSearch/Elasticsearch as well. So this is great.

opensanctions / yente #320