Heya! Appreciate the discussion here a lot. I was also thinking about `include_dataset`, but it does sort of intersect weirdly with the existing scoping of queries in some cases. For context: have you looked at the mechanism for specifying your own manifest YML file (https://www.opensanctions.org/docs/yente/datasets/)? What we've recommended to some customers is to make a custom manifest file like this:
```yaml
catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    scope: sanctions
    resource_name: entities.ftm.json
datasets:
  - name: usa
    title: USA-related datasets
    datasets:
      - us_ofac_sdn
      - us_ofac_cons
      - us_trade_csl
      - us_bis_denied
```
What you can see here is that a) the catalog is instructed to import the `sanctions` dataset, which is smaller than `default`, and b) you can now run queries like `/match/usa` on only the listed datasets. (NB: if you'd listed out the datasets individually under `scopes` in the `catalogs` section, you'd run into some trouble, because the individual datasets use the deduplicated IDs, so you'd have the Putin entity from list A and list B both in the index and randomly return one.)
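To make the scoped query concrete, here is a minimal sketch of hitting the custom `usa` scope from a client. It assumes a local yente instance on the default port 8000 and the manifest above; adjust the URL for your deployment:

```python
import requests

# Minimal sketch: a match query against the custom "usa" scope defined in
# the manifest above. Assumes yente is reachable at localhost:8000.
query = {
    "queries": {
        "q1": {
            "schema": "Person",
            "properties": {"name": ["Vladimir Putin"]},
        }
    }
}

resp = requests.post("http://localhost:8000/match/usa", json=query)
resp.raise_for_status()
for result in resp.json()["responses"]["q1"]["results"]:
    print(result["id"], result["score"], result["caption"])
```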
I'm a bit loath to rebuild this via env vars: it saves you one YML file, but then you need some sort of massive nested syntax inside the env var that is less transparent.
All of this, of course, doesn't mean that we cannot have `/match/default?include_dataset=us_ofac_sdn&include_dataset=us_ofac_cons` to do more ad-hoc scoping.
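Purely for illustration, a sketch of what such an ad-hoc call could look like from a client; `include_dataset` is only a proposal here and does not exist in yente at this point:

```python
import requests

# Hypothetical: include_dataset is a proposed parameter, so yente would
# currently ignore or reject it. The repeated query parameter is passed
# as a list of key/value tuples.
params = [
    ("include_dataset", "us_ofac_sdn"),
    ("include_dataset", "us_ofac_cons"),
]
query = {
    "queries": {
        "q1": {"schema": "Person", "properties": {"name": ["Vladimir Putin"]}}
    }
}
resp = requests.post(
    "http://localhost:8000/match/default", params=params, json=query
)
```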
Hey @pudo. Thanks for the explanation.
Even though we like to run our own instances of yente, we definitely do not want to run a modified version of it. Hence, as long as there is an official way to do this, we are content (with a file or with an env var, it doesn't matter). It sounds like the manifest mechanism does what we want, so I am closing the issue.
P.S. I can see that with this approach, only the data we are interested in will be stored in OpenSearch/Elasticsearch as well, so this is great.
Hi there, just running this by you before making a PR for it.
We are only interested in a few of the datasets. Right now, the "correct" way is to query the catalog to see what is there and remove everything we are not interested in. However, it would be ideal if we could avoid the call to the catalog API altogether.
As a cherry on top, it would be nice if we could add a new env var to change the behavior of the indexer as well. That way, we would only import the documents that belong to the datasets we are interested in. It probably wouldn't be a huge performance boost, but it wouldn't hurt either (see the sketch below).
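To make the idea concrete, here is a rough sketch of the kind of filter the indexer could apply. The env var name `YENTE_INCLUDE_DATASETS` and the helper function are invented for illustration and are not part of yente:

```python
import os

# Hypothetical: YENTE_INCLUDE_DATASETS is an invented env var name, not an
# existing yente setting. Empty/unset means "index everything", which
# preserves the current behaviour.
_raw = os.environ.get("YENTE_INCLUDE_DATASETS", "")
INCLUDE_DATASETS = {name.strip() for name in _raw.split(",") if name.strip()}


def should_index(entity_datasets: set[str]) -> bool:
    """Return True if no filter is configured, or if the entity belongs
    to at least one of the requested datasets."""
    if not INCLUDE_DATASETS:
        return True
    return bool(INCLUDE_DATASETS & entity_datasets)


# Example: with YENTE_INCLUDE_DATASETS="us_ofac_sdn,us_ofac_cons" set,
# should_index({"us_ofac_sdn"}) is True and should_index({"eu_fsf"}) is False.
```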
I am up for making a PR for the `include_dataset` parameter (unless you have a better name for it). The indexer stuff: please let me know what you think.