usc-isi-i2 / datamart-api

MIT License

Filter datasets/variables based on country #45

Open saggu opened 4 years ago

saggu commented 4 years ago

This is according to this chat

image

Currently, the variable metadata does not include this information, but this feature could become important.

zmbq commented 4 years ago

So we want the fuzzy search results to be further screened by country, so that only variables that have data for that country are displayed? (Or actually, if we filter, we can filter by any admin level.)

saggu commented 4 years ago

We will try this out at search time first to see how fast the API is.

zmbq commented 4 years ago

Implemented in the itay/fuzzy-search-admins branch. Add ?country= or ?country_id= arguments to the fuzzy search endpoint. You can add multiple countries, and they will be ORed: a variable is returned from the fuzzy search if it has at least one datapoint for one of the countries in the arguments.

Since this is implemented with a materialized view, you need to create the view before you can test. Run python script/create_search_views.py. If you want to test the search after uploading new data, you will need to run python script/refresh_search_views.py. Both can take around 10 minutes to complete.

Only country-level filtering is supported at this point; you cannot filter on admin1, admin2 or admin3.
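The ORed country filter described in this comment can be illustrated with a minimal in-memory sketch. This is not the actual implementation (which uses a materialized view); the function name `filter_by_countries` and the sample data are hypothetical and only show the "at least one datapoint for one of the countries" rule.

```python
from typing import Dict, Iterable, List, Set


def filter_by_countries(
    variable_countries: Dict[str, Set[str]], countries: Iterable[str]
) -> List[str]:
    """Return variables that have data for at least one requested country.

    variable_countries maps a variable name to the set of countries it has
    datapoints for (hypothetical structure, for illustration only).
    """
    wanted = set(countries)
    if not wanted:
        # No country filter given: every variable passes.
        return sorted(variable_countries)
    # OR semantics: a non-empty intersection with the requested set is enough.
    return sorted(
        name for name, have in variable_countries.items() if have & wanted
    )


# Hypothetical sample data.
sample = {
    "crop_yield": {"Ethiopia", "Kenya"},
    "rainfall": {"The Gambia"},
    "population": {"United States"},
}

result = filter_by_countries(sample, ["Ethiopia", "The Gambia"])
print(result)
```

With the sample above, both `crop_yield` and `rainfall` pass the filter, because each has data for one of the two requested countries.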

saggu commented 4 years ago

I tested this.

/metadata/variables?keyword=un&country=Ethiopia works and it is very fast.

However,

/metadata/variables?country=Ethiopia does not work.

I get this error:

{"Error": "A variable query must be provided: keyword"}

They are going to want to search by only country.

Also, this query,

/metadata/variables?keyword=crop&country=Ethiopia&country=Gambia

throws this error:

{"Error": [["No country Gambia"]]}

Instead of returning that, it should return variables which have Ethiopia in them.

If there are no matching variables, it should return an empty table and not an error.

zmbq commented 4 years ago

The fuzzy search is limited to 10 responses. Querying variables that have a specific country in their data, without any filtering on the name, is going to return a lot more than that. I can make the limit larger, but I'm not sure it will be helpful. There are 1,300 variables with the US, for example, and hundreds with pretty much every other country.

As for returning an error when specifying a non-existing country - this is the same behavior we have in the get variable data endpoint. We wanted to distinguish the case where there is no data (or variables in our case) for that country, from the case where you used a non-existent country. I think the behavior should be consistent throughout the system. What do you think?
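The distinction drawn here (unknown country versus known country with no matching variables) could be sketched as follows. All names (`UnknownCountryError`, `search_by_country`, the sample sets) are hypothetical and not taken from the datamart-api code; the error text merely mirrors the message quoted earlier in the thread.

```python
from typing import Dict, Iterable, List, Set


class UnknownCountryError(ValueError):
    """Raised when a requested country name is not known at all."""


def search_by_country(
    known_countries: Set[str],
    variable_countries: Dict[str, Set[str]],
    countries: Iterable[str],
) -> List[str]:
    requested = list(countries)
    # Case 1: the caller used a country name that does not exist -> error.
    unknown = [c for c in requested if c not in known_countries]
    if unknown:
        raise UnknownCountryError("No country " + ", ".join(unknown))
    # Case 2: all countries exist; an empty result is a valid answer.
    wanted = set(requested)
    return sorted(v for v, have in variable_countries.items() if have & wanted)


# Hypothetical sample data.
known = {"Ethiopia", "The Gambia", "Kenya"}
data = {"crop_yield": {"Ethiopia"}}

empty = search_by_country(known, data, ["Kenya"])  # known country, no variables

try:
    search_by_country(known, data, ["Gambia"])  # not a known name
    error_message = None
except UnknownCountryError as exc:
    error_message = str(exc)
```

In this sketch, a known country with no data yields an empty list, while a misspelled or non-existent name raises, so a caller can tell the two cases apart.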

kyao commented 4 years ago

For non-existing country I think returning an error is reasonable.

The Wikidata name for Gambia is The Gambia. This query works as expected: /metadata/variables?keyword=crop&country=Ethiopia&country=The Gambia
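Because "The Gambia" contains a space and `country` is repeated, it can help to build the query string programmatically. The sketch below uses only Python's standard `urllib.parse`; the path is the one quoted in this thread, and no host is assumed.

```python
from urllib.parse import parse_qs, urlencode

# Repeated keys are expressed as a list of (name, value) pairs.
params = [
    ("keyword", "crop"),
    ("country", "Ethiopia"),
    ("country", "The Gambia"),
]

# urlencode escapes the space in "The Gambia" (as "+" by default).
query = urlencode(params)
path = "/metadata/variables?" + query

# Round trip: parsing the query recovers both country values.
decoded = parse_qs(query)
print(path)
```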

kyao commented 4 years ago

@saggu I just pushed changes that allow the keyword to be omitted. An optional result limit has also been added; the default is 100. https://github.com/usc-isi-i2/datamart-api/commit/8e48efea9041a53434b2f4f0f618f35176e7fdce
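A rough sketch of how an optional keyword and a result limit with a default of 100 might be parsed is shown below. This is not the code from the linked commit: the function `parse_search_args` and the query parameter name `limit` are assumptions made for illustration.

```python
from typing import Dict
from urllib.parse import parse_qs

DEFAULT_LIMIT = 100  # default stated in the comment above


def parse_search_args(query_string: str) -> Dict[str, object]:
    """Parse search arguments; keyword is optional, limit defaults to 100.

    The parameter name "limit" is an assumption, not confirmed by the thread.
    """
    args = parse_qs(query_string)
    raw_limit = args.get("limit")
    limit = int(raw_limit[0]) if raw_limit else DEFAULT_LIMIT
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return {
        "keywords": args.get("keyword", []),  # may be empty
        "countries": args.get("country", []),
        "limit": limit,
    }


country_only = parse_search_args("country=Ethiopia")
with_limit = parse_search_args("keyword=crop&country=Ethiopia&limit=10")
```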

saggu commented 4 years ago

@kyao Tested and works fine. I have deployed it for WM.

Now admin1-, admin2- and admin3-based search remains.

saggu commented 4 years ago

I'll send a kgtk exploded file with admin3 to test this.

zmbq commented 4 years ago

@saggu, can you please add the file with admin3?

saggu commented 4 years ago

Waiting for tomorrow's meeting (Sep 16) to see if we even need this functionality. Will update. Moving to ToDo.

kyao commented 4 years ago

Here is a dataset with admin3. Both tsv files in the zip file are needed.

census-partial.zip

zmbq commented 4 years ago

Pushed into development. You need to rerun python script/create_search_views.py to create the admin fuzzy search views. If you upload new data, you will need to run python script/refresh_search_views.py.

saggu commented 4 years ago

Ok, here are the steps I followed:

  1. Created the dataset using the API (census-partial.zip)
  2. Imported the kgtk-edges.tsv file using the import script
  3. Refreshed the search views using the script
  4. /metadata/variables?country=Ethiopia works (search by country)
  5. /metadata/variables?admin=oromia also works; however, none of these work:
    • /metadata/variables?admin1=oromia
    • /metadata/variables?admin2=oromia
    • /metadata/variables?admin3=oromia

Are we supposed to search like this, @zmbq (using admin and not admin1, admin2 or admin3)?

Also, if I do not refresh the views, the API throws an error. We should either handle that error or refresh the views automatically every time new data is ingested. Suggestions, @szeke @kyao?

zmbq commented 4 years ago

You need to import dataset-edges.tsv, too. Also, I think the admin's name is oromia region.

As for having to create the views - this is only needed until we're finished with this issue; then I'll create a database backup with the views and the data.

saggu commented 4 years ago

@zmbq my question is: are we supposed to search with admin=<some admin>, or with admin1, admin2 or admin3?