Create unified ES end-point for health scanner

bishax commented 4 years ago

This would unify the three datasets under a common schema, removing the need for GraphQL, whose cold Lambda starts are currently causing poor performance. This will be implemented as a Luigi task that takes data from the latest ES index for each dataset, applies a remapping and inserts the documents into a unified index (adding a flag for the dataset type).

- [ ] Write schema remappings
- [ ] Write tasks to apply remappings
  - [ ] NIH
  - [ ] Crunchbase
  - [ ] Meetup
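
As a rough illustration of the remapping step described above, here is a minimal sketch in plain Python, without the Luigi or Elasticsearch plumbing. The function name `remap_document`, the two-field mapping and the sample document are hypothetical; dropping unmapped fields is also an assumption of this sketch, not something the issue specifies.

```python
from typing import Any, Dict


def remap_document(doc: Dict[str, Any],
                   mapping: Dict[str, str],
                   dataset: str,
                   flag_field: str = "type_of_entity") -> Dict[str, Any]:
    """Rename source fields to unified names and tag the dataset of origin.

    `mapping` maps source field name -> unified field name. Fields that are
    not listed in the mapping are dropped (an assumption of this sketch).
    """
    unified = {new: doc[old] for old, new in mapping.items() if old in doc}
    unified[flag_field] = dataset
    return unified


# Hypothetical mapping and document for the Meetup dataset
meetup_mapping = {"id_of_group": "id", "name_of_group": "name"}
meetup_doc = {"id_of_group": "123", "name_of_group": "Health hackers", "extra": 1}

unified_doc = remap_document(meetup_doc, meetup_mapping, dataset="meetup")
```

A real task would presumably read documents from the latest index of each dataset and bulk-insert the remapped ones into the unified index; that part is left out here.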

mindrones commented 4 years ago

> adding a flag for the dataset type

there should be type_of_entity

> Write schema remappings

I'll do this now for the discussion

bishax commented 4 years ago

type_of_entity : meetup, company, project ?

mindrones commented 4 years ago

yes

mindrones commented 4 years ago

This is just comparing CB, MU, NIH for Mosaic (not EURITO).

(EDIT: make sure to scroll the yaml portions, for some reason they have a max height)

identification

id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string
name:
  cb:
    name_of_organisation: string
  mu:
    name_of_group: string
  nih:
    title_of_organisation: string

content

description:
  cb:
    textBody_descriptive_organisation: string
  mu:
    textBody_descriptive_group: string
  nih:
    textBody_descriptive_project: string
brief:
  cb:
    textBody_summary_organisation: string
  nih:
    textBody_abstract_project: string
title:
  nih:
    title_of_project: string

time

start_date:
  cb:
    date_birth_organisation: date # yyyy-MM-dd
  mu:
    date_start_group: date # yyyy-MM-dd
  nih:
    date_start_project: date # yyyy-MM-dd
end_date:
  cb:
    date_death_organisation: date # yyyy-MM-dd
  nih:
    date_end_project: date # yyyy-MM-dd
update_date:
  cb:
    datetime_updated_organisation: date # yyyy-MM-dd HH:mm:ss

geo

continentName:
  cb, nih:
    placeName_continent_organisation: string
  mu:
    placeName_continent_group: string
continentId:
  cb, nih:
    id_of_continent: string
    id_continent_organisation: string
  mu:
    id_continent_group: string
countryId:
  cb:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  mu:
    id_country_group: string
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  nih:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
countryName:
  cb, nih:
    placeName_country_organisation: string
  mu:
    placeName_country_group: string
stateId:
  cb, mu, nih:
    id_state_organisation: string
stateName:
  cb, nih:
    placeName_state_organisation: string
regionName:
  cb:
    placeName_region_organisation: string
city:
  cb, nih:
    placeName_city_organisation: string
  mu:
    placeName_city_group: string
zipcode:
  nih:
    placeName_zipcode_organisation: string
address:
  cb:
    address_of_organisation: string
location:
  cb:
    coordinate_of_city:
      lat: float
      lon: float
  mu:
    coordinate_of_group:
      lat: float
      lon: float
  nih:
    coordinate_of_organisation:
      lat: float
      lon: float

metrics

novelty:
  cb:
    rank_rhodonite_organisation: float
  mu:
    rank_rhodonite_group: float
  nih:
    rank_rhodonite_abstract: float
size:
  cb:
    count_employee_organisation: string # 1-100
  mu:
    count_member_group: integer

classification

type_of_entity:
  cb, mu, nih:
    type_of_entity: string
is_duplicate:
  nih:
    booleanFlag_duplicate_abstract: boolean
is_autotranslated:
  mu:
    booleanFlag_autotranslated_entity: boolean
is_health:
  cb:
    booleanFlag_health_organisation: boolean
terms_mesh:
  cb:
    terms_mesh_description: string[]
  mu:
    terms_mesh_group: string[]
  nih:
    terms_mesh_abstract: string[]
terms_sdg:
  nih:
    terms_sdg_abstract: string[]
terms_place:
  cb, mu, nih:
    terms_of_countryTags: string[]
terms_topics:
  mu:
    terms_topics_group: string[]
  nih:
    terms_descriptive_project: string[]
terms_funders:
  cb, nih:
    terms_of_funders: string[]
terms_language:
  mu:
    terms_iso2lang_entity: string[]

web

url_cb:
  cb:
    url_crunchBase_organisation: string
url_fb:
  cb:
    url_facebook_organisation: string
url_li:
  cb:
    url_linkedIn_organisation: string
url_site:
  cb:
    url_of_organisation: string
  mu:
    url_of_group: string
url_tw:
  cb:
    url_twitter_organisation: string

funding

funding_cost:
  cb:
    cost_of_funding: float
  nih:
    cost_total_project: float
funding_rounds:
  cb:
    count_rounds_funding: integer
  nih:
    json_funding_project:
      []:
        cost_ref: long
        end_date: date
        start_date: date
        year: integer
funding_currency:
  cb:
    currency_of_funding: string
  nih:
    currency_total_cost: string
funding_last_date: # gah..
  cb:
    date_last_funding: date # yyyy-MM-dd
funding_year:
  nih:
    year_fiscal_funding: integer
funding_entity:
  nih:
    title_of_funder: string

# this could become an object (see also `json_funding_project` above)
#funding:
#  cost: float
#  rounds: integer
#  currency: string
#  date_last_funding?: date # yyyy-MM-dd

custom

owner:
  cb:
    id_parent_organisation: string
status:
  cb:
    status_of_organisation: string
alias:
  cb:
    terms_alias_organisation: string[]
terms_category:
  cb:
    terms_category_organisation: string[] # multiple, of a group of known categories
  mu:
    name_of_category: string # single, of a group of known categories
terms_subcategory:
  cb:
    terms_subcategory_organisation: string[]
roles:
  cb:
    terms_roles_organisation: string[]
type:
  cb:
    type_of_organisation: string

unused

cb:
  _cost_usd2018_organisation: float
  _terms_sdg_summary: string[]

mu:
  _id_state_group: string
  _placeName_state_group: string
  _terms_memberOrigin_group: string[]
  _terms_sdg_description: string[]

mindrones commented 4 years ago

Back then, even when using aliases, the response still contained items with the original, non-aliased schema (which basically defeats the purpose of aliasing, although it does help when composing the query).

As an alternative to this re-mapping, we could investigate if newer versions of ElasticSearch can return items with the aliased schema.

bishax commented 4 years ago

Even if that option were now available, I think migrating to a new version of ES would be a larger effort, particularly if there have been any breaking changes. Furthermore, doesn't this approach reduce the number of queries that need to be made?

bishax commented 4 years ago

id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string

Why not id_parent_organisation for cb?

mindrones commented 4 years ago

That would be the id of the main entity, id_parent_organisation identifies another company in Crunchbase I think.

Btw, if discussing via snippets sounds difficult we can start a branch and review mappings via PR comments?

mindrones commented 4 years ago

Not sure why there is no id_of_organisation for Crunchbase entities.

bishax commented 4 years ago

Is there any documentation for RWJF outside of nestauk/nesta?

bishax commented 4 years ago

I have a branch. I'll push and open a PR when I have a first pass

mindrones commented 4 years ago

> I'll push and open a PR when I have a first pass

OK.

> Furthermore, this way reduces the number of queries needing to be made?

I don't think so: by using the alias health_scanner we can already query all the indices behind that alias at the same time. The problem, as we discussed, is that you get back an array of items that each keep the schema of their originating index.
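
To make the mixed-schema problem concrete: each hit in an Elasticsearch search response carries the name of the index it came from in `_index`, so a client querying through an alias would have to rename fields per hit. The sketch below works on a hand-written stand-in for such a response; the index names `crunchbase_v1` and `meetup_v1`, the `RENAMES_BY_INDEX` table and the function `normalise_hits` are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical index names and per-index renames; the real ones will differ
RENAMES_BY_INDEX = {
    "crunchbase_v1": {"name_of_organisation": "name"},
    "meetup_v1": {"name_of_group": "name"},
}


def normalise_hits(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Rename each hit's fields according to the index it came from."""
    items = []
    for hit in response["hits"]["hits"]:
        renames = RENAMES_BY_INDEX.get(hit["_index"], {})
        source = hit["_source"]
        # Fields without a rename entry keep their original name
        items.append({renames.get(key, key): value for key, value in source.items()})
    return items


# Hand-written stand-in for a search response obtained through an alias
fake_response = {
    "hits": {
        "hits": [
            {"_index": "crunchbase_v1", "_source": {"name_of_organisation": "Acme"}},
            {"_index": "meetup_v1", "_source": {"name_of_group": "Health hackers"}},
        ]
    }
}

items = normalise_hits(fake_response)
```

Doing this on every request pushes the remapping onto each client, which is the work the unified index would do once at ingestion time.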

mindrones commented 4 years ago

@jaklinger here's the definitive mapping in CSV format:

new_name,CB,MU,NIH
address,address_of_organisation,,
brief,textBody_summary_organisation,,textBody_abstract_project
continent_id,id_of_continent,id_continent_group,id_of_continent
continent,placeName_continent_organisation,placeName_continent_group,placeName_continent_organisation
country_id,id_iso2_country,id_iso2_country,id_iso2_country
country,placeName_country_organisation,placeName_country_group,placeName_country_organisation
city,placeName_city_organisation,placeName_city_group,placeName_city_organisation
date_end,date_death_organisation,,date_end_project
date_start,date_birth_organisation,date_start_group,date_start_project
date_update,datetime_updated_organisation,,
description,textBody_descriptive_organisation,textBody_descriptive_group,textBody_descriptive_project
funding_cost,cost_of_funding,,cost_total_project
funding_currency,currency_of_funding,,currency_total_cost
funding_rounds,count_rounds_funding,,json_funding_project
funding_year,,,year_fiscal_funding
funder,,,title_of_funder
id,,id_of_group,id_of_project
is_autotranslated,,booleanFlag_autotranslated_entity,
is_duplicate,,,booleanFlag_duplicate_abstract
is_health,booleanFlag_health_organisation,,
location,coordinate_of_city,coordinate_of_group,coordinate_of_organisation
name,name_of_organisation,name_of_group,title_of_organisation
novelty,rank_rhodonite_organisation,rank_rhodonite_group,rank_rhodonite_abstract
parent_id,id_parent_organisation,,
region_name,placeName_region_organisation,,
source,type_of_entity,type_of_entity,type_of_entity
state_id,id_state_organisation,id_state_organisation,id_state_organisation
state,placeName_state_organisation,,placeName_state_organisation
status,status_of_organisation,,
size,count_employee_organisation,count_member_group,
terms_alias,terms_alias_organisation,,
terms_category,terms_category_organisation,name_of_category,
terms_funder,terms_of_funders,,terms_of_funders
terms_lang,,terms_iso2lang_entity,
terms_mesh,terms_mesh_description,terms_mesh_group,terms_mesh_abstract
terms_place,terms_of_countryTags,terms_of_countryTags,terms_of_countryTags
terms_role,terms_roles_organisation,,
terms_sdg,,,terms_sdg_abstract
terms_subcategory,terms_subcategory_organisation,,
terms_topics,,terms_topics_group,terms_descriptive_project
title,,,title_of_project
type,type_of_organisation,,
url_source,url_crunchBase_organisation,url_of_group,
url_fb,url_facebook_organisation,,
url_li,url_linkedIn_organisation,,
url_site,url_of_organisation,,
url_tw,url_twitter_organisation,,
zipcode,,,placeName_zipcode_organisation
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
<remove>,id_isoNumeric_country,id_isoNumeric_country,id_isoNumeric_country
<remove>,id_continent_organisation (dupe),_id_state_group,id_continent_organisation (dupe)
<remove>,date_last_funding,_placeName_state_group,
<remove>,_cost_usd2018_organisation,_terms_memberOrigin_group,
<remove>,_terms_sdg_summary,_terms_sdg_description,

mindrones commented 4 years ago

I've marked some fields for removal as they're duplicate or redundant, temporary or unused, see <remove>
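
One possible way to consume the CSV above is to turn it into one rename table per dataset, skipping empty cells and the `<remove>` rows. This is only a sketch using the standard `csv` module; the function name `load_renames` and the three-row excerpt are illustrative, and the real task may load the mapping differently.

```python
import csv
import io
from typing import Dict


def load_renames(csv_text: str) -> Dict[str, Dict[str, str]]:
    """Return {dataset: {source_field: new_name}} from the mapping CSV.

    Rows whose new_name is "<remove>" and empty cells are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    datasets = [col for col in reader.fieldnames if col != "new_name"]
    renames: Dict[str, Dict[str, str]] = {ds: {} for ds in datasets}
    for row in reader:
        new_name = row["new_name"]
        if new_name == "<remove>":
            continue
        for ds in datasets:
            source_field = (row[ds] or "").strip()
            if source_field:
                renames[ds][source_field] = new_name
    return renames


# A three-row excerpt of the mapping table
excerpt = """new_name,CB,MU,NIH
id,,id_of_group,id_of_project
name,name_of_organisation,name_of_group,title_of_organisation
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
"""

renames = load_renames(excerpt)
```

Here `renames["MU"]` maps `id_of_group` to `id` and `name_of_group` to `name`, while the `<remove>` row contributes nothing to any dataset.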

mindrones commented 4 years ago

In the above CSV, I've renamed dataset to source, and I'd ask that the current values of type_of_entity be changed to crunchbase, meetup and NIH, so that in arxlive this could be source = arxiv | biorxiv | medrxiv (instead of article_source, for uniformity).

mindrones commented 4 years ago

In the above csv, I've changed alias into terms_alias (I didn't realise it is an array).
