Open bishax opened 4 years ago
adding a flag for the dataset type
there should be type_of_entity
Write schema remappings
I'll do this now for the discussion
type_of_entity: meetup, company, project?
yes
This is just comparing CB, MU, NIH for Mosaic (not EURITO).
(EDIT: make sure to scroll the yaml portions, for some reason they have a max height)
id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string
name:
  cb:
    name_of_organisation: string
  mu:
    name_of_group: string
  nih:
    title_of_organisation: string
description:
  cb:
    textBody_descriptive_organisation: string
  mu:
    textBody_descriptive_group: string
  nih:
    textBody_descriptive_project: string
brief:
  cb:
    textBody_summary_organisation: string
  nih:
    textBody_abstract_project: string
title:
  nih:
    title_of_project: string
start_date:
  cb:
    date_birth_organisation: date # yyyy-MM-dd
  mu:
    date_start_group: date # yyyy-MM-dd
  nih:
    date_start_project: date # yyyy-MM-dd
end_date:
  cb:
    date_death_organisation: date # yyyy-MM-dd
  nih:
    date_end_project: date # yyyy-MM-dd
update_date:
  cb:
    datetime_updated_organisation: date # yyyy-MM-dd HH:mm:ss
continentName:
  cb, nih:
    placeName_continent_organisation: string
  mu:
    placeName_continent_group: string
continentId:
  cb, nih:
    id_of_continent: string
    id_continent_organisation: string
  mu:
    id_continent_group: string
countryId:
  cb:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  mu:
    id_country_group: string
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
  nih:
    id_iso2_country: string
    id_iso3_country: string
    id_isoNumeric_country: integer
countryName:
  cb, nih:
    placeName_country_organisation: string
  mu:
    placeName_country_group: string
stateId:
  cb, mu, nih:
    id_state_organisation: string
stateName:
  cb, nih:
    placeName_state_organisation: string
regionName:
  cb:
    placeName_region_organisation: string
city:
  cb, nih:
    placeName_city_organisation: string
  mu:
    placeName_city_group: string
zipcode:
  nih:
    placeName_zipcode_organisation: string
address:
  cb:
    address_of_organisation: string
location:
  cb:
    coordinate_of_city:
      lat: float
      lon: float
  mu:
    coordinate_of_group:
      lat: float
      lon: float
  nih:
    coordinate_of_organisation:
      lat: float
      lon: float
novelty:
  cb:
    rank_rhodonite_organisation: float
  mu:
    rank_rhodonite_group: float
  nih:
    rank_rhodonite_abstract: float
size:
  cb:
    count_employee_organisation: string # 1-100
  mu:
    count_member_group: integer
type_of_entity:
  cb, mu, nih:
    type_of_entity: string
is_duplicate:
  nih:
    booleanFlag_duplicate_abstract: boolean
is_autotranslated:
  mu:
    booleanFlag_autotranslated_entity: boolean
is_health:
  cb:
    booleanFlag_health_organisation: boolean
terms_mesh:
  cb:
    terms_mesh_description: string[]
  mu:
    terms_mesh_group: string[]
  nih:
    terms_mesh_abstract: string[]
terms_sdg:
  nih:
    terms_sdg_abstract: string[]
terms_place:
  cb, mu, nih:
    terms_of_countryTags: string[]
terms_topics:
  mu:
    terms_topics_group: string[]
  nih:
    terms_descriptive_project: string[]
terms_funders:
  cb, nih:
    terms_of_funders: string[]
terms_language:
  mu:
    terms_iso2lang_entity: string[]
url_cb:
  cb:
    url_crunchBase_organisation: string
url_fb:
  cb:
    url_facebook_organisation: string
url_li:
  cb:
    url_linkedIn_organisation: string
url_site:
  cb:
    url_of_organisation: string
  mu:
    url_of_group: string
url_tw:
  cb:
    url_twitter_organisation: string
funding_cost:
  cb:
    cost_of_funding: float
  nih:
    cost_total_project: float
funding_rounds:
  cb:
    count_rounds_funding: integer
  nih:
    json_funding_project:
      []:
        cost_ref: long
        end_date: date
        start_date: date
        year: integer
funding_currency:
  cb:
    currency_of_funding: string
  nih:
    currency_total_cost: string
funding_last_date: # gah..
  cb:
    date_last_funding: date # yyyy-MM-dd
funding_year:
  nih:
    year_fiscal_funding: integer
funding_entity:
  nih:
    title_of_funder: string
# this could become an object (see also `json_funding_project` above)
#funding:
#  cost: float
#  rounds: integer
#  currency: string
#  date_last_funding?: date # yyyy-MM-dd
owner:
  cb:
    id_parent_organisation: string
status:
  cb:
    status_of_organisation: string
alias:
  cb:
    terms_alias_organisation: string[]
terms_category:
  cb:
    terms_category_organisation: string[] # multiple, of a group of known categories
  mu:
    name_of_category: string # single, of a group of known categories
terms_subcategory:
  cb:
    terms_subcategory_organisation: string[]
roles:
  cb:
    terms_roles_organisation: string[]
type:
  cb:
    type_of_organisation: string

cb:
  _cost_usd2018_organisation: float
  _terms_sdg_summary: string[]
mu:
  _id_state_group: string
  _placeName_state_group: string
  _terms_memberOrigin_group: string[]
  _terms_sdg_description: string[]
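To make the comparison above easier to check, here is a rough sketch that expands combined keys such as `cb, nih` into one entry per dataset and lists the new names that are fed by more than one source field. It assumes the mapping has been loaded into a plain nested dict; `MAPPING`, `expand` and `ambiguous` are hypothetical names, and only two entries are copied in for brevity.

```python
from collections import defaultdict

# Excerpt of the mapping as a nested dict: new name -> dataset key -> fields.
# (Assumption: the real YAML would be loaded into this shape.)
MAPPING = {
    "name": {
        "cb": {"name_of_organisation": "string"},
        "mu": {"name_of_group": "string"},
        "nih": {"title_of_organisation": "string"},
    },
    "continentId": {
        "cb, nih": {
            "id_of_continent": "string",
            "id_continent_organisation": "string",
        },
        "mu": {"id_continent_group": "string"},
    },
}


def expand(mapping):
    """Return {dataset: {new_name: [old_field, ...]}}, splitting keys like 'cb, nih'."""
    per_dataset = defaultdict(lambda: defaultdict(list))
    for new_name, datasets in mapping.items():
        for dataset_key, fields in datasets.items():
            for dataset in (d.strip() for d in dataset_key.split(",")):
                per_dataset[dataset][new_name].extend(fields)
    return per_dataset


def ambiguous(per_dataset):
    """List (dataset, new_name) pairs that are fed by more than one source field."""
    return sorted(
        (dataset, new_name)
        for dataset, names in per_dataset.items()
        for new_name, old_fields in names.items()
        if len(old_fields) > 1
    )


print(ambiguous(expand(MAPPING)))
# [('cb', 'continentId'), ('nih', 'continentId')]
```

Pairs reported by `ambiguous` are the ones where a single field has to be picked (or the others dropped) before a one-to-one rename is possible.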
Back then, even when using aliases, the response still contained items with the original, non-aliased schema (which basically defeats the purpose of aliasing, although it does help when composing the query).
As an alternative to this re-mapping, we could investigate whether newer versions of Elasticsearch can return items with the aliased schema.
Even if that option were now available, I think migrating to a new version of ES would be a larger effort, particularly if there have been any breaking changes. Furthermore, this way reduces the number of queries needing to be made?
id:
  mu:
    id_of_group: string
  nih:
    id_of_project: string
Why not id_parent_organisation for cb?
That would be the id of the main entity; id_parent_organisation identifies another company in Crunchbase, I think.
Btw, if discussing via snippets sounds difficult we can start a branch and review mappings via PR comments?
Not sure why there is no id_of_organisation for Crunchbase entities.
Is there any documentation for RWJF outside of nestauk/nesta?
I have a branch. I'll push and open a PR when I have a first pass
I'll push and open a PR when I have a first pass
OK.
Furthermore, this way reduces the number of queries needing to be made?
I don't think so: by using the alias health_scanner we can query all the endpoints behind that alias at the same time (the problem being, as we discussed, that you get an array of items with the schema from the originating index).
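To illustrate the problem with querying through the alias, here is a small sketch using a hand-written dict shaped like an Elasticsearch search response (`hits.hits[]`, each with `_index` and `_source`). The index names and values are made up; only the field names come from this thread.

```python
# Mock of a search response obtained through an alias spanning several indices.
# Assumption: index names and values are invented for illustration only.
response = {
    "hits": {
        "hits": [
            {"_index": "companies_v1", "_source": {"name_of_organisation": "Acme"}},
            {"_index": "meetup_v1", "_source": {"name_of_group": "Health Hackers"}},
            {"_index": "nih_v1", "_source": {"title_of_organisation": "Some Lab"}},
        ]
    }
}


def fields_by_index(resp):
    """Collect the set of _source field names seen for each originating index."""
    seen = {}
    for hit in resp["hits"]["hits"]:
        seen.setdefault(hit["_index"], set()).update(hit["_source"])
    return seen


for index, fields in sorted(fields_by_index(response).items()):
    print(index, sorted(fields))
```

One query returns all three items, but each keeps the field names of its own index, so the client still has to know every schema.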
@jaklinger here's the definitive mapping in CSV format:
new_name,CB,MU,NIH
address,address_of_organisation,,
brief,textBody_summary_organisation,,textBody_abstract_project
continent_id,id_of_continent,id_continent_group,id_of_continent
continent,placeName_continent_organisation,placeName_continent_group,placeName_continent_organisation
country_id,id_iso2_country,id_iso2_country,id_iso2_country
country,placeName_country_organisation,placeName_country_group,placeName_country_organisation
city,placeName_city_organisation,placeName_city_group,placeName_city_organisation
date_end,date_death_organisation,,date_end_project
date_start,date_birth_organisation,date_start_group,date_start_project
date_update,datetime_updated_organisation,,
description,textBody_descriptive_organisation,textBody_descriptive_group,textBody_descriptive_project
funding_cost,cost_of_funding,,cost_total_project
funding_currency,currency_of_funding,,currency_total_cost
funding_rounds,count_rounds_funding,,json_funding_project
funding_year,,,year_fiscal_funding
funder,,,title_of_funder
id,,id_of_group,id_of_project
is_autotranslated,,booleanFlag_autotranslated_entity,
is_duplicate,,,booleanFlag_duplicate_abstract
is_health,booleanFlag_health_organisation,,
location,coordinate_of_city,coordinate_of_group,coordinate_of_organisation
name,name_of_organisation,name_of_group,title_of_organisation
novelty,rank_rhodonite_organisation,rank_rhodonite_group,rank_rhodonite_abstract
parent_id,id_parent_organisation,,
region_name,placeName_region_organisation,,
source,type_of_entity,type_of_entity,type_of_entity
state_id,id_state_organisation,id_state_organisation,id_state_organisation
state,placeName_state_organisation,,placeName_state_organisation
status,status_of_organisation,,
size,count_employee_organisation,count_member_group,
terms_alias,terms_alias_organisation,,
terms_category,terms_category_organisation,name_of_category,
terms_funder,terms_of_funders,,terms_of_funders
terms_lang,,terms_iso2lang_entity,
terms_mesh,terms_mesh_description,terms_mesh_group,terms_mesh_abstract
terms_place,terms_of_countryTags,terms_of_countryTags,terms_of_countryTags
terms_role,terms_roles_organisation,,
terms_sdg,,,terms_sdg_abstract
terms_subcategory,terms_subcategory_organisation,,
terms_topics,,terms_topics_group,terms_descriptive_project
title,,,title_of_project
type,type_of_organisation,,
url_source,url_crunchBase_organisation,url_of_group,
url_fb,url_facebook_organisation,,
url_li,url_linkedIn_organisation,,
url_site,url_of_organisation,,
url_tw,url_twitter_organisation,,
zipcode,,,placeName_zipcode_organisation
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
<remove>,id_isoNumeric_country,id_isoNumeric_country,id_isoNumeric_country
<remove>,id_continent_organisation (dupe),_id_state_group,id_continent_organisation (dupe)
<remove>,date_last_funding,_placeName_state_group,
<remove>,_cost_usd2018_organisation,_terms_memberOrigin_group,
<remove>,_terms_sdg_summary,_terms_sdg_description,
I've marked some fields for removal because they are duplicate, redundant, temporary or unused; see the <remove> rows.
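In case it helps with the implementation, here is a minimal sketch of how the CSV could be turned into per-dataset rename tables plus a set of fields to drop. `load_renames` is a hypothetical helper, and only a few rows are copied in; it uses nothing beyond the standard `csv` module.

```python
import csv
import io

# A few rows copied from the CSV above; the full file would be read the same way.
CSV_TEXT = """\
new_name,CB,MU,NIH
brief,textBody_summary_organisation,,textBody_abstract_project
id,,id_of_group,id_of_project
source,type_of_entity,type_of_entity,type_of_entity
<remove>,id_iso3_country,id_iso3_country,id_iso3_country
"""


def load_renames(text):
    """Return ({dataset: {old: new}}, {dataset: {old fields to drop}})."""
    renames, drops = {}, {}
    for row in csv.DictReader(io.StringIO(text)):
        new_name = row.pop("new_name")
        for dataset, old in row.items():
            if not old:
                continue  # this dataset has no field for this new name
            if new_name == "<remove>":
                drops.setdefault(dataset, set()).add(old)
            else:
                renames.setdefault(dataset, {})[old] = new_name
    return renames, drops


renames, drops = load_renames(CSV_TEXT)
print(renames["MU"])
print(drops["NIH"])
```

Cells carrying annotations such as `(dupe)` would need cleaning before being used as field names; the sketch does not handle that.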
In the above csv, I've changed dataset into source, with the request to change the current value of type_of_entity into crunchbase, meetup and NIH, so that in arxlive this could be source = arxiv | biorxiv | medrxiv (instead of article_source, for uniformity)
In the above csv, I've changed alias into terms_alias (I didn't realise it is an array).
This would unify the three datasets under a common schema, alleviating the need for GraphQL, whose cold lambda starts are currently causing poor performance. This will be implemented as a Luigi task that takes data from the latest ES indices for each dataset, applies a remapping and inserts the results into a unified index (adding a flag for the dataset type).
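A rough sketch of the remapping step on its own, without the Luigi and ES plumbing, might look like the following. The names `remap`, `unify` and `RENAMES` are hypothetical, the rename tables are a small excerpt, and calling the flag `source` is an assumption.

```python
def remap(record, renames, source):
    """Rename known fields, drop everything else, and tag the record with its source."""
    unified = {new: record[old] for old, new in renames.items() if old in record}
    unified["source"] = source  # the flag for the dataset type (name is an assumption)
    return unified


# Hypothetical per-dataset rename tables (old field -> new field), excerpted.
RENAMES = {
    "crunchbase": {
        "name_of_organisation": "name",
        "date_birth_organisation": "date_start",
    },
    "meetup": {
        "name_of_group": "name",
        "date_start_group": "date_start",
    },
}


def unify(records_by_source):
    """Yield remapped records for every dataset, ready to be indexed together."""
    for source, records in records_by_source.items():
        for record in records:
            yield remap(record, RENAMES[source], source)


docs = list(unify({
    "meetup": [
        {"name_of_group": "X", "date_start_group": "2019-01-01", "_id_state_group": "CA"},
    ],
}))
print(docs)
# [{'name': 'X', 'date_start': '2019-01-01', 'source': 'meetup'}]
```

Fields with no entry in the rename table (such as `_id_state_group` here) are dropped silently; reading from the latest indices and writing to the unified index are left out.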