kg-construct / best-practices

1 stars 0 forks source link

Generate Transforms and Shapes from Models #7

Open dachafra opened 3 years ago

dachafra commented 3 years ago

issue: Creating transformations is tedious, requires specialized knowledge, and doesn't visualize the shape. YARRML and ONTOP OBDA provide great "shorthand" notations for faster writing, but still don't generate visualizations or RDF shapes.

suggestion: Generate transforms from models. Cover "typical" transforms ("80% case")

link: http://github.com/vladimiralexiev/rdf2rml generates PlantUML diagrams, R2RML transforms, and TARQL conversion queries from Turtle examples

dachafra commented 2 years ago

IMO this is more a "best-practice" or something that can be developed/researched after the specs will be ready, but independently from the CG. What is your opinion @bjdmeest? Shall we close this?

bjdmeest commented 2 years ago

I agree, I feel in this CG we need to agree on what kind of transformations the language should support, and how to describe them, not the tooling to help make those descriptions

VladimirAlexiev commented 2 years ago

@dachafra @bjdmeest sorry I haven't attended CG meetings in the last year :-( Please share this below with the CG.

Here's an example from a current project to whet your appetite. Take the Crunchbase organizations table. We model it as RDF like this:

<cb/agent/(uuid)> a k2:Organization;
  k2:cbId '(uuid)';
  k2:name '(name)';
  k2:cbPermalink '(permalink)';
  k2:cbUrl '(cb_url)'^^xsd:anyURI;
  k2:rank '(rank)'^^xsd:integer;
  k2:createdAt 'fixDate(created_at)'^^xsd:dateTime;
  k2:updatedAt 'fixDate(updated_at)'^^xsd:dateTime;
  k2:legalName '(legal_name)';
  k2:organizationRole <cb/organizationRole/urlify(split1(roles))>;
  k2:domain '(domain)';
  k2:homepageUrl '(homepage_url)'^^xsd:anyURI;
  k2:countryCode '(country_code)';
  k2:stateCode '(state_code)';
  k2:region '(region)';
  k2:city '(city)';
  k2:address '(address)';
  k2:postalCode '(postal_code)';
  k2:status <cb/organizationStatus/urlify(status)>;
  k2:shortDescription '(short_description)';
  k2:industry <cb/industry/urlify(split1(category_list))>;
  k2:numFundingRounds '(num_funding_rounds)'^^xsd:integer;
  k2:totalFundingUsd '(total_funding_usd)'^^xsd:decimal;
  k2:totalFunding '(total_funding)'^^xsd:decimal;
  k2:totalFundingCurrencyCode '(total_funding_currency_code)';
  k2:foundedOn 'fixDate(founded_on)'^^xsd:dateTime;
  k2:lastFundingOn 'fixDate(last_funding_on)'^^xsd:dateTime;
  k2:closedOn 'fixDate(closed_on)'^^xsd:dateTime;
  k2:employeeCount <cb/employeeCount/urlify(ifNotNull(employee_count))>;
  k2:email '(email)';
  k2:phone '(phone)';
  k2:facebookUrl '(facebook_url)'^^xsd:anyURI;
  k2:linkedinUrl '(linkedin_url)'^^xsd:anyURI;
  k2:twitterUrl '(twitter_url)'^^xsd:anyURI;
  k2:logoUrl '(logo_url)'^^xsd:anyURI;
  k2:alias '(alias1)';
  k2:alias '(alias2)';
  k2:alias '(alias3)';
  k2:primaryRole <cb/organizationRole/urlify(primary_role)>;
  k2:numExits '(num_exits)'^^xsd:integer.

It's one of 17 tables depicted as the following diagram: image

From these 17 models, we generate OntoRefine loading & update scripts:

delete {graph ?GRAPH {?s ?p ?o}}
where {
  service <rdf-mapper:ontorefine:PROJECT_ID> {
    bind(?c_updated_at as ?c_updated_at1)
    bind(iri(concat("cb/graph/organizations/",?c_uuid)) as ?GRAPH)}
  <cb> k2:updatedAt ?UPDATED_AT_DATETIME.
  bind(replace(str(?UPDATED_AT_DATETIME),'T',' ') as ?UPDATED_AT)
  filter(?c_updated_at1 > ?UPDATED_AT)
  graph ?GRAPH {?s ?p ?o}};

insert {graph ?GRAPH {
  ?cb_agent_uuid_URL a k2:Organization;
    k2:cbId ?uuid;
    k2:name ?name;
    k2:cbPermalink ?permalink;
    k2:cbUrl ?CB_URL;
    k2:rank ?RANK;
    k2:createdAt ?CREATED_AT_FIXDATE;
    k2:updatedAt ?UPDATED_AT_FIXDATE;
    k2:legalName ?legal_name;
    k2:organizationRole ?cb_organizationRole_roles_SPLIT1_URLIFY_URL;
    k2:domain ?domain;
    k2:homepageUrl ?HOMEPAGE_URL;
    k2:countryCode ?country_code;
    k2:stateCode ?state_code;
    k2:region ?region;
    k2:city ?city;
    k2:address ?address;
    k2:postalCode ?postal_code;
    k2:status ?cb_organizationStatus_status_URLIFY_URL;
    k2:shortDescription ?short_description;
    k2:industry ?cb_industry_category_list_SPLIT1_URLIFY_URL;
    k2:numFundingRounds ?NUM_FUNDING_ROUNDS;
    k2:totalFundingUsd ?TOTAL_FUNDING_USD;
    k2:totalFunding ?TOTAL_FUNDING;
    k2:totalFundingCurrencyCode ?total_funding_currency_code;
    k2:foundedOn ?FOUNDED_ON_FIXDATE;
    k2:lastFundingOn ?LAST_FUNDING_ON_FIXDATE;
    k2:closedOn ?CLOSED_ON_FIXDATE;
    k2:employeeCount ?cb_employeeCount_employee_count_IFNOTNULL_URLIFY_URL;
    k2:email ?email;
    k2:phone ?phone;
    k2:facebookUrl ?FACEBOOK_URL;
    k2:linkedinUrl ?LINKEDIN_URL;
    k2:twitterUrl ?TWITTER_URL;
    k2:logoUrl ?LOGO_URL;
    k2:alias ?alias1;
    k2:alias ?alias2;
    k2:alias ?alias3;
    k2:primaryRole ?cb_organizationRole_primary_role_URLIFY_URL;
    k2:numExits ?NUM_EXITS.
}}
where {
  service <rdf-mapper:ontorefine:PROJECT_ID> {
    bind(?c_uuid as ?uuid)
    bind(?c_name as ?name)
    bind(?c_permalink as ?permalink)
    bind(?c_cb_url as ?cb_url)
    bind(?c_rank as ?rank)
    bind(?c_created_at as ?created_at)
    bind(?c_updated_at as ?updated_at)
    bind(?c_legal_name as ?legal_name)
    bind(?c_roles as ?roles)
    bind(?c_domain as ?domain)
    bind(?c_homepage_url as ?homepage_url)
    bind(?c_country_code as ?country_code)
    bind(?c_state_code as ?state_code)
    bind(?c_region as ?region)
    bind(?c_city as ?city)
    bind(?c_address as ?address)
    bind(?c_postal_code as ?postal_code)
    bind(?c_status as ?status)
    bind(?c_short_description as ?short_description)
    bind(?c_category_list as ?category_list)
    bind(?c_num_funding_rounds as ?num_funding_rounds)
    bind(?c_total_funding_usd as ?total_funding_usd)
    bind(?c_total_funding as ?total_funding)
    bind(?c_total_funding_currency_code as ?total_funding_currency_code)
    bind(?c_founded_on as ?founded_on)
    bind(?c_last_funding_on as ?last_funding_on)
    bind(?c_closed_on as ?closed_on)
    bind(?c_employee_count as ?employee_count)
    bind(?c_email as ?email)
    bind(?c_phone as ?phone)
    bind(?c_facebook_url as ?facebook_url)
    bind(?c_linkedin_url as ?linkedin_url)
    bind(?c_twitter_url as ?twitter_url)
    bind(?c_logo_url as ?logo_url)
    bind(?c_alias1 as ?alias1)
    bind(?c_alias2 as ?alias2)
    bind(?c_alias3 as ?alias3)
    bind(?c_primary_role as ?primary_role)
    bind(?c_num_exits as ?num_exits)
    bind(iri(concat("cb/agent/",?uuid)) as ?cb_agent_uuid_URL)
    bind(strdt(?cb_url,xsd:anyURI) as ?CB_URL)
    bind(strdt(?rank,xsd:integer) as ?RANK)
    bind(REPLACE(?created_at,' ','T') as ?created_at_FIXDATE)
    bind(strdt(?created_at_FIXDATE,xsd:dateTime) as ?CREATED_AT_FIXDATE)
    bind(REPLACE(?updated_at,' ','T') as ?updated_at_FIXDATE)
    bind(strdt(?updated_at_FIXDATE,xsd:dateTime) as ?UPDATED_AT_FIXDATE)
    ?roles_SPLIT1 spif:split (?roles ',').
    bind(LCASE(REPLACE(REPLACE(REPLACE(?roles_SPLIT1, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?roles_SPLIT1_URLIFY)
    bind(iri(concat("cb/organizationRole/",?roles_SPLIT1_URLIFY)) as ?cb_organizationRole_roles_SPLIT1_URLIFY_URL)
    bind(strdt(?homepage_url,xsd:anyURI) as ?HOMEPAGE_URL)
    bind(LCASE(REPLACE(REPLACE(REPLACE(?status, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?status_URLIFY)
    bind(iri(concat("cb/organizationStatus/",?status_URLIFY)) as ?cb_organizationStatus_status_URLIFY_URL)
    ?category_list_SPLIT1 spif:split (?category_list ',').
    bind(LCASE(REPLACE(REPLACE(REPLACE(?category_list_SPLIT1, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?category_list_SPLIT1_URLIFY)
    bind(iri(concat("cb/industry/",?category_list_SPLIT1_URLIFY)) as ?cb_industry_category_list_SPLIT1_URLIFY_URL)
    bind(strdt(?num_funding_rounds,xsd:integer) as ?NUM_FUNDING_ROUNDS)
    bind(strdt(?total_funding_usd,xsd:decimal) as ?TOTAL_FUNDING_USD)
    bind(strdt(?total_funding,xsd:decimal) as ?TOTAL_FUNDING)
    bind(REPLACE(?founded_on,' ','T') as ?founded_on_FIXDATE)
    bind(strdt(?founded_on_FIXDATE,xsd:dateTime) as ?FOUNDED_ON_FIXDATE)
    bind(REPLACE(?last_funding_on,' ','T') as ?last_funding_on_FIXDATE)
    bind(strdt(?last_funding_on_FIXDATE,xsd:dateTime) as ?LAST_FUNDING_ON_FIXDATE)
    bind(REPLACE(?closed_on,' ','T') as ?closed_on_FIXDATE)
    bind(strdt(?closed_on_FIXDATE,xsd:dateTime) as ?CLOSED_ON_FIXDATE)
    bind(if(?employee_count in ("other","not provided","unknown"),?UNDEF,?employee_count) as ?employee_count_IFNOTNULL)
    bind(LCASE(REPLACE(REPLACE(REPLACE(?employee_count_IFNOTNULL, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?employee_count_IFNOTNULL_URLIFY)
    bind(iri(concat("cb/employeeCount/",?employee_count_IFNOTNULL_URLIFY)) as ?cb_employeeCount_employee_count_IFNOTNULL_URLIFY_URL)
    bind(strdt(?facebook_url,xsd:anyURI) as ?FACEBOOK_URL)
    bind(strdt(?linkedin_url,xsd:anyURI) as ?LINKEDIN_URL)
    bind(strdt(?twitter_url,xsd:anyURI) as ?TWITTER_URL)
    bind(strdt(?logo_url,xsd:anyURI) as ?LOGO_URL)
    bind(LCASE(REPLACE(REPLACE(REPLACE(?primary_role, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?primary_role_URLIFY)
    bind(iri(concat("cb/organizationRole/",?primary_role_URLIFY)) as ?cb_organizationRole_primary_role_URLIFY_URL)
    bind(strdt(?num_exits,xsd:integer) as ?NUM_EXITS)
    bind(?c_updated_at as ?c_updated_at1)
    bind(iri(concat("cb/graph/organizations/",?c_uuid)) as ?GRAPH)}
  <cb> k2:updatedAt ?UPDATED_AT_DATETIME.
  bind(replace(str(?UPDATED_AT_DATETIME),'T',' ') as ?UPDATED_AT)
  filter(?c_updated_at1 > ?UPDATED_AT)}
    bind(if(?employee_count in ("other","not provided","unknown"),?UNDEF,?employee_count) as ?employee_count_IFNOTNULL)
    bind(LCASE(REPLACE(REPLACE(REPLACE(?employee_count_IFNOTNULL, "[^\\p{L}0-9]", "_"), "_+", "_"), "^_|_$", "")) as ?employee_count_IFNOTNULL_URLIFY)
    bind(iri(concat("cb/employeeCount/",?employee_count_IFNOTNULL_URLIFY)) as ?cb_employeeCount_employee_count_IFNOTNULL_URLIFY_URL)

I would NOT think to write such queries by hand because the chance to make an error is too high.

VladimirAlexiev commented 2 years ago

Here's a challenge to the KG Construction CG:

The previous comment outlines Ontotext's approach using OntoRefine and generation of transformations from models.

Some values need special processing:

dachafra commented 2 years ago

@VladimirAlexiev could we transfer this to https://github.com/kg-construct/best-practices/issues? I think atm fits better the issue there

VladimirAlexiev commented 2 years ago

@dachafra Sure! Is there a way to do this automatically? BTW I posted it as a "Crunchbase challenge" to https://lists.w3.org/Archives/Public/public-kg-construct/2022Mar/0000.html

dachafra commented 2 years ago

done!

ziodave commented 2 years ago

I can provide the models

@VladimirAlexiev can you provide the models? 😇

VladimirAlexiev commented 2 years ago

@ziodave https://gist.github.com/VladimirAlexiev/d5d67feb002dbcfa6b3d4c3dd59b52da

namedgraph commented 2 years ago

We've done something like that using CSV2RDF for the Northwind Traders dataset. The data volume is much lower but the use case is similar. The transformation queries can be found here: https://github.com/AtomGraph/LinkedDataHub-Apps/tree/develop/demo/northwind-traders/queries/imports

VladimirAlexiev commented 2 years ago

It does look similar. But for large input files , making output files rather than inserting directly adds significant time. Also, Tarql cannot construct named graph per row (even though Jena support graph in Construct), so updating would be difficult. That's why we didn't use Construct in this case.

ktk commented 2 years ago

@VladimirAlexiev I can try on our pipeline stack @zazuko. Do you have a compiled vocabulary with all terms so I don't have to fetch them from the various files? Can be a textfile but something easily copy/pastable

VladimirAlexiev commented 2 years ago

@ktk Added to gist above: You can generate an ontology from cb-model.tsv that describes all classes and fields.