lter / LTER-core-metabase

RDB model, based on the GCE LTER Metabase, with adaptations from other LTER sites
https://lter.github.io/LTER-core-metabase/
GNU General Public License v3.0
9 stars 4 forks source link

extend metabase model for semantic annotation #15

Open mobb opened 5 years ago

mobb commented 5 years ago

ask Sven @gastil

mobb commented 5 years ago

their conversation below

https://github.com/gastil/litterbox/blob/master/sql/semantic_annotation_db_model_suggestion.sql

Hmm. I was thinking even simpler. (this probably won't run but it should be close)

psuedocode:

create table semantic_annotation_terms (
  id serial primary key,
  term text, // text label for selecting the term
  vocb_url text unique
)

create table semantic_annotation_attributes_terms (
  id serial primary key,
  dataset_id integer,
  entity_id integer,
  attribute_id integer,
  annotation_id integer,
  constraint "fk1" foreign key (attribute_id) references "DataSetVariables2" (dataset_id, entity_id, attribute_id)
  constraint "fk2" foreign key (annotation_id) references semantic_annotation_terms (id)
)
gastil commented 5 years ago

I suggest we order the columns of the many-to-many xref table as (di, ei, ai, term_id) instead of having the id for the term be first. Reason is that there are lots of xref tables in mb and all the others have (di, ei, ai) to the left of the thing being xref-ed... I think. I realize that column order does not matter to the db. But it matters for user readability.

The above table structure does not yet accommodate semantic annotation terms for taxonomy. I want the model (the tables) to provide for semantic annotation at all the levels/places it goes into EML2.2. I think there are 5 places? @mobb ? di, ei, ai, taxonomy and additionalMetadata? Or maybe it can go anywhere there is a thing. If the latter is true, then let's just handle the places we expect it to be used in LTER datasets. Otherwise we'd have to have a column like xpath! (No, not suggesting that.)

Maybe taxonomy needs its own table for semantic annotation. Or, if the terms are 1-to-1 with rows of another taxonomy table then add them to that.

gastil commented 5 years ago

Reminder to self: get an example csv from the postdoc at nceas (Steve?) to be sure our mb table for semantic annotation can ingest the terms he gives sites, to put the terms first in our dictionary table (parent table, a cv of cv terms), then into the xref table to put the SA at dataset, entity, or attribute (or other) level in eml.

Side note on table naming convention: Currently we have EMLTableNames, ListTableNames, and DataSetTableNames. SA is not precisely EML, but if units and vocab are, then I'll suggest EMLSemanticTerms rather than ListSemanticTerms. Hopefully these terms will be given us at network level as well as specific to sites. (If the terms cv were limited to sites' own, then ListSemanticTerms. The xref table is clearly DataSetSemanticTerms. Note Terms has fewer letters than Annotation, but if you all prefer DataSetSemanticAnnotation, that's maybe better?)

more later

gastil commented 4 years ago

Revision: a lot of what I wrote below is not right. Ignore me.

Margaret provided an example of the terms Steven Chong is finding for Arctic Data Center datasets. (See screenshot below). My idea is to provide a way to ingest the information into Metabase. This table is not a good storage design. Maybe the first step is to load this table into scratch, then run queries on it to INSERT INTO mb-table FROM SELECT... FROM (scratch-table).

Record_number is just a row number; no meaning. identifier is the dataset doi (in ADC) and could be joined to DataSetID. entityName same as mb. We might opt for EntitySort Order, but can join that. attributeName same as mb. Ditto. attributeDefinition just do not use; It should match what is already in mb. We could do a QC to see if it does indeed match. attributeLabel ditto. attributeUnit ditto. ECSO_ this is just the very tail end of the term URI. This example only uses one vocab, ECSO. This is the key column. (Key having both meanings: important, and how the terms are indexed.)

the select-into might go something like ...

INSERT INTO lter_metabase."DataSetSemanticAnnotation" ("TermID", "DataSetID", "EntityName", "AttributeName", 'is-measurement-of') -- default object property for an att
FROM (
SELECT s."TermID", d."DataSetID", entityname, attributename
FROM scratch.file_from_steven_chong f
JOIN scratch.dataset_dois d
ON d.doi=f.identifier
JOIN lter_metabase."EMLSemanticAnnotation" s
ON 'ECSO_uri_head_stuff_goes_here'||f.ecso = s."TermURI"
);

The joins above are to translate a dataset doi into a DataSetID and to translate the short-hand tail number provided in the csv into a full semantic annotation term URI, and use the TermID for that. The parent table for terms we already knew we needed. Likely just these columns: TermID, TermURI, TermLabel. Another parent table, not used here because I'm assuming the default object property for attributes of is-a-measurement-of, might have these columns: ObjectID, ObjectURI, ObjectLabel. That is-a-measurement-of would be the ObjectID, and would more likely just be the character m.

The table not mentioned before is the dataset doi table. I had not realized we would be provided the doi to indicate which dataset. (A quirk: a doi is a specific revision of a dataset. But a given attribute is likely to keep the same SA for all revisions. So the look-up is not as trivial as I first assumed. You could do a pre-processing step to resolve all the doi's. Or you could make a table of every revision of every dataset and all those dois. Hmmm.)

Screen-shot of Steven Chong's spreadsheet: ADC_annotation_spreadsheet

atn38 commented 4 years ago

Ooh, sneak peek. I didn't realize Steven was gonna provide dataset-specific annotations. (I imagined it would be a set of terms and corresponding URI, and sites then go and apply those terms as they judge fitting to their own datasets.) In this case SELECT INTOs totally make sense. We'd need a table (or set of tables) in MB to be SELECT'ed INTO, so I began there. Here's a draft:

CREATE TABLE lter_metabase."DataSetAnnotations" (
    "DataSetID" integer NOT NULL,
    "EntitySortOrder" integer,
    "ColumnName" varchar(100),
    "AnnotationID" integer NOT NULL
    "PropertyURI" varchar(500) NOT NULL,
    "PropertyLabel" varchar(100) NULL,
    "ValueURI" varchar(500) NOT NULL,
    "ValueLabel" varchar(100) NULL
);

ALTER TABLE lter_metabase."DataSetAnnotations" 
    ADD CONSTRAINT datasetannotations_pk 
    PRIMARY KEY ("DataSetID","EntitySortOrder","ColumnName","AnnotationID");

ALTER TABLE lter_metabase."DataSetAnnotations" 
    ADD CONSTRAINT datasetannotations_dataset_fk 
    FOREIGN KEY ("DataSetID") REFERENCES lter_metabase."DataSet"("DataSetID") ON UPDATE CASCADE;

I thought about making DataSetAnnotations, DataSetEntityAnnotations, and DataSetAttributeAnnotations. But I thought we can use the "put 0 if not applicable at this level" trick to specify where the annotation should go. Caveat: cannot foreign key beyond DataSetID. Each observation unit (row in table) should be a complete triplet, so PropertyURI and Label is included. The Object is the same as your Term.

Thoughts? We also are not including annotations at eml/annotations and eml/additionalMetadata, but I think that's fine.

atn38 commented 4 years ago

Ok, back at this again after yesterday's Slack discussion:

SA could be modelled roughly like keywords and taxonomy

Parent control tables: a ListSemanticAnnotations: TermID, TermLabel, TermSource (or TermOntology, basically where the term came from). PK is TermID + TermSource a ListSemanticOntologies: Ontology

Xref tables: DataSetAnnotations: DataSetID, PropertyTermID, ValueTermID

MetaEgress: somehow constructs URI from the TermID and TermSource. Not sure if we can do this, and certainly not for any given ontology the user might enter. In this case, it's better to use TermURI in place of TermID. We already use URLs insteads of IDs in the case of user IDs.

On Mon, Sep 23, 2019 at 2:51 PM Gastil notifications@github.com wrote:

Margaret provided an example of the terms Steven Chong is finding for Arctic Data Center datasets. My idea is to provide a way to ingest the information into Metabase. This table is not a good storage design. Maybe the first step is to load this table into scratch, then run queries on it to INSERT INTO mb-table FROM SELECT... FROM (scratch-table).

Recordnumber is just a row number; no meaning. identifier is the dataset doi (in ADC) and could be joined to DataSetID. entityName same as mb. We might opt for EntitySort Order, but can join that. attributeName same as mb. Ditto. attributeDefinition just do not use; It should match what is already in mb. We could do a QC to see if it does indeed match. attributeLabel ditto. attributeUnit ditto. ECSO this is just the very tail end of the term URI. This example only uses one vocab, ECSO. This is the key column. (Key having both meanings: important, and how the terms are indexed.)

the select-into might go something like ...

INSERT INTO lter_metabase."DataSetSemanticAnnotation" ("TermID", "DataSetID", "EntityName", "AttributeName", 'is-measurement-of') -- default object property for an att FROM ( SELECT s."TermID", d."DataSetID", entityname, attributename FROM scratch.file_from_steven_chong f JOIN scratch.dataset_dois d ON d.doi=f.identifier JOIN lter_metabase."EMLSemanticAnnotation" s ON 'ECSO_uri_head_stuff_goes_here'||f.ecso = s."TermURI" );

The joins above are to translate a dataset doi into a DataSetID and to translate the short-hand tail number provided in the csv into a full semantic annotation term URI, and use the TermID for that. The parent table for terms we already knew we needed. Likely just these columns: TermID, TermURI, TermLabel. Another parent table, not used here because I'm assuming the default object property for attributes of is-a-measurement-of, might have these columns: ObjectID, ObjectURI, ObjectLabel. That is-a-measurement-of would be the ObjectID, and would more likely just be the character m.

The table not mentioned before is the dataset doi table. I had not realized we would be provided the doi to indicate which dataset. A quirk: a doi is a specific revision of a dataset.

... more later ...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lter/LTER-core-metabase/issues/15?email_source=notifications&email_token=AKAZD5UGXDCQLCORANLYANTQLEM4XA5CNFSM4G2WGVL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7MBZMQ#issuecomment-534256818, or mute the thread https://github.com/notifications/unsubscribe-auth/AKAZD5WMQBZF4DGR47EUDLLQLEM4XANCNFSM4G2WGVLQ .

mobb commented 4 years ago

@gastil asked me for the expected length of ontology URIs and labels for the semantics table. Currently, the longest URI or label is 97 characters. So if you set these to varchar 200 you should be safe.

For reference, the ECSO vocab @mobb used to get that info for us: http://bioportal.bioontology.org/ontologies/ECSO/?p=classes&conceptid=http%3A%2F%2Fpurl.dataone.org%2Fodo%2FECSO_00000524

And the (temporary) schemaSpy she made for us to browse the semantic_annotation schema.

The SQL for semantic annotation is in the "patch" file 33_semantic_annotation.sql in the migration branch. This is still in review and will not be included in the V1.0 release.

atn38 commented 4 years ago

in vw_eml_semantic_annotation, column names should be datasetid and entity_position.

i would also use attributeName instead of columnPosition in the view, which necessitates using it in the DataSetAttributeAnnotations tables (to avoid adding another table to the join). Also attributeName is part of the PK for DataSetAttributes, not columnPosition.

Want me to make the edits and push it?

gastil commented 4 years ago

I expected you would want to tailor the view. Do not put attributeName into the SA table just to avoid a join. That is what join is for. Notice I added a UNIQUE constraint to DataSetAttributes so that (di,ei,ai) can be used as an alternate key. That also was needed to prevent users from entering non-unique column positions, as one user already bumped into.

Although I prefer (di,ei,ai), I could see using att name instead of column position for the purpose of giving the human editor hints. If we do that, we should not have both column position and att name. The human editors are not going to be entering SA without the parent tables all open for reference. And even more likely is to be loading the SA from a csv, likely trimmed down from a much wider csv that includes column names (and labels and descriptions and units) that had been used in earlier steps.

Do you want me to write the join from column position to att name?

atn38 commented 4 years ago

Ha, I was hoping my lazy excuse would fly under the radar. I can write the join (experimenting with MetaEgress atm which is why I was fuddling with the view). Note that DataSetAttributeEnumeration and DataSetAttributeMissingCodes both use attributeName for the ai part.

On Tue, Oct 8, 2019 at 5:18 PM Gastil notifications@github.com wrote:

I expected you would want to tailor the view. Do not put attributeName into the SA table just to avoid a join. That is what join is for. Notice I added a UNIQUE constraint to DataSetAttributes so that (di,ei,ai) can be used as an alternate key. That also was needed to prevent users from entering non-unique column positions, as one user already bumped into.

Although I prefer (di,ei,ai), I could see using att name instead of column position for the purpose of giving the human editor hints. If we do that, we should not have both column position and att name. The human editors are not going to be entering SA without the parent tables all open for reference. And even more likely is to be loading the SA from a csv, likely trimmed down from a much wider csv that includes column names (and labels and descriptions and units) that had been used in earlier steps.

Do you want me to write the join from column position to att name?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lter/LTER-core-metabase/issues/15?email_source=notifications&email_token=AKAZD5WEP4B5UAZTOFGKAKDQNUBL5A5CNFSM4G2WGVL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAVZMPI#issuecomment-539727421, or mute the thread https://github.com/notifications/unsubscribe-auth/AKAZD5TUAQZQFWMEU2YRBDLQNUBL5ANCNFSM4G2WGVLQ .