Open Tasilee opened 5 years ago
Much (or all) of the vocabulary will come out of the framework as a technical specification, probably with additional supporting vocabularies (such as for values for data quality dimension). There is still a need to express the tests themselves as a formal specification (s.l.) and move this towards a TDWG standard.
There may (probably) be terms associated with the Tests and Assertions beyond the Framework. The Framework's terms are broader than just Darwin Core - the TG2 tests need to define some terms that are outside the Framework (CORE is one that comes to mind). Whether it makes sense or not to expand the Framework terms to cover these is probably worth discussing.
I propose that the framework should have a distinct vocabulary product consisting of the framework terms and their controlled vocabularies of values. I propose that a task group be spawned specifically to create this. Tests and assertions rely on these for rigorous definition, so to me it has highest priority as a new vocabulary.
@tucotuco sounds like a deliverable from TG1.
Agree!
Are there terms that we may be using in TG2 that are outside of the Framework such that we need a separate vocabulary (but where most terms link to the Framework, or to Darwin Core)? I will go through the Tests and pull out a list of terms that I think may need definitions.
We do have terms in TG2 that are beyond the Framework. I'll work with @ArthurChapman to generate the list.
In the PSSR-CORE (Citizen Science) document from 2017 (https://www.wilsoncenter.org/sites/default/files/wilson_171204_meta_data_f2.pdf), one of the tasks mentioned is:
1. Understand and develop a common vocabulary for discussing the range of data quality practices in citizen science.
Peter Brenton is going to send me the name of someone from that working group with whom we should liaise
@Tasilee @pzermoglio and others. What columns do we require in our draft dq vocabulary? Some initial suggestions
Term | Definition | Source | Reference | Link | GUID |
Let me suggest the list of columns found in the header column in this Audubon Core source document: https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/audubon-column-mappings.csv
In particular:
label | rdfs_comment | dcterms_description | term_localName | term_isDefinedBy | term_created | term_modified |
---|---|---|---|---|---|---|
(=skos:prefLabel (en)) | (=skos_definition) | (=dcterms:description) | (with term_isDefinedBy forms the guid) | (=dcterms:isPartOf) | (=dcterms:created) | (=dcterms:modified) |
We should also check skos for appropriate skos terms for source, reference, and link.
Thanks @chicoreus - I will look at that. We have a few different processes - the DQ Vocabulary - and that will depend a lot on what @pzermoglio comes up with. TG1 - Vocabulary will form a major part of the Vocabulary. I am looking at extracting the terms from the tests and just want to make sure we capture what we need at this stage so that we can then add the terms to the main Vocabulary we develop and not have to revisit things later.
Those columns of @ArthurChapman are 95% the table from my keynote :) and ah yes, there is SKOS
For Darwin Core, the full set of columns to manage the term definitions, usages, and examples is:
iri,label,definition,comments,examples,organized_in,issued,status,replaces,rdf_type,term_iri,abcd_equivalence,flags
From @tucotuco: "To me it is clear that the Framework will result in a vocabulary that should be made into a standard. To me this is separate from a possible standard arising from the tests and assertions. These are two distinct products to me, with the latter relying on the former, thus increasing the priority of the former."
"Steve Baskauf clarified that the TDWG Standards Documentation Standard (SDS; https://www.tdwg.org/standards/sds/, in its single document, the "TDWG Standards Documentation Specification") describes how to create Data standards (for Vocabularies) as well as Best Current Practices documents. The Vocabularies of Values Best Current Practices Document must conform with that document, just as any vocabularies of values must also conform to the specifications set out in the SDS. The DQIG believes that a Vocabularies of Values Best Current Practices document is needed to provide more specific and common guidance on vocabularies of values construction and maintenance - for example, guidance on the type of vocabulary to use (Thesaurus, Vocabulary, Dictionary, Ontology, etc.), and how to deal with synonymy, multiple languages, etc."
I suggest then a rename of this issue to "TG2-" and tags. Alan, Miles...can create a separate issue. :)
I have added the terms above that I feel may need defining for the Tests. I have attempted to put in a definition - although some are still blank. If you have any suggestions, comments, etc. please comment.
The Terms in UPPER CASE - are terms used in the Title of the Test
I am not sure that we need to define CORE. It is a working term and the criteria by which tests are included will, I am sure, be described in the Introduction to the Standard. I suggest we delete it from the Vocabulary.
Does "Output Type" come from the Framework, @allankv? If not, I suggest we change the term in the tests, as it doesn't make a lot of sense when you look at the required values (Validation, Amendment, Notification and Measure).
@ArthurChapman I think "output type" is just mapping to owl:type, see: https://github.com/kurator-org/kurator-ffdq/blob/master/competencyquestions/rdf/ffdq.owl In the conversion to a csv file, it gets renamed "Type". No need to change the value in the current markdown tables in the issues, but also no need to define Type in a tdwg namespace.
@ArthurChapman "CORE" is at least the label of the Data Quality Profile that comprises the suite of tests produced by TG2, but I think more likely it is an instance of a use case that helps identify the elements of the validation policy, measurement policy, and enhancement policy that form the data quality profile for the tests. See the Data Quality Needs section of the framework: https://github.com/tdwg/bdq/wiki/TG1-Framework-Cheat-Sheet
@chicoreus. You have added Information Element to, for example, COORDINATES, but this isn't an Information Element; it is a term we use in the title to indicate the Target. Sometimes the Target is a dwc:term but not always, so I don't think we should define it as such.
@chicoreus There are several like that: COORDINATES, GEOGRAPHY, POLYNOMIAL, YEARMONTHDAY, YEARENDDAYOFYEARSTARTDAYOFYEAR. Generally our test names go VALIDATION_TARGET_ACTION etc., where the target is often an abbreviation of an Information Element (like YEAR). In several other cases, e.g. MAXELEVATION, the target refers to one Information Element (MAXELEVATION is defined in the spreadsheet by saying "see dwc:maxElevationInMeters"). The ones listed at the start of this comment (COORDINATES, etc.) can only be defined as a combination of several information elements.
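To make the naming pattern concrete, here is a minimal sketch of splitting a test name of the TYPE_TARGET_ACTION form into its parts. The function name `split_test_label` is hypothetical, the set of types is taken from the Output Type values mentioned in this thread, and the rule that the target is exactly the second underscore-separated token is an assumption, not something the tests specify.

```python
# Hypothetical helper: the types are the Output Type values named in the
# thread (upper-cased); the exact label grammar is an assumption.
TEST_TYPES = {"VALIDATION", "AMENDMENT", "NOTIFICATION", "MEASURE"}


def split_test_label(label: str) -> tuple[str, str, str]:
    """Split a TYPE_TARGET_ACTION label into (type, target, action).

    The target is assumed to be the second underscore-separated token;
    everything after it is treated as the action.
    """
    parts = label.split("_")
    if len(parts) < 3 or parts[0] not in TEST_TYPES:
        raise ValueError(f"not a TYPE_TARGET_ACTION label: {label!r}")
    return parts[0], parts[1], "_".join(parts[2:])


print(split_test_label("AMENDMENT_TAXONID_FROM_TAXON"))
```

Composite targets such as YEARMONTHDAY come back as a single target token under this rule; mapping that token to its underlying terms would need a separate lookup.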
The current definition of Information Elements given in the Framework does include a mix of things (Coordinates, Date, Time, Species, Specimen, Observation). Doing this exercise (TG2 and the Vocabulary), I am not sure the Framework Definition is a good one - or we have been a little inconsistent in our use of it to refer only to Darwin Core or Dublin Core Terms (except for bdq:isMarine - see below).
The only Information Element we don't have defined (through Darwin Core or Dublin Core) is bdq:isMarine (#51), and it would be nice to be able to describe that test without having to create a new Information Element.
Now I see two options:
In any case I think we need to look at the examples given in the Framework for IE - and rather than species we use Polynomial, etc.
@chicoreus I have added another term "Response.status" - can you please define this.
@chicoreus I have removed Output Type, as you said elsewhere that when you export it to a csv file this is changed to Type. Thus we don't need a definition. Thanks.
Please note that the Definitions listed above are only definitions that are specific to the Tests. The full list of definitions - e.g. those that refer to other sources (Darwin Core, Dublin Core, TGN, The DQ Framework, etc.) are all included in a Spreadsheet which can be found at https://drive.google.com/open?id=1NCMFz_hBIACuuzruxo2mAfIeM_dNqwog0n2sPIf8SFw
@ArthurChapman regarding Response.Status. See the tables in https://github.com/tdwg/bdq/issues/142#issuecomment-376734516 and https://github.com/tdwg/bdq/issues/142#issuecomment-382475645
Regarding Information Element. COORDINATES, etc, are definitely information elements in the sense of the framework. Allan (and us in the publication on the framework) have used this form (and that specific case) of a composite information element multiple times. We've had to push hard to get the framework to accept the lists of specific terms as also being part of the definition of an information element, as well as this general label (the specific terms are necessary for implementation).
@chicoreus Then should we change the name of what we have as "Information Element" in the tests? to Terms or Term Elements perhaps?
@ArthurChapman I don't think so. They are both Information Elements under the framework, and then we'd have to create labels for information elements made up of single terms.
Perhaps clarifying more: we've got tests that take information elements composed of single Darwin Core terms (e.g. dwc:year), then others that take information elements composed of more than one term (e.g. yearmonthday(dwc:year, dwc:month, dwc:day)). Then, quite separately, we've got labels (like TIME) to categorize tests into those with information elements covering similar concepts.
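The distinction between single-term and composite information elements can be sketched as a small data structure. This is only an illustration: the class name `InformationElement` and its fields are hypothetical and are not taken from the Framework's RDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationElement:
    """Hypothetical model: a label plus the terms that make it up."""

    label: str
    terms: tuple[str, ...]

    @property
    def is_composite(self) -> bool:
        # More than one term means a composite information element.
        return len(self.terms) > 1


# A single-term element and a composite one, as described in the comment.
YEAR = InformationElement("YEAR", ("dwc:year",))
YEARMONTHDAY = InformationElement(
    "YEARMONTHDAY", ("dwc:year", "dwc:month", "dwc:day")
)

print(YEAR.is_composite, YEARMONTHDAY.is_composite)
```

Under this model both kinds are the same type, so no separate label is needed for elements made up of a single term.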
Thanks @chicoreus - that is very helpful. BTW we have done away with NAME, TIME, SPACE and OTHER (except as Labels in GitHub) because they can be adequately covered by the Darwin Core Classes Taxon, Event, Location, Occurrence and Record-level Terms.
In discussion with @Tasilee, and in looking at our BISS paper, we have decided to change 'Action' to 'Response' as terms like 'EMPTY' 'PRECISIONINSECONDS', etc. are not Actions. Thus we now have the tests in the form of
I have modified and updated the Vocabulary for TG2. I have added some new terms, and some of the terms from the Framework (they are noted) with a link - however note that not all the Framework definitions are finalized. Terms from the Framework are in Italics in the Term column. In some cases the definition, as written in the Framework, doesn't coincide 100% with the way we are using the term in the Tests, so I have modified the definition and made a note that it differs from the Framework Definition. In doing this, I think it is clear that some of the Framework definitions may need to be modified.
Please check them thoroughly, and make any comments, suggestions, etc.
@chicoreus. Just trying to get my head around the definitions of the Namespace elements. I am not sure about the bdq:use.... elements e.g.
bdq:useEarliestValidDate (should this be bdq:earliestValidDate for consistency with ...ValidDepth and ...ValidElevation? or bdq:earliestDate)
bdq:useEarliestValidDate: A Parameter (q.v.) that ..... (@chicoreus - help needed)
bdq:earliestValidDate: A Parameter (q.v.) that optionally establishes the earliest date in a parameterized test. A default date is supplied in cases where a parameter is not set at the time the test is run.
@ArthurChapman, when a parameter such as bdq:earliestValidDate is composed with a test specification that states the parameter should be optionally applied, then another parameter is entailed to assert whether or not the earliestValidDate parameter should be applied.
Expectation would be in the form:
bdq:earliestValidDate=1700 bdq:useEarliestValidDate=true, test for dates back to 1700.
bdq:earliestValidDate= bdq:useEarliestValidDate=true, test for dates back to the specified default value for earliestValidDate.
bdq:earliestValidDate=1700 bdq:useEarliestValidDate=false, dates have no lower bound.
bdq:earliestValidDate= bdq:useEarliestValidDate=false, dates have no lower bound.
bdq:use{Foo}: a parameter that, when equal to true, asserts that the parameter foo should be applied in a test where its application is optional; if no value is provided for foo, then the default value is applied. When the useFoo parameter is false, the test uses neither a given value of foo nor the specified default value of foo.
Note that use{Foo} parameters will only be coupled with foo parameters when a test specification asserts that the application of the foo parameter in the test is optional, e.g. a test that specifies an optional lower bound with a default lower limit may use that default lower limit, may use a specified lower limit, or may not test at all for a lower limit.
Given bdq:foo default=1: (1) where foo= and useFoo=true, the default value of foo, 1, is used in the test. (2) where foo=2 and useFoo=true, the provided value of foo, 2, is used in the test. (3) where foo= and useFoo=false, foo is not tested for. (4) where foo=2 and useFoo=false, foo is not tested for.
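The four cases above can be sketched as a single function. This is a minimal illustration, not an implementation from the test suite: the name `effective_bound` is hypothetical, an empty parameter (`foo=`) is modelled as `None`, and a return value of `None` stands for "foo is not tested for".

```python
from typing import Optional


def effective_bound(value: Optional[int], use: bool, default: int) -> Optional[int]:
    """Resolve which value of foo, if any, a test should apply.

    value   -- the supplied foo, or None when foo is empty
    use     -- the useFoo parameter
    default -- the default specified for foo
    """
    if not use:
        # useFoo=false: neither the supplied value nor the default is used.
        return None
    if value is None:
        # useFoo=true with foo empty: fall back to the default.
        return default
    # useFoo=true with foo supplied: use the supplied value.
    return value


# The four cases for bdq:foo with default=1.
for value, use in [(None, True), (2, True), (None, False), (2, False)]:
    print(value, use, effective_bound(value, use, default=1))
```

With bdq:earliestValidDate=1700 and bdq:useEarliestValidDate=true this yields 1700; with useEarliestValidDate=false it yields no lower bound whatever value is supplied.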
I have added definitions for all the bdq namespace elements (bdq:...). I have also added a definition for "parameterized test". Could you please check these definitions before I add them to the BISS paper?
Nice work @ArthurChapman
Should we add "CHANGED" and "NOT CHANGED" as we are likely to standardize on these terms? Is there another context to use "AMENDED"? There may be a few others in the Expected Responses. I'll start checking.
Added NOT_AMENDED, per discussion on 2022 Feb 27.
Some TERMS that we may need to add to Glossary [NB as added, add an "x" between the square brackets]
Terms to be edited
Terms to delete?
Thanks Arthur
I'm unsure of the context of some of these terms. I will add discussion of the column headers of the test data worksheet to the next agenda.
Following up on vocabularies, we have confusion with #164, given @chicoreus's comment about 4 vocabularies required for the standard and a lack of consistent context, which often comes first in the Definition column.
May I suggest
I add a new COLUMN 3 called something like "Context" with values such as "DQ-DIMENSION" and "Warning Type" that are currently embedded in the Definitions. I find the current structure messy and inconsistent. Example: Reference to FFU is too broad.
We need to check that we have terms and definitions from the column headers of the various tables we use. For example, we use the term "Label" in what I would call the test specifications (the top table on the test issues), e.g., "AMENDMENT_TAXONID_FROM_TAXON" but this is not in the vocabulary. Ditto for example "Comment" in the test data worksheet. Should this be "Test.Data.Comment" or something else to make it unique? Yes, some terms may not be 'normative', but if we use them in a consistent context, it does no harm to include them here. I am happy to add in those terms with the understanding that we ALL need to review the terms and their parameters at some point when the test data is finalized.
On Tue, 08 Mar 2022 16:23:14 -0800 Lee Belbin wrote:
For example, we use the term "Label" in what I would call the test specifications (the top table on the test issues), e.g., "AMENDMENT_TAXONID_FROM_TAXON"
This is the rdfs:label.
See the RDF representation of the tests (I've just updated today)
431467d6-9b4b-48fa-a197-cd5379f5e889 is the identifier for a Specification; AMENDMENT_TAXONID_FROM_TAXON is its label.
I have added "response.comment" into the list above of terms that need to be added.
I have also added "rdfs:label" into the terms that need to be added above
OK. I've done a first pass through the table and naturally, there are plenty of things to discuss. The new layout seems far better to me.
I have added "Response.comment" and "rdfs:label"
I've ticked those off in the list of terms to be added above.
I have added "null", "NOT_COMPLIANT" and "RUN_HAS_RESULT" to the vocabulary.
@Tasilee Note that we have a file with lots more terms, many not included within this file but with definitions. See https://docs.google.com/spreadsheets/d/1NCMFz_hBIACuuzruxo2mAfIeM_dNqwog0n2sPIf8SFw/edit#gid=1530751621
I have added "non-printing characters" and attempted to define them. Also "RUN_HAS_RESULT"
Following email discussions with respect to Test #101: the question was whether the values in a primary Darwin Core term are Consistent with the values in the atomic terms (e.g. dwc:scientificName with dwc:genus, dwc:specificEpithet and dwc:infraspecificEpithet) when one or more of the atomic terms are EMPTY. It was agreed that, for the purposes of the tests, if one or more of the atomic terms is empty and the others are consistent, then the test should return COMPLIANT for consistency (for example, if dwc:infraspecificEpithet is empty but dwc:genus and dwc:specificEpithet are consistent with the values in dwc:scientificName, then it is COMPLIANT). Thus I have changed the definition of Consistency
from:
"Agreement among related Information Elements (q.v.) in the data."
to:
"Agreement among related Information Elements (q.v.) that are present in the data. Note that missing Information Elements do not make a test Inconsistent."
I have tried to add COMPLETE and NOT_COMPLETE to the Glossary - wording needs checking @chicoreus
COMPLETE: An assertion of a MEASURE (q.v.) where the VALIDATION (q.v.) Result.results (q.v.) from all included records in the dataset are COMPLIANT (q.v.).
NOT_COMPLETE: An assertion of a MEASURE (q.v.) where not all the VALIDATION (q.v.) Result.results (q.v.) from all included records in the dataset are COMPLIANT (q.v.).
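Read literally, the two draft definitions amount to an "all records COMPLIANT" check over the VALIDATION results. A minimal sketch, assuming each record's validation result is available as a plain string; the function name `measure_completeness` is hypothetical.

```python
def measure_completeness(validation_results: list[str]) -> str:
    """Return COMPLETE only when every validation result is COMPLIANT.

    Note: an empty list yields COMPLETE here (vacuously true); how a
    dataset with no included records should be reported is not covered
    by the draft definitions.
    """
    if all(result == "COMPLIANT" for result in validation_results):
        return "COMPLETE"
    return "NOT_COMPLETE"


print(measure_completeness(["COMPLIANT", "COMPLIANT"]))
print(measure_completeness(["COMPLIANT", "NOT_COMPLIANT"]))
```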
Looking at the definitions. We have a term
Test prerequisite - Prerequisites to the test being run in the form of: fields having values, tests that need to be run before the current test, availability of a specified source authority (q.v.), etc.
I am not sure we use this term anywhere anymore. We agreed that tests don't have an order (other than VALIDATION-AMENDMENT-VALIDATION). The other parts of the definition are handled by INTERNAL_PREREQUISITES_NOT_MET and EXTERNAL_PREREQUISITES_NOT_MET. Do we just delete this term?
Terms in the bdqffdq namespace are from the Fitness for Use Framework (Veiga et al. 2017). Use the reference to the Framework Definitions for more details and examples. The use of a vocabulary term in a test specification without a namespace prefix (sometimes represented in all UPPER CASE) implies that the bdq: or bdqffdq: namespace is applicable. Note that wherever "DQ" is used in a definition it implies "Data Quality", and wherever "FFU Framework" is used it refers to the "Fitness for Use Framework" (Veiga et al. 2017).
Supplement: GitHub Label Terms
These are terms that are outside the Standard but that have been used as either GitHub Labels or TestFields in the BDQ GitHub.