tetherless-world / nanomine-graph

the visualization web app for nanomine project
MIT License

Duplicate "Standard Names" in KG (Matrix materials) #20

Closed by mdeagen 4 years ago

mdeagen commented 4 years ago

Problem

Some of the "StdChemicalNames" (stored as type nm:compound) are non-unique.

Cause

Unknown, possibly during XML ingest/conversion

Bottom Line

Of the 65 distinct labels returned, only 61 are unique once capitalization is ignored, so 4 polymers carry a duplicate name. These are listed at the bottom.


Analysis

A query for all distinct standard chemical names (with role nm:Matrix) returns 65 results, but some of them are the same polymer name with alternate capitalization.

PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?compoundlabel WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ sio:hasRole [ a nm:Matrix ] ;
                                 a [ rdfs:label ?compoundlabel ] ]
}

The same search with the LCASE function applied yields 61 entries, which suggests that 4 material labels exist only as alternate capitalizations of other labels. See the list of 4 problematic polymer names at the bottom.

PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT (LCASE(?compoundlabel) AS ?lclabel) WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ sio:hasRole [ a nm:Matrix ] ;
                                 a [ rdfs:label ?compoundlabel ] ]
}
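
The comparison between the two queries above boils down to counting distinct labels with and without lowercasing. A minimal Python sketch of that check follows, for anyone who exports the labels and wants to repeat it offline. The function name `count_case_variants` and the sample labels are hypothetical and are not taken from the KG. Python's `str.lower` is assumed to behave like SPARQL's `LCASE` for these names.

```python
def count_case_variants(labels):
    """Return (distinct, distinct_ignoring_case) for an iterable of labels."""
    distinct = set(labels)
    # Lowercasing collapses labels that differ only in capitalization.
    folded = {label.lower() for label in distinct}
    return len(distinct), len(folded)


# Hypothetical labels for illustration only (not the real KG contents).
example_labels = ["Polymer A", "polymer a", "Polymer B", "Polymer C", "polymer c"]
exact, folded = count_case_variants(example_labels)
# The difference is the number of labels that are case-only variants.
print(exact, folded, exact - folded)
```

With the KG results, the two counts would correspond to the 65 and 61 reported above.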

Inspection of one of the XMLs (L290_S9_Si_2006.xml) shows that there is only 1 StdChemicalName for each MatrixComponent, so the problem is likely occurring during the XML ingest/conversion.

Looking at samples with more than one label reveals the problematic polymer names. (Note: multiple labels on one sample can also mean that the matrix comprises multiple constituents.)

PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?sample
       (COUNT(?compoundlabel) AS ?numlabels)
       (GROUP_CONCAT(DISTINCT ?compoundlabel; SEPARATOR=", ") AS ?ListOfLabels)
WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ sio:hasRole [ a nm:Matrix ] ;
                                 a [ rdfs:label ?compoundlabel ] ]
}
GROUP BY ?sample
HAVING (COUNT(?compoundlabel) > 1)
ORDER BY DESC(?numlabels)
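
Because a sample with several labels may be either a capitalization duplicate or a genuine multi-constituent matrix, the query results still need sorting into those two cases. The sketch below shows one way to do that in Python, assuming the results have been exported as (sample, label) pairs. The function name, the row format, and the sample data are all hypothetical.

```python
from collections import defaultdict


def split_multi_label_samples(rows):
    """Split samples with more than one matrix label into two lists.

    rows: iterable of (sample, label) pairs.
    Returns (case_only, multi_constituent), each a sorted list of samples.
    """
    labels_by_sample = defaultdict(set)
    for sample, label in rows:
        labels_by_sample[sample].add(label)

    case_only, multi_constituent = [], []
    for sample in sorted(labels_by_sample):
        labels = labels_by_sample[sample]
        if len(labels) < 2:
            continue  # a single label is not a duplicate candidate
        if len({label.lower() for label in labels}) == 1:
            # All labels collapse to one name: capitalization duplicate.
            case_only.append(sample)
        else:
            # At least two different names: likely several constituents
            # (such a sample may still contain a case variant as well).
            multi_constituent.append(sample)
    return case_only, multi_constituent


# Hypothetical rows for illustration only.
example_rows = [
    ("s1", "Polymer A"), ("s1", "polymer a"),
    ("s2", "Polymer B"), ("s2", "Polymer C"),
    ("s3", "Polymer A"),
]
print(split_multi_label_samples(example_rows))
```

Only the samples in the first list point to the capitalization problem; the second list needs manual review.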


The list of polymers with duplicate names due to alternate capitalization is:

In the above list, the names in bold should be kept.

mdeagen commented 4 years ago

This is a problem for Matrix materials only, not Filler materials.

For compounds with role nm:Filler, there are 35 unique names before and after applying LCASE, therefore there are currently no filler materials with duplicate capitalization.

mdeagen commented 4 years ago

# Show list of instances where sample has more than one "matrix" material (to check for duplicate names)
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?matrix; SEPARATOR=", ") AS ?duplicate_matrix_matls) WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
} 
GROUP BY ?sample
HAVING (COUNT(DISTINCT ?matrix) > 1)

Need to remove the following labels:

Why do only these four materials appear with a duplicate name? Error in SETLr?

mdeagen commented 4 years ago

SPARQL query that lists the duplicates directly:

# Show list of instances where sample has duplicate "matrix" label due to alternate capitalization
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?matrix ?duplicate WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix, ?duplicate ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
  FILTER (?matrix != ?duplicate)
  FILTER (LCASE(?matrix) = ?duplicate)
} 
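
The two FILTERs in the query above only match a pair when ?duplicate is the fully lowercased form of ?matrix, so a variant such as an all-caps spelling next to a title-case one would be missed. The Python sketch below mirrors that behaviour and contrasts it with a symmetric comparison that lowercases both sides. Both function names and the labels are hypothetical, and `str.lower` is assumed to stand in for `LCASE`.

```python
from itertools import permutations


def lowercase_duplicates(labels):
    """Mirror the query's FILTERs: matrix != duplicate and LCASE(matrix) = duplicate."""
    return sorted(
        (matrix, duplicate)
        for matrix, duplicate in permutations(set(labels), 2)
        if matrix != duplicate and matrix.lower() == duplicate
    )


def case_variant_pairs(labels):
    """Symmetric variant: report each pair of labels that differ only in case."""
    return sorted(
        (first, second)
        for first, second in permutations(set(labels), 2)
        if first < second and first.lower() == second.lower()
    )


# Hypothetical labels for illustration only.
example_labels = ["Polymer A", "polymer a", "POLYMER B", "Polymer B"]
print(lowercase_duplicates(example_labels))
print(case_variant_pairs(example_labels))
```

If every duplicate in the KG is the all-lowercase form, both approaches return the same pairs, and the original query is sufficient.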

mdeagen commented 4 years ago

@rashidsabbir I just noticed that these materials, while typically Matrix materials, also act as Surface Treatment materials for some samples in the KG (and Surface Treatment material names are not standardized by ChemProps)... Maybe this is causing the issue? Not sure how the error is manifesting, but seems like an odd coincidence.

mdeagen commented 4 years ago

The issue appears to have been resolved.