Closed (closed by mdeagen 4 years ago)
This is a problem for Matrix materials only, not Filler materials.
For compounds with role nm:Filler, there are 35 unique names both before and after applying LCASE, so there are currently no filler materials with duplicate capitalization.
# Show list of instances where sample has more than one "matrix" material (to check for duplicate names)
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT (GROUP_CONCAT(DISTINCT ?matrix; SEPARATOR=", ") AS ?duplicate_matrix_matls) WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
}
GROUP BY ?sample
HAVING (COUNT(DISTINCT ?matrix) > 1)
Need to remove the following labels:
Why do only these four materials appear with a duplicate name? Error in SETLr?
SPARQL query that lists the duplicates directly:
# Show list of instances where sample has duplicate "matrix" label due to alternate capitalization
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?matrix ?duplicate WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix, ?duplicate ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
  FILTER (?matrix != ?duplicate)
  FILTER (LCASE(?matrix) = ?duplicate)
}
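Note that the filter LCASE(?matrix) = ?duplicate only matches pairs in which one of the two labels is entirely lowercase. A minimal sketch of a more general check, assuming the same graph pattern as the queries above, groups labels by a lowercased key and reports every group with more than one spelling. The variable names ?key and ?variants are illustrative. Unlike the query above, this compares labels across the whole graph, not only labels attached to the same node.

```sparql
# Sketch: list every set of "matrix" labels that differ only by capitalization
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?key (GROUP_CONCAT(DISTINCT ?matrix; SEPARATOR=" | ") AS ?variants) WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
  # Case-insensitive key used for grouping (illustrative name)
  BIND (LCASE(STR(?matrix)) AS ?key)
}
GROUP BY ?key
HAVING (COUNT(DISTINCT ?matrix) > 1)
```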
@rashidsabbir I just noticed that these materials, while typically Matrix materials, also act as Surface Treatment materials for some samples in the KG (and Surface Treatment material names are not standardized by ChemProps)... Maybe this is causing the issue? Not sure how the error is manifesting, but seems like an odd coincidence.
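One way to probe this hypothesis might be to list every role under which the compounds with duplicate labels appear. The sketch below assumes that roles are typed in the same way as nm:Matrix in the queries above (sio:hasRole [ a ?role ]) and deliberately avoids naming a Surface Treatment class, since its exact IRI is not given in this thread.

```sparql
# Sketch: for each label that has a capitalization variant, show all roles it occurs with
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?matrix (GROUP_CONCAT(DISTINCT STR(?role); SEPARATOR=", ") AS ?roles) WHERE {
  ?part a ?compound ;
        sio:hasRole [ a ?role ] .
  ?compound rdfs:label ?matrix, ?duplicate .
  # Keep only labels that differ from a sibling label by capitalization alone
  FILTER (STR(?matrix) != STR(?duplicate))
  FILTER (LCASE(STR(?matrix)) = LCASE(STR(?duplicate)))
}
GROUP BY ?matrix
```

If the four affected polymers turn out to be the only ones listed with both a matrix role and another role, that would support the coincidence noted above, though it would not by itself show where the second label is introduced.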
The issue appears to have been resolved.
Problem
Some of the "StdChemicalNames" (stored as type nm:compound) are non-unique.
Cause
Unknown, possibly during XML ingest/conversion
Bottom Line
Of the 61 unique names (ignoring capitalization), only 4 polymers appear under duplicate labels; these are listed at the bottom.
Analysis
A query for all distinct standard chemical names (with role nm:Matrix) returns 65 results, but some names are duplicated because the same polymer name appears with alternate capitalization. The same search using the LCASE function yields 61 entries, which suggests that 4 material labels exist as alternate capitalizations of otherwise identical strings. See the list of 4 problematic polymer names at the bottom.
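The counting queries themselves are not shown in this issue. A minimal sketch of how the two counts could be obtained in one query, assuming the same graph pattern as the queries earlier in this thread, is given below; the output variable names are illustrative.

```sparql
# Sketch: count distinct "matrix" labels with and without case folding
PREFIX nm: <http://nanomine.org/ns/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT (COUNT(DISTINCT ?matrix) AS ?n_labels)
       (COUNT(DISTINCT LCASE(STR(?matrix))) AS ?n_labels_lcase) WHERE {
  ?sample a nm:PolymerNanocomposite ;
          sio:hasComponentPart [ a [ rdfs:label ?matrix ] ;
                                 sio:hasRole [ a nm:Matrix ] ]
}
```

A difference between the two counts indicates labels that collapse to the same string under LCASE. Replacing nm:Matrix with nm:Filler would give the corresponding check for filler materials.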
Inspection of one of the XMLs (L290_S9_Si_2006.xml) confirms that there is only 1 StdChemicalName for each MatrixComponent, so the problem is likely occurring during the XML ingest/conversion.
Looking at samples with more than one label reveals the problematic polymer names (Note: Duplicate labels can also indicate that the matrix comprises multiple constituents).
The list of polymers with duplicate names due to alternate capitalization is:
In the above list, the names in bold should be kept.