Open aclayton555 opened 3 months ago
Scope of this is to do a little digging to see if there is something going on that we can/need to fix. Upon investigation, decide whether further escalation to FAIR Data is needed.
The linked ticket starts with an issue with Skin of lower limb and hip
In HTAN.model.jsonld the displayName for Skinoflowerlimbandhip is "skin of lower limb and hip" (lowercase s). Not sure if that could be the cause.
Skin of lower limb and hip (case sensitive) is a valid value for 5 attributes in the data model: [Progression or Recurrence Anatomic Site, Treatment Anatomic Site, Site of Resection or Biopsy, Tissue or Organ of Origin, Additional Topography]
skin of lower limb and hip (case sensitive) is a valid value for 1 attribute in the data model: [Melanoma Biopsy Resection Sites]
My guess is that this is leading to a namespace clash on conversion to the JSON-LD.
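A rough sketch of how such a clash could be spotted. The label rule here (drop whitespace, upper-case the first character) is only an assumption inferred from the examples in this thread (e.g. "skin of lower limb and hip" showing up as Skinoflowerlimbandhip); it is not confirmed schematic behaviour, and `to_label` / `find_label_clashes` are hypothetical helper names.

```python
from __future__ import annotations

from collections import defaultdict


def to_label(display_name: str) -> str:
    """Assumed label rule: drop whitespace, upper-case the first character."""
    compact = "".join(display_name.split())
    return compact[:1].upper() + compact[1:]


def find_label_clashes(valid_values: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return labels that more than one distinct display name collapses to."""
    names_by_label: dict[str, set[str]] = defaultdict(set)
    for values in valid_values.values():
        for name in values:
            names_by_label[to_label(name)].add(name)
    return {label: names for label, names in names_by_label.items() if len(names) > 1}


# Small excerpt of the attribute -> valid values mapping discussed above.
valid_values = {
    "Site of Resection or Biopsy": ["Skin of lower limb and hip"],
    "Melanoma Biopsy Resection Sites": ["skin of lower limb and hip", "skin other"],
}

clashes = find_label_clashes(valid_values)
print(clashes)
```

If the assumed rule is close to what the converter does, both spellings end up under a single label, and only one displayName can survive in the JSON-LD.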
Next step is to investigate whether skin of lower limb and hip has already been used for Melanoma Biopsy Resection Sites in existing or released metadata.
This essentially becomes a clone/offshoot of our investigation into title-case/lower-case near duplicates and the actions needed in the backlog for #176.
Note all the Melanoma Biopsy Resection Sites valid values appear to be in lowercase and are likely causing clashes with other attributes:
"skin of scalp, skin of eye lid, skin of nose, skin of lip, skin of ear, skin of neck, skin of other parts of face, skin of chest, skin of back, skin of abdomen, skin of trunk-other, skin of breast, skin of upper limb and shoulder, skin of palm, skin of lower limb and hip, skin of sole, skin of penis, skin of scrotum, skin of vulva, skin other, skin NOS, Not Reported",
Melanoma Biopsy Resection Sites is part of the MelanomaTier3 component.
We count the number of rows per distinct value of this attribute in Google BigQuery:
SELECT
  Melanoma_Biopsy_Resection_Sites,
  COUNT(*) AS n
FROM `htan-dcc.combined_assays.MelanomaTier3`
GROUP BY Melanoma_Biopsy_Resection_Sites
ORDER BY n DESC
Melanoma_Biopsy_Resection_Sites | n |
---|---|
Not Reported | 17 |
Skin of upper limb and shoulder | 13 |
Skin of back | 9 |
Skin NOS | 8 |
Skin of lower limb and hip | 6 |
Skin of scalp | 3 |
Skin of abdomen | 2 |
Skin of sole | 1 |
Skin of vulva | 1 |
Skin of ear | 1 |
skin other | 1 |
Skin of chest | 1 |
These are all in first-letter-uppercase, suggesting that the lowercase valid values in the data model have not been followed (maybe they were not implemented at the time?).
The only weird thing is that skin other is lowercase. To confirm how this is appearing in the data model.
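The comparison above can be sketched like this: take the valid values listed for Melanoma Biopsy Resection Sites and the distinct values returned by the BigQuery query, and sort each submitted value into exact match, case-only match, or unknown. `classify` is a hypothetical helper for illustration, not part of schematic.

```python
# Valid values for Melanoma Biopsy Resection Sites, as listed in the data model above.
MODEL_VALUES = [
    "skin of scalp", "skin of eye lid", "skin of nose", "skin of lip",
    "skin of ear", "skin of neck", "skin of other parts of face",
    "skin of chest", "skin of back", "skin of abdomen", "skin of trunk-other",
    "skin of breast", "skin of upper limb and shoulder", "skin of palm",
    "skin of lower limb and hip", "skin of sole", "skin of penis",
    "skin of scrotum", "skin of vulva", "skin other", "skin NOS", "Not Reported",
]

# Distinct submitted values, copied from the BigQuery result above.
SUBMITTED = [
    "Not Reported", "Skin of upper limb and shoulder", "Skin of back",
    "Skin NOS", "Skin of lower limb and hip", "Skin of scalp",
    "Skin of abdomen", "Skin of sole", "Skin of vulva", "Skin of ear",
    "skin other", "Skin of chest",
]


def classify(submitted, model_values):
    """Split submitted values into exact, case-only, and unknown matches."""
    exact = set(model_values)
    folded = {value.casefold() for value in model_values}
    report = {"exact": [], "case_only": [], "unknown": []}
    for value in submitted:
        if value in exact:
            report["exact"].append(value)
        elif value.casefold() in folded:
            report["case_only"].append(value)
        else:
            report["unknown"].append(value)
    return report


report = classify(SUBMITTED, MODEL_VALUES)
for bucket, values in report.items():
    print(bucket, values)
```

Only Not Reported and skin other match the lowercase list exactly; the rest differ only by case.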
The only occurrence of skin other (case insensitive) in the data model is for Melanoma Biopsy Resection Sites.
All the other actual values submitted for Melanoma Biopsy Resection Sites appear either in Site of Resection or Biopsy (e.g. Skin of lower limb and hip) or Additional Topography (e.g. Skin of sole).
Note Additional Topography appears to be used only in the SRRS Biospecimen component, so just for the SRRS TNP and not for general HTAN center usage.
Next will look at Yes vs yes.
In the CSV: "yes" is a valid value for "Treatment or Therapy" only, whereas title-case "Yes" is more frequently used.
In the data model we see that upper-case Yes is used as the valid value within the JSON-LD:
"@id": "bts:TreatmentorTherapy",
"@type": "rdfs:Class",
"rdfs:comment": "A yes/no/unknown/not applicable indicator related to the administration of therapeutic agents received.",
"rdfs:label": "TreatmentorTherapy",
"rdfs:subClassOf": [
{
"@id": "bts:Therapy"
}
],
"schema:isPartOf": {
"@id": "http://schema.biothings.io"
},
"schema:rangeIncludes": [
{
"@id": "bts:Yes"
},
{
"@id": "bts:No"
},
{
"@id": "bts:Unknown"
},
{
"@id": "bts:NotReported"
}
],
"sms:displayName": "Treatment or Therapy",
"sms:required": "sms:false",
"sms:validationRules": []
},
My hypothesis is that where there are case differences, the JSON-LD converter is now harmonizing on the title-case version. I wonder if in the past it kept both, or harmonized on the lower-case version.
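To make the hypothesis concrete, here is a minimal sketch using a trimmed copy of the TreatmentorTherapy node above. The `to_id` rule (drop whitespace, upper-case the first character, add the bts: prefix) is an assumption inferred from the ids in this thread, not confirmed converter behaviour.

```python
import json

# Trimmed copy of the bts:TreatmentorTherapy node shown above.
NODE_JSON = """
{
  "@id": "bts:TreatmentorTherapy",
  "schema:rangeIncludes": [
    {"@id": "bts:Yes"},
    {"@id": "bts:No"},
    {"@id": "bts:Unknown"},
    {"@id": "bts:NotReported"}
  ],
  "sms:displayName": "Treatment or Therapy"
}
"""


def to_id(display_name: str, prefix: str = "bts") -> str:
    """Assumed id rule: drop whitespace, upper-case the first character."""
    compact = "".join(display_name.split())
    return f"{prefix}:{compact[:1].upper()}{compact[1:]}"


node = json.loads(NODE_JSON)
allowed_ids = {entry["@id"] for entry in node["schema:rangeIncludes"]}

# Resolve a few display names from the CSV to ids under the assumed rule.
resolved = {name: to_id(name) for name in ("Yes", "yes", "Not Reported")}
distinguishable = resolved["Yes"] != resolved["yes"]

for name, node_id in resolved.items():
    print(f"{name!r} -> {node_id} (in range: {node_id in allowed_ids})")
print("Yes/yes distinguishable in JSON-LD:", distinguishable)
```

Under this assumption "Yes" and "yes" resolve to the same id, so whichever displayName is attached to that id is the only spelling the JSON-LD can carry.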
Action for next sprint: escalate to FAIR. Suggest @aditigopalan work with them to confirm this hypothesis or understand how cases are handled for the JSON-LD.
Looking back to the Aug 2023 data model release, I don't see a change in behavior.
This is a problem we will need to engage with FAIR Data on in the future to figure out how to clean this up based on the latest expected behavior of schematic. Push this back to backlog and mark for renewal.
Ticket for us to look into potential causes of a couple of issues that DCC members have seen, where the case of certain valid values (e.g. "Yes" vs "yes") is throwing errors.
Recent example: https://sagebionetworks.jira.com/browse/HTAN-402
Alex also mentioned that he encountered issues with this when recently interacting with the Publications schema.