ncihtan / data-models

Schema.org Data Models for HTAN

Investigate odd csv -> json-ld behaviour #368

Open aclayton555 opened 3 months ago

aclayton555 commented 3 months ago

Ticket for us to look into potential causes of a couple of issues that DCC members have seen, where the case of certain valid values (e.g. "Yes" vs "yes") is throwing errors.

Recent example: https://sagebionetworks.jira.com/browse/HTAN-402

Alex also mentioned that he encountered issues with this when recently interacting with the Publications schema.

aclayton555 commented 3 months ago

Scope of this is to do a little digging to see whether there is something going on that we can or need to fix. After investigating, decide whether further escalation to FAIR Data is needed.

adamjtaylor commented 3 months ago

The linked ticket starts with an issue with Skin of lower limb and hip

In HTAN.model.jsonld the displayName for Skinoflowerlimbandhip is "skin of lower limb and hip" (lowercase s). Not sure if that could be the cause.

Skin of lower limb and hip (case sensitive) is a valid value for 5 attributes in the data model: [Progression or Recurrence Anatomic Site, Treatment Anatomic Site, Site of Resection or Biopsy, Tissue or Organ of Origin, Additional Topography].

skin of lower limb and hip (case sensitive) is a valid value for 1 attribute in the data model: [Melanoma Biopsy Resection Sites].

My guess is that this is leading to a namespace clash on conversion to the JSON-LD.
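
The suspected clash can be illustrated with a small sketch. It assumes, based only on the labels visible in HTAN.model.jsonld (e.g. Skinoflowerlimbandhip, TreatmentorTherapy), that a label is formed by removing spaces from the display name and upper-casing the first character. `to_label` and `find_label_clashes` are hypothetical helpers for illustration, not schematic's actual implementation.

```python
from collections import defaultdict


def to_label(display_name: str) -> str:
    """Assumed label rule: drop whitespace, upper-case the first character."""
    joined = "".join(display_name.split())
    return joined[:1].upper() + joined[1:]


def find_label_clashes(display_names):
    """Return labels that more than one distinct display name collapses to."""
    by_label = defaultdict(set)
    for name in display_names:
        by_label[to_label(name)].add(name)
    return {
        label: sorted(names)
        for label, names in by_label.items()
        if len(names) > 1
    }


names = [
    "Skin of lower limb and hip",  # valid value for 5 attributes
    "skin of lower limb and hip",  # valid value for Melanoma Biopsy Resection Sites
    "Not Reported",
]
clashes = find_label_clashes(names)
print(clashes)
```

If the real converter behaves like this assumed rule, both spellings would compete for a single node, and only one display name could survive in the JSON-LD.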

Next step is to investigate if skin of lower limb and hip has been already used for Melanoma Biopsy Resection Sites in existing or released metadata.

adamjtaylor commented 3 months ago

This essentially becomes a clone/offshoot of our investigation into title-case/lower-case near duplicates and the actions needed in the backlog for #176

adamjtaylor commented 3 months ago

Note all the Melanoma Biopsy Resection Sites valid values appear to be in lowercase and are likely causing clashes with other attributes

Melanoma Biopsy Resection Sites,Biopsy resection sites specific to melanoma (not covered in Tiers 1 and 2),"skin of scalp, skin of eye lid, skin of nose, skin of lip, skin of ear, skin of neck, skin of other parts of face, skin of chest, skin of back, skin of abdomen, skin of trunk-other, skin of breast, skin of upper limb and shoulder, skin of palm, skin of lower limb and hip, skin of sole, skin of penis, skin of scrotum, skin of vulva, skin other, skin NOS, Not Reported",,,FALSE,Melanoma Tier 3,,,

adamjtaylor commented 3 months ago

Melanoma Biopsy Resection Sites is part of the MelanomaTier3 component

We count the number of entries for each distinct value of this attribute using Google BigQuery:

SELECT
  Melanoma_Biopsy_Resection_Sites,
  COUNT(*) AS n
FROM `htan-dcc.combined_assays.MelanomaTier3`
GROUP BY Melanoma_Biopsy_Resection_Sites
ORDER BY n DESC

Melanoma_Biopsy_Resection_Sites n
Not Reported 17
Skin of upper limb and shoulder 13
Skin of back 9
Skin NOS 8
Skin of lower limb and hip 6
Skin of scalp 3
Skin of abdomen 2
Skin of sole 1
Skin of vulva 1
Skin of ear 1
skin other 1
Skin of chest 1

Almost all of these have an uppercase first letter, suggesting that the lowercase valid values in the data model have not been followed (maybe they were not implemented at the time?).

The only odd one out is skin other, which is lowercase. Need to confirm how this appears in the data model.
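
As a rough cross-check, the submitted values from the query above can be compared against the valid values listed in the CSV row for Melanoma Biopsy Resection Sites. This is only a sketch: `classify` is a hypothetical helper, and it assumes validation is a plain case-sensitive string comparison, which may not match what schematic actually does.

```python
# Valid values copied from the Melanoma Biopsy Resection Sites CSV row.
MODEL_VALID_VALUES = [
    "skin of scalp", "skin of eye lid", "skin of nose", "skin of lip",
    "skin of ear", "skin of neck", "skin of other parts of face",
    "skin of chest", "skin of back", "skin of abdomen", "skin of trunk-other",
    "skin of breast", "skin of upper limb and shoulder", "skin of palm",
    "skin of lower limb and hip", "skin of sole", "skin of penis",
    "skin of scrotum", "skin of vulva", "skin other", "skin NOS",
    "Not Reported",
]

# Distinct values returned by the BigQuery query.
SUBMITTED_VALUES = [
    "Not Reported", "Skin of upper limb and shoulder", "Skin of back",
    "Skin NOS", "Skin of lower limb and hip", "Skin of scalp",
    "Skin of abdomen", "Skin of sole", "Skin of vulva", "Skin of ear",
    "skin other", "Skin of chest",
]


def classify(value, valid_values):
    """Label a value as an exact match, a case-only mismatch, or absent."""
    if value in valid_values:
        return "exact"
    if value.lower() in {v.lower() for v in valid_values}:
        return "case-only mismatch"
    return "not in model"


report = {value: classify(value, MODEL_VALID_VALUES) for value in SUBMITTED_VALUES}
exact = [v for v, status in report.items() if status == "exact"]
case_only = [v for v, status in report.items() if status == "case-only mismatch"]

print("exact:", exact)
print("case-only mismatch:", case_only)
```

Under that assumption, only Not Reported and skin other match the model exactly; every other submitted value differs from its valid value by case alone.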

adamjtaylor commented 3 months ago

The only occurrence of skin other (case insensitive) in the data model is for Melanoma Biopsy Resection Sites

adamjtaylor commented 3 months ago

All the other actual values submitted for Melanoma Biopsy Resection Sites appear either in Site of Resection or Biopsy (eg Skin of lower limb and hip) or Additional Topography (eg Skin of sole)

Note Additional Topography appears to be only used in the SRRS Biospecimen component - so just for the SRRS TNP and not for general HTAN center usage

adamjtaylor commented 3 months ago

Next will look at Yes vs yes

adamjtaylor commented 2 months ago

In the CSV, lowercase "yes" is a valid value for "Treatment or Therapy" only, whereas title-case "Yes" is used more frequently.

In the data model, we see that title-case Yes is the valid value used within the JSON-LD:

            "@id": "bts:TreatmentorTherapy",
            "@type": "rdfs:Class",
            "rdfs:comment": "A yes/no/unknown/not applicable indicator related to the administration of therapeutic agents received.",
            "rdfs:label": "TreatmentorTherapy",
            "rdfs:subClassOf": [
                {
                    "@id": "bts:Therapy"
                }
            ],
            "schema:isPartOf": {
                "@id": "http://schema.biothings.io"
            },
            "schema:rangeIncludes": [
                {
                    "@id": "bts:Yes"
                },
                {
                    "@id": "bts:No"
                },
                {
                    "@id": "bts:Unknown"
                },
                {
                    "@id": "bts:NotReported"
                }
            ],
            "sms:displayName": "Treatment or Therapy",
            "sms:required": "sms:false",
            "sms:validationRules": []
        },
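
A minimal sketch of how a submitted value could be checked against this node's `schema:rangeIncludes`. The node below is abbreviated from the fragment above. `match_value` is a hypothetical helper, the mapping from a display value to a `bts:` id (strip spaces, add the prefix) is an assumption inferred from ids such as `bts:NotReported`, and "Maybe" is an invented value included only to show the no-match case.

```python
import json

# Abbreviated copy of the bts:TreatmentorTherapy node.
NODE_JSON = """
{
    "@id": "bts:TreatmentorTherapy",
    "schema:rangeIncludes": [
        {"@id": "bts:Yes"},
        {"@id": "bts:No"},
        {"@id": "bts:Unknown"},
        {"@id": "bts:NotReported"}
    ],
    "sms:displayName": "Treatment or Therapy"
}
"""


def allowed_ids(node):
    """Collect the @id of every entry in schema:rangeIncludes."""
    return [entry["@id"] for entry in node.get("schema:rangeIncludes", [])]


def match_value(value, node, prefix="bts:"):
    """Compare a display value with the allowed ids (assumed id rule)."""
    candidate = prefix + "".join(value.split())
    ids = allowed_ids(node)
    if candidate in ids:
        return "exact"
    if candidate.lower() in {i.lower() for i in ids}:
        return "case-only mismatch"
    return "no match"


node = json.loads(NODE_JSON)
results = {v: match_value(v, node) for v in ["Yes", "yes", "Not Reported", "Maybe"]}
print(results)
```

With only `bts:Yes` in the range, a lowercase "yes" has no id of its own to match, which is consistent with the case-related errors reported in this ticket.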
adamjtaylor commented 2 months ago

My hypothesis is that where there are case differences, the JSON-LD converter is now harmonizing on the title-case version. I wonder whether in the past it kept both, or harmonized on the lower-case version.

Action for next sprint: escalate to FAIR. Suggest @aditigopalan work with them to confirm this hypothesis or understand how case is handled for the JSON-LD.

adamjtaylor commented 2 months ago

Looking back to the Aug 2023 data model release, I don't see a change in behavior.

aclayton555 commented 2 months ago

This is a problem we will need to engage with FAIR Data on in the future, to figure out how to clean this up based on the latest expected behavior of schematic. Push this back to the backlog and mark for renewal.