Issues with normalized conditions

kaneplusplus commented 2 years ago

If I run:

# A tibble: 503,362 × 2
# Groups:   nct_id [373,444]
   nct_id      condition
   <chr>       <chr>
 1 NCT01164735 Recurrence
 2 NCT01164735 Uterus
 3 NCT01164735 Carcinoma
 4 NCT01164735 Stage
 5 NCT01164735 Uterus
 6 NCT01164735 Cancer
 7 NCT01164735 Stage
 8 NCT01164735 Uterus
 9 NCT01164735 Cancer
10 NCT01198171 Stage

I get multiple single-word terms for a trial whose conditions are:

> ctgov$conditions[ctgov$nct_id == 'NCT01164735']
[1] "Recurrent Uterine Corpus Carcinoma|Stage III Uterine Corpus Cancer|Stage IV Uterine Corpus Cancer"

Is there a general way to get this down to "Uterine cancer" or "Uterine carcinoma"?

Also, there are a few results that are a little wonky.

> ctgov_norm_conditions(ctgov) %>%
+   distinct() %>%
+   filter(grepl("<i>", condition))
# A tibble: 6,380 × 2
# Groups:   nct_id [6,325]
   nct_id      condition
   <chr>       <chr>
 1 NCT03838302 <i>The Gifted</i> (season 1)
 2 NCT04327310 <i>Plasmodium falciparum</i>
 3 NCT04621279 <i>Newborn monument</i>
 4 NCT04535648 <i>Enterovirus</i>
 5 NCT04251702 <i>The Gifted</i> (season 2)
 6 NCT04001946 <i>The Gifted</i> (season 2)
 7 NCT04515784 <i>Post Traumatic</i>
 8 NCT05081752 <i>Star Trek: Enterprise</i>
 9 NCT04991324 <i>Inflammatory Bowel Diseases</i>
10 NCT03643887 <i>Clostridioides difficile</i> infection

If we can't normalize the disease, it's fine to use the original condition. Can we have 1 normalized disease per condition listed?

statsmaths commented 2 years ago

The normalization works fairly well for aggregate analyses, but I knew it had issues with specific trials. I meant to ask on Monday whether I should work on improving this (it seems the answer is yes; it is also an interesting problem for me).

When I used these before, I added an extra step after the mapping, not included here, in which I chose the candidate normalized condition that is most common across the entire set of trials. That removes the clearly incorrect matches, such as movie names and other erroneous links. I think I have an approach for doing this through the look-up table and will add that to the package.
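As a rough illustration of that extra step (a sketch only, not the package's code): the candidate table below is made up, and the column names raw, candidate, and n_trials are assumptions. For each raw condition it keeps the candidate that appears in the most trials overall.

```r
# Hypothetical candidate table: one row per (trial, raw condition, candidate).
# The rows are invented for illustration only.
cands <- data.frame(
  nct_id    = c("T1", "T1", "T2", "T3", "T3", "T4", "T5"),
  raw       = c("Enterovirus Infection", "Enterovirus Infection",
                "Breast Cancer", "Breast Cancer Stage II",
                "Breast Cancer Stage II", "Breast Neoplasms", "Enterovirus"),
  candidate = c("Enterovirus", "Star Trek: Enterprise",
                "Breast cancer", "Breast cancer", "Stage",
                "Breast cancer", "Enterovirus"),
  stringsAsFactors = FALSE
)

# Number of distinct trials in which each candidate appears.
counts <- vapply(split(cands$nct_id, cands$candidate),
                 function(ids) length(unique(ids)), integer(1))
cands$n_trials <- unname(counts[cands$candidate])

# For each (trial, raw condition) pair, keep the most common candidate.
# Ties are broken by row order because which.max() returns the first maximum.
key <- paste(cands$nct_id, cands$raw, sep = "||")
best <- do.call(rbind, lapply(split(cands, key), function(d) {
  d[which.max(d$n_trials), c("nct_id", "raw", "candidate")]
}))
rownames(best) <- NULL
best
```

Rare candidates such as a TV show title lose to any candidate that shows up across several trials, which is the effect described above.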

statsmaths commented 2 years ago

Now, the problem with the first example you gave, "Recurrent Uterine Corpus Carcinoma", is quite tricky. My algorithm only looks at contiguous words to find the longest run of tokens that looks like a real disease. I agree the correct answer is "Uterine Carcinoma" or "Uterine Cancer", but all of the conditions here have the word "Corpus" in the middle, so it never finds the term we want.
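To make the failure concrete, here is a minimal sketch of longest-contiguous-match look-up. It is an assumption about the general idea, not the package's implementation; the dictionary known and the helpers contiguous_ngrams() and longest_match() are hypothetical.

```r
# Hypothetical look-up table of known disease names (lower case).
known <- c("uterine cancer", "uterine carcinoma", "carcinoma", "cancer")

# All contiguous n-grams of a token vector, longest first.
contiguous_ngrams <- function(tokens) {
  n <- length(tokens)
  out <- character(0)
  for (len in rev(seq_len(n))) {
    for (start in seq_len(n - len + 1)) {
      out <- c(out, paste(tokens[start:(start + len - 1)], collapse = " "))
    }
  }
  out
}

# Return the longest contiguous n-gram found in the dictionary, or NA.
longest_match <- function(condition, dictionary) {
  tokens <- strsplit(tolower(condition), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  grams <- contiguous_ngrams(tokens)
  hit <- grams[grams %in% dictionary]
  if (length(hit) > 0) hit[1] else NA_character_
}

longest_match("Recurrent Uterine Corpus Carcinoma", known)
# "carcinoma": "uterine carcinoma" is never generated as a contiguous n-gram
```

Because "Corpus" sits between "Uterine" and "Carcinoma", the only dictionary entry this sketch can reach is the single word.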

A standard trick is to use skip-grams (n-grams that allow a gap between terms), but those will also introduce a lot more false positives. Do you think this is something that is fairly specific to cancers and needs to be addressed with special logic? Or should I try a skip-gram approach to find diseases with compound names that might have extra words in them?
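For reference, a small sketch of what skip-bigrams would do here, under the assumption of a toy dictionary; skip_bigrams() and known are hypothetical names, not part of ctrialsgov. It shows both the intended hit and the kind of false positive a wider gap can create.

```r
# Hypothetical look-up table (lower case); not the package's real table.
known <- c("uterine carcinoma", "uterine cancer",
           "kidney cancer", "bladder cancer")

# Bigrams that allow up to `max_skip` tokens between the two words.
skip_bigrams <- function(tokens, max_skip = 1) {
  n <- length(tokens)
  out <- character(0)
  for (i in seq_len(max(n - 1, 0))) {
    last <- min(n, i + 1 + max_skip)
    for (j in (i + 1):last) {
      out <- c(out, paste(tokens[i], tokens[j]))
    }
  }
  unique(out)
}

# Intended hit: skipping "corpus" recovers the compound name.
uterine <- skip_bigrams(c("recurrent", "uterine", "corpus", "carcinoma"),
                        max_skip = 1)
uterine[uterine %in% known]
# "uterine carcinoma"

# False positive: with a wider gap, "Kidney Stones, Bladder Cancer"
# also yields "kidney cancer", which the trial does not study.
mixed <- skip_bigrams(c("kidney", "stones", "bladder", "cancer"),
                      max_skip = 2)
mixed[mixed %in% known]
# "kidney cancer" "bladder cancer"
```

Keeping max_skip small limits the spurious pairs, at the cost of missing names with longer insertions.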

statsmaths commented 2 years ago

With the new code base I'm getting the following:

> library(ctrialsgov)
> ctgov_load_sample()
> dt <- ctgov_query()
> norms <- ctgov_norm_conditions(dt)
> norms

# A tibble: 2,616 × 3
   nct_id      condition                                norm_flag
   <chr>       <chr>                                    <lgl>
 1 NCT04332588 Breast cancer                            TRUE
 2 NCT05191797 Carcinoma                                TRUE
 3 NCT04781140 Attention deficit hyperactivity disorder TRUE
 4 NCT05183139 Multiple myeloma                         TRUE
 5 NCT05090579 Pain                                     TRUE
 6 NCT04466566 Healthy                                  FALSE
 7 NCT05232734 Depressed GCS                            FALSE
 8 NCT05160480 Prostate cancer                          TRUE
 9 NCT05160480 Breast cancer                            TRUE
10 NCT05160480 Neuroendocrine tumor                     TRUE
# … with 2,606 more rows

It always returns at least one result for each input study and never more than one result for each element of the conditions in the raw data. The norm_flag indicates whether I was able to normalize that condition or not. Here are the studies in the sample that don't find matches:

> filter(norms, !norm_flag)

# A tibble: 374 × 3
   nct_id      condition                       norm_flag
   <chr>       <chr>                           <lgl>
 1 NCT04466566 Healthy                         FALSE
 2 NCT05232734 Depressed GCS                   FALSE
 3 NCT05208957 Perioperative Complication      FALSE
 4 NCT04893681 Dental Caries in Children       FALSE
 5 NCT05178849 Bone Marrow Biopsy              FALSE
 6 NCT05179148 Motivational Interviewing       FALSE
 7 NCT04391777 Healthy Young Adults            FALSE
 8 NCT05024188 Physical Abilities              FALSE
 9 NCT05139160 Nutrition, Healthy              FALSE
10 NCT04708704 Thyroid; Functional Disturbance FALSE
# … with 364 more rows

Most are studies of healthy participants or are not interventional drug trials. I would recommend filtering on norm_flag before doing any aggregate analysis.
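A minimal sketch of that filtering step in base R. The data frame below is a hand-copied subset of the rows printed above, standing in for the real output of ctgov_norm_conditions(); the variable names normalized and counts are illustrative.

```r
# Small stand-in for the `norms` tibble, using a few rows from the output above.
norms <- data.frame(
  nct_id    = c("NCT04332588", "NCT05191797", "NCT04466566",
                "NCT05160480", "NCT05160480"),
  condition = c("Breast cancer", "Carcinoma", "Healthy",
                "Prostate cancer", "Breast cancer"),
  norm_flag = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only rows whose condition was successfully normalized.
normalized <- norms[norms$norm_flag, ]

# Count trials per normalized condition, most frequent first.
counts <- sort(table(normalized$condition), decreasing = TRUE)
counts
```

Dropping the unflagged rows first keeps raw strings such as "Healthy" from appearing as their own categories in the counts.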

presagia-analytics / ctrialsgov, issue #14