Closed by kaneplusplus 1 year ago
The normalization works fairly well for aggregate analyses, but I knew it had issues with specific trials. I meant to ask on Monday whether I should work on improving this (it seems the answer is yes; it is also an interesting problem for me).
When I used these before, I added an extra step after the mapping, not included here, in which I chose the candidate normalized condition that is most common across the entire set of trials. That filters out all of the clearly incorrect matches, such as movie names and other erroneous links. I think I have an approach for doing this through the look-up table and will add it to the package.
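A minimal sketch of that extra step, in base R. The candidate rows, column names, and the spurious "film"/"album" entries are made up for illustration and are not the package's internals:

```r
# Hypothetical candidate mappings: one row per (trial, raw condition, candidate).
candidates <- data.frame(
  nct_id = c("T1", "T1", "T2", "T3", "T3", "T4"),
  raw = c("Breast Cancer Stage II", "Breast Cancer Stage II", "Breast Cancer",
          "Pain, Chronic", "Pain, Chronic", "Chronic Pain"),
  candidate = c("Breast cancer", "Stage II (film)", "Breast cancer",
                "Pain", "Chronic (album)", "Pain"),
  stringsAsFactors = FALSE
)

# Count how often each candidate appears across the whole set of trials.
freq <- table(candidates$candidate)
candidates$n <- as.integer(freq[candidates$candidate])

# Sort so the most common candidate comes first within each raw condition,
# then keep only that first row per (trial, raw condition) pair.
ord <- order(candidates$nct_id, candidates$raw, -candidates$n)
candidates <- candidates[ord, ]
picked <- candidates[!duplicated(candidates[c("nct_id", "raw")]), ]

print(picked[c("nct_id", "raw", "candidate")])
```

Candidates that appear only once in the whole set lose to the term that many trials share, which is why one-off erroneous links drop out.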
Now, the problem with the first example you gave, "Recurrent Uterine Corpus Carcinoma", is quite tricky. My algorithm only looks at contiguous words to find the longest set of tokens that looks like a real disease. I agree the correct answer is "Uterine Carcinoma" or "Uterine Cancer", but all of the conditions here have the word "Corpus" in the middle, so the algorithm never finds the term we want.
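To make the failure concrete, here is a rough sketch of a longest-contiguous-match search. The function name and the tiny `known` table are hypothetical stand-ins, not the package's actual code or look-up table:

```r
# Hypothetical look-up table of known disease names (lower case).
known <- c("uterine carcinoma", "uterine cancer", "carcinoma", "breast cancer")

longest_contiguous_match <- function(text, known) {
  # Split on anything that is not a letter and drop empty tokens.
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  n <- length(tokens)
  # Try the longest windows first so the first hit is the longest match.
  for (len in rev(seq_len(n))) {
    for (start in seq_len(n - len + 1)) {
      phrase <- paste(tokens[start:(start + len - 1)], collapse = " ")
      if (phrase %in% known) return(phrase)
    }
  }
  NA_character_
}

# "Corpus" sits between "Uterine" and "Carcinoma", so no contiguous window
# spells "uterine carcinoma" and only the single word matches.
res_corpus <- longest_contiguous_match("Recurrent Uterine Corpus Carcinoma", known)
res_plain <- longest_contiguous_match("Recurrent Uterine Carcinoma", known)
print(c(res_corpus, res_plain))
```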
A standard trick is to use skip-grams (n-grams, but allowing a break between terms), but those will also introduce a lot more false positives. Do you think this is something that is fairly specific to cancers and needs to be addressed with special logic? Or should I try a skip-gram approach to find diseases with compound names that might have extra words in them?
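For reference, a small sketch of what skip-bigrams would look like on that example. The helper `skip_bigrams` and its `max_skip` parameter are hypothetical, written only to illustrate the idea:

```r
# Build all ordered word pairs that have at most `max_skip` words between them.
skip_bigrams <- function(tokens, max_skip = 1) {
  n <- length(tokens)
  out <- character(0)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      gap <- j - i - 1  # number of words skipped between token i and token j
      if (gap >= 0 && gap <= max_skip) {
        out <- c(out, paste(tokens[i], tokens[j]))
      }
    }
  }
  out
}

tokens <- c("recurrent", "uterine", "corpus", "carcinoma")
grams <- skip_bigrams(tokens, max_skip = 1)
print(grams)
```

Allowing one skipped word produces "uterine carcinoma", but it also produces pairs such as "recurrent corpus" that never appeared in the text as a phrase, which is where the extra false positives would come from.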
With the new code base I'm getting the following:
> library(ctrialsgov)
> ctgov_load_sample()
> dt <- ctgov_query()
> norms <- ctgov_norm_conditions(dt)
> norms
# A tibble: 2,616 × 3
   nct_id      condition                                norm_flag
   <chr>       <chr>                                    <lgl>
 1 NCT04332588 Breast cancer                            TRUE
 2 NCT05191797 Carcinoma                                TRUE
 3 NCT04781140 Attention deficit hyperactivity disorder TRUE
 4 NCT05183139 Multiple myeloma                         TRUE
 5 NCT05090579 Pain                                     TRUE
 6 NCT04466566 Healthy                                  FALSE
 7 NCT05232734 Depressed GCS                            FALSE
 8 NCT05160480 Prostate cancer                          TRUE
 9 NCT05160480 Breast cancer                            TRUE
10 NCT05160480 Neuroendocrine tumor                     TRUE
# … with 2,606 more rows
It always returns at least one result for each input study and never more than one result for each element of the conditions in the raw data. The norm_flag column indicates whether I was able to normalize the condition. Here are the rows in the sample for which no match was found:
> filter(norms, !norm_flag)
# A tibble: 374 × 3
   nct_id      condition                       norm_flag
   <chr>       <chr>                           <lgl>
 1 NCT04466566 Healthy                         FALSE
 2 NCT05232734 Depressed GCS                   FALSE
 3 NCT05208957 Perioperative Complication      FALSE
 4 NCT04893681 Dental Caries in Children       FALSE
 5 NCT05178849 Bone Marrow Biopsy              FALSE
 6 NCT05179148 Motivational Interviewing       FALSE
 7 NCT04391777 Healthy Young Adults            FALSE
 8 NCT05024188 Physical Abilities              FALSE
 9 NCT05139160 Nutrition, Healthy              FALSE
10 NCT04708704 Thyroid; Functional Disturbance FALSE
# … with 364 more rows
Most are studies of healthy participants or are not interventional drug trials. I would recommend filtering on norm_flag before doing any aggregate analysis.
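As a rough illustration of that filtering step, here is a base R sketch that rebuilds the first ten rows printed above as a plain data frame (rather than calling the package) and counts conditions after dropping the rows that were not normalized:

```r
# The first ten rows of the output above, typed in by hand for illustration.
norms <- data.frame(
  nct_id = c("NCT04332588", "NCT05191797", "NCT04781140", "NCT05183139",
             "NCT05090579", "NCT04466566", "NCT05232734", "NCT05160480",
             "NCT05160480", "NCT05160480"),
  condition = c("Breast cancer", "Carcinoma",
                "Attention deficit hyperactivity disorder", "Multiple myeloma",
                "Pain", "Healthy", "Depressed GCS", "Prostate cancer",
                "Breast cancer", "Neuroendocrine tumor"),
  norm_flag = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only rows where normalization succeeded, then count each condition.
kept <- norms[norms$norm_flag, ]
counts <- sort(table(kept$condition), decreasing = TRUE)
print(counts)
```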
If I run:
I get multiple single terms for a case with conditions:
Is there a general way to get this down to "Uterine cancer" or "Uterine carcinoma"?
Also, there are a few results that are a little wonky.
If we can't normalize the disease, it's fine to use the original condition. Can we have 1 normalized disease per condition listed?
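A small sketch of the behavior being asked for, with made-up inputs and hypothetical column names (`raw`, `normalized`): keep one row per listed condition and fall back to the original text when no normalized term is found.

```r
# Hypothetical raw conditions and the normalized term found for each (NA = none).
conds <- data.frame(
  raw = c("Recurrent Breast Cancer", "Healthy", "Chronic Pain"),
  normalized = c("Breast cancer", NA, "Pain"),
  stringsAsFactors = FALSE
)

# Flag whether normalization succeeded, and fall back to the raw text if not.
conds$norm_flag <- !is.na(conds$normalized)
conds$condition <- ifelse(conds$norm_flag, conds$normalized, conds$raw)

print(conds[c("raw", "condition", "norm_flag")])
```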