skoval / RISmed

RISmed is an R package for downloading and analyzing data from the NCBI databases

Regex to remove labels from abstract body. #20

Open bes827 opened 3 years ago

bes827 commented 3 years ago

I am trying to use a regex to remove the section labels from the abstract body so I can get the "clean" text. For example, I need to remove Label=\"INTRODUCTION\" NlmCategory=\"BACKGROUND\": as well as the labels for the other abstract sections (methods, results, and conclusion).

I tried the following regex, but it did not work: x$Abstract= stringr::str_remove_all(x$Abstract, "[A-Z]+\:")
Do you have any suggestions?

thank you

this is an example:

x$Abstract ="                 Label=\"INTRODUCTION\" NlmCategory=\"BACKGROUND\":Tenapanor is a first-in-class, minimally absorbed, small-molecule inhibitor of the gastrointestinal sodium/hydrogen exchanger isoform 3. This phase 3 trial assessed the long-term efficacy and safety of tenapanor 50 mg b.i.d. for the treatment of patients with irritable bowel syndrome with constipation (IBS-C).                 Label=\"METHODS\" NlmCategory=\"METHODS\":In this randomized double-blind study (ClinicalTrials.gov identifier: NCT02686138), patients with IBS-C received tenapanor 50 mg b.i.d. or placebo b.i.d. for 26 weeks. The primary endpoint was the proportion of patients who had a reduction of ≥30.0% in average weekly worst abdominal pain and an increase of ≥1 weekly complete spontaneous bowel movement from baseline, both in the same week, for ≥6 of the first 12 treatment weeks (6/12-week combined responder).                 Label=\"RESULTS\" NlmCategory=\"RESULTS\":Of the 620 randomized patients with IBS-C, 593 (95.6%) were included in the intention-to-treat analysis set (tenapanor: n = 293; placebo: n = 300) and 481 patients (77.6%) completed the 26-week treatment period. In the intention-to-treat analysis set (mean age: 45.4 years; 82.1% women), a significantly greater proportion of patients treated with tenapanor were 6/12-week combined responders than those treated with placebo (36.5% vs 23.7%; P < 0.001). Abdominal symptoms and global symptoms of IBS were significantly improved with tenapanor compared with placebo. Diarrhea, the most common adverse event, was typically transient and mild to moderate in severity. Diarrhea led to study drug discontinuation for 19 (6.5%) and 2 patients (0.7%) receiving tenapanor and placebo, respectively.                 Label=\"DISCUSSION\" NlmCategory=\"CONCLUSIONS\":Tenapanor 50 mg b.i.d. 
improved IBS-C symptoms over 26 weeks and was generally well tolerated, offering a potential new long-term treatment option for patients with IBS-C (see Visual abstract, Supplementary Digital Content 1, http://links.lww.com/AJG/B797). "

x$Abstract= stringr::str_remove_all(x$Abstract, "[A-Z]+\\:")   

bes827 commented 3 years ago

I have now tried typing the exact text I need removed (rather than a regex, since there are only 4 labels), and it seems to be working:

pm$Abstract= stringr::str_remove_all(pm$Abstract, "Label=\"INTRODUCTION\" NlmCategory=\"BACKGROUND\":")

pm$Abstract= stringr::str_remove_all(pm$Abstract, "Label=\"METHODS\" NlmCategory=\"METHODS\":")

pm$Abstract= stringr::str_remove_all(pm$Abstract, "Label=\"RESULTS\" NlmCategory=\"RESULTS\":")

pm$Abstract= stringr::str_remove_all(pm$Abstract, "Label=\"DISCUSSION\" NlmCategory=\"CONCLUSIONS\":")
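
The four fixed-string removals above could likely be collapsed into a single pattern. Note that "[A-Z]+\\:" does not match the labels because a closing quote sits between the uppercase word and the colon. The following is a minimal sketch, not part of RISmed: `strip_abstract_labels` is a hypothetical helper, and it assumes every label has the form Label="..." optionally followed by NlmCategory="..." and a colon. It uses base R so it runs without extra packages; `stringr::str_remove_all()` with the same pattern should behave equivalently.

```r
# Hypothetical helper (not provided by RISmed): removes section labels such as
# Label="METHODS" NlmCategory="METHODS": from an abstract string.
strip_abstract_labels <- function(abstract) {
  # Label="..." then an optional NlmCategory="..." attribute, then a colon.
  # Treating NlmCategory as optional is an assumption about the label format.
  pattern <- 'Label="[^"]*"(\\s+NlmCategory="[^"]*")?\\s*:'
  cleaned <- gsub(pattern, "", abstract, perl = TRUE)
  # Collapse the runs of spaces left around each removed label, then trim.
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

# Shortened example text in the same shape as the abstract in this issue.
example <- "   Label=\"METHODS\" NlmCategory=\"METHODS\":In this study, patients received tenapanor.   Label=\"RESULTS\" NlmCategory=\"RESULTS\":Of the 620 patients, 593 were analyzed. "
strip_abstract_labels(example)
```

Applied to a data frame column, this would be `pm$Abstract <- strip_abstract_labels(pm$Abstract)`, since `gsub()` and `trimws()` are vectorized. Matching on the label structure rather than on the four specific section names means labels with other names would also be removed, which may or may not be desirable.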

skoval commented 3 years ago

Which version of RISmed are you using? And please provide an example PMID to reproduce the type of Abstract you are describing.

bes827 commented 3 years ago

I am using version 2.2

and this is the PMID of the abstract above: 31934897

thank you

bes827 commented 3 years ago

@skoval is there a way to install the previous version of RISmed instead of the current one? When I go to the archive, I can only find the versions from 2017: https://cran.r-project.org/src/contrib/Archive/RISmed/

thank you