verena-schwaiger / dmpethics

Making ethics approval applications machine-actionable.
MIT License

Regex query PubMedCentral results #1

Open verena-schwaiger opened 5 years ago

verena-schwaiger commented 5 years ago

First attempt: I used XQuery and saved the matched IDs to a text file.

Regex: at least one uppercase letter followed by at least three digits.

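(: Collect every regex match from the body paragraphs of the PMC result set. :)
let $doc1 := doc("pmc_result.xml")
let $regex := "[A-Z]+[0-9]{3,}"
for $paragraph in $doc1/pmc-articleset/article/body/sec/p
  (: Atomize the paragraph once and run the regex on its text value. :)
  let $text := data($paragraph)
  return data(analyze-string($text, $regex)/fn:match)

Daniel-Mietchen commented 5 years ago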

This regex mostly yields grant IDs or accession numbers for things like clinical trials or metabolites, while missing most of the few ethics-related identifiers.

For instance, the result "ISRCTN32823720" comes from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909152/ , which states

The proposed research received ethics approval from the York Research Ethics Committee (08/H1311/66). The trial recruited 233 patients between November 2008 and June 2009. Written informed consent has been obtained from all participants. Published results are expected in 2011.

It is this "08/H1311/66" that we would be interested in.
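
To make the gap concrete, here is a minimal sketch in Python (not the XQuery used in this thread; `re.findall` stands in for `analyze-string`) that runs the first regex over the paragraph quoted above. The variable names are made up for illustration.

```python
import re

# The regex from the first query: uppercase letter(s) followed by 3+ digits.
FIRST_REGEX = re.compile(r"[A-Z]+[0-9]{3,}")

# The paragraph quoted from PMC2909152.
paragraph = (
    "The proposed research received ethics approval from the York Research "
    "Ethics Committee (08/H1311/66). The trial recruited 233 patients between "
    "November 2008 and June 2009. Written informed consent has been obtained "
    "from all participants. Published results are expected in 2011."
)

matches = FIRST_REGEX.findall(paragraph)
print(matches)  # ['H1311']
```

The regex only picks up the fragment "H1311", because the slashes and the leading digits of "08/H1311/66" fall outside the letter-then-digits pattern.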

So perhaps limit the search to paragraphs containing relevant trigger strings (e.g. "ethic", "review", "approv" or "consent") and then fish for results using different regex(es), e.g. one that matches anything containing both a digit and a letter?
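
A rough sketch of that idea, written in Python rather than XQuery and under some assumptions of my own: trigger matching is case-insensitive, candidates are whitespace-separated tokens with surrounding punctuation stripped, and the helper names `has_trigger` and `candidate_ids` are hypothetical.

```python
import re

# Trigger substrings suggested above.
TRIGGERS = ("ethic", "review", "approv", "consent")

def has_trigger(paragraph):
    """True if the paragraph contains any trigger substring (case-insensitive)."""
    lowered = paragraph.lower()
    return any(trigger in lowered for trigger in TRIGGERS)

def candidate_ids(paragraph):
    """Return tokens that contain both a digit and a letter,
    but only for paragraphs that mention a trigger string."""
    if not has_trigger(paragraph):
        return []
    candidates = []
    for token in paragraph.split():
        # Drop brackets and sentence punctuation around the token.
        token = token.strip("()[]{}.,;:\"'")
        if re.search(r"[0-9]", token) and re.search(r"[A-Za-z]", token):
            candidates.append(token)
    return candidates

example = (
    "The proposed research received ethics approval from the York Research "
    "Ethics Committee (08/H1311/66). The trial recruited 233 patients between "
    "November 2008 and June 2009."
)
print(candidate_ids(example))  # ['08/H1311/66']
```

Splitting on whitespace keeps slash-separated identifiers in one piece, at the cost of missing identifiers that contain spaces. This is only one way to read "both a number character and a letter character".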

verena-schwaiger commented 5 years ago

Here are the updated queries, one for the body and one for the abstract:

  1. For paragraphs:

     let $doc1 := doc("pmc_result.xml")
     let $regex := "[A-Z]+[0-9]+"
     for $paragraph in $doc1/pmc-articleset/article/body/sec/p
     where contains(string($paragraph), "received ethics approval")
     return (data(analyze-string($paragraph, $regex)/fn:match))

  2. For abstracts (which also often contain IDs):

     let $doc1 := doc("pmc_result.xml")
     let $regex := "[A-Z]+[0-9]+"
     for $paragraph in $doc1/pmc-articleset/article/front/article-meta/abstract
     where contains(string($paragraph), "received ethics approval")
     return (data(analyze-string($paragraph, $regex)/fn:match))

verena-schwaiger commented 5 years ago

Results of both queries (a bolded match leads to an ethics approval number; where the approval number differs from the bolded match, I list the true approval number as ea#):

HC17021 (https://www.ncbi.nlm.nih.gov/pubmed/30007927)
ACTRN12617000548336
ISRCTN80672011 (https://www.ncbi.nlm.nih.gov/pubmed/23398957, ea#: 4/13/03/00/09)
ISRCTN68690577
NCT01869855
NCT02270138
NCT02285790
NCT02330588
NCT02394119
NCT01935544
HREC706 (https://www.ncbi.nlm.nih.gov/pubmed/27852724)
ACTRN12616000459426
CRD42013006479
H8767 (https://www.ncbi.nlm.nih.gov/pubmed/25304191)
ACTRN12611000253909
NCT01494181
PDQ39
ACTRN12612000440820 (https://www.ncbi.nlm.nih.gov/pubmed/24114370, ea#: CF11/2662-2011001553)
ACTRN12608000270314 (https://www.ncbi.nlm.nih.gov/pubmed/19473546, ea#: HREC10452)
NCT02813343
H8561 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4327952/)
SSI2016
FHEC14 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6127267/, ea#: FHEC14/155)
NCT02616900
NCT02979964
GUI24
NCT00818597
NCT02604056
H10202
NCT01258985
MRE00
ISRCTN46035546 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3063252/, ea#: 10/H1211/2)
ISRCTN13968779
AAA9286 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3546889/)
HUM00043487 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3847163/)
LM010832
EC1010
NCT01850875
MD004930
C2015263 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5548199/, ea#: C2015263 [1720])
REB152296 (https://bmjopen.bmj.com/content/7/7/e017012, contains ea# for all involved institutions!)
REB16
HS18939 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5507480/)
ACTRN12614000644662
REB13
DA12065
M120368 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4824385/, secondary analysis!)
Q1506 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4994227/, ea#: 07/Q1506/61)
NCT01503814
H0505 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5593813/, ea#: 09/H0505/1)
AUG2007
D05403 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2427032/, ea#: HE24AUG2007-D05403)
H0306 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4385970/, ea#: 10/H0306/50)
AAA9286 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3214073/)
NCT00818597
H1311
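
Many of these matches are trial or review registration numbers rather than ethics approval numbers. As a possible triage step, the Python sketch below (not part of the XQuery above) sorts matches by well-known registry prefixes: NCT for ClinicalTrials.gov, ISRCTN, ACTRN for the Australian New Zealand Clinical Trials Registry, and CRD for PROSPERO. The prefix list and the function name `split_matches` are my own assumptions for illustration.

```python
# Prefixes of common trial/review registries (an illustrative, incomplete list).
REGISTRY_PREFIXES = ("NCT", "ISRCTN", "ACTRN", "CRD")

def split_matches(matches):
    """Split regex matches into (registry IDs, everything else)."""
    registry, other = [], []
    for match in matches:
        if match.startswith(REGISTRY_PREFIXES):
            registry.append(match)
        else:
            other.append(match)
    return registry, other

# A few matches taken from the results above.
sample = [
    "HC17021", "ACTRN12617000548336", "ISRCTN80672011", "NCT01869855",
    "HREC706", "CRD42013006479", "H8767",
]
registry, other = split_matches(sample)
print(other)  # ['HC17021', 'HREC706', 'H8767']
```

This is better treated as a sorting aid than a filter: some registry IDs in the list, such as ISRCTN80672011, sit in a paragraph that also gives the real approval number, so discarding them outright would lose those leads.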