timo-b-roettger / MoBa_Transparency

The associated repository for the MoBa Transparency project

Decide on sampling strategy #1

Open timo-b-roettger opened 1 year ago

timo-b-roettger commented 1 year ago

Possible strategies:

Using PRISMA as a guideline (a sketch of the flow-diagram bookkeeping follows below): http://prisma-statement.org/prismastatement/flowdiagram.aspx?AspxAutoDetectCookieSupport=1
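
In practice, the PRISMA flow diagram is mostly bookkeeping: records identified per source, duplicates removed, records excluded at screening, records included. A minimal sketch of that tally, assuming two source exports with the column names used later in this thread (europmc['TITLE'], pubMed['Title']); the file names are placeholders:

import pandas as pd

europmc = pd.read_csv('Search-EuropePMC.csv')  # placeholder path
pubMed = pd.read_csv('Search-PubMed.csv')      # placeholder path

# Identification: total records retrieved across sources
identified = len(europmc) + len(pubMed)

# Crude deduplication across sources on lower-cased titles
titles = pd.concat([europmc['TITLE'].str.lower(), pubMed['Title'].str.lower()])
afterDedup = titles.drop_duplicates().shape[0]

print(f'Records identified:  {identified}')
print(f'After deduplication: {afterDedup}')
print(f'Duplicates removed:  {identified - afterDedup}')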

timo-b-roettger commented 1 year ago

Pravesh's notes here.

timo-b-roettger commented 1 year ago

UPDATE:

We will use the list from MoBa as a baseline and sample from it, since this avoids false negatives and tedious filtering procedures. To do: get the list into a machine-readable format.

We need a sampling strategy for the covariate publication year:

a. either define an early and a late group and sample equally from both bins, or
b. sample from each year equally (as far as possible).

To decide: a or b. A sketch of both options is below.
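
A minimal sketch of both options, assuming the MoBa list ends up as a dataframe with a 'year' column; the file name, group sizes, and seed are all placeholders:

import pandas as pd

papers = pd.read_csv('moba_publications.csv')  # placeholder: the machine-readable MoBa list

seed = 2023    # fixed seed for reproducibility
nPerBin = 50   # placeholder group size for option a
nPerYear = 5   # placeholder draws per year for option b

# Option a: split at the median year into early/late bins, sample equally
medianYear = papers['year'].median()
early = papers[papers['year'] <= medianYear].sample(n=nPerBin, random_state=seed)
late = papers[papers['year'] > medianYear].sample(n=nPerBin, random_state=seed)
sampleA = pd.concat([early, late])

# Option b: equal draws per year, capped by what each year actually offers
sampleB = (papers.groupby('year', group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), nPerYear), random_state=seed)))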

parekhpravesh commented 1 year ago

I had included MATLAB code (parseHTML.txt) that parses the HTML page from MoBa and creates a new CSV file; maybe a useful starting point? [https://github.com/troettge/MoBa_Transparency/issues/5#issuecomment-1557646711]

parekhpravesh commented 1 year ago

Here is a zip file containing the following:

parseHTML-MoBaPublications.zip

MaxKorbmacher commented 1 year ago

For accessibility, I translated @parekhpravesh's code from MATLAB to Python. Note: local paths need to be adjusted!

import re

import pandas as pd
from bs4 import BeautifulSoup

# Initial parsing
with open('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/PageSource_MoBaPublications_22May2023.html', 'r') as file:
    code = file.read()

soup = BeautifulSoup(code, 'html.parser')
text = soup.get_text()
txt = text.split('\n')

# Get rid of extra content
startPoint = txt.index('2022')
endPoint = txt.index('Related Articles')
toParse = txt[startPoint:endPoint]

# Get rid of years
toRemove = [str(year) for year in range(2022, 2005, -1)]
toParse = [line for line in toParse if line not in toRemove]

# Get rid of lines like "12 articles" (allow optional whitespace after the count)
toParse = [line for line in toParse if not re.match(r'\d*\s*articles', line)]

# Remove HTML tags
toParse = [re.sub(r'<.*?>', '', line) for line in toParse]

# Remove "(no pagination)" and any content after that
toParse = [re.sub(r'\(no pagination\).*', '', line) for line in toParse]

# Extract year
allYears = []
for line in toParse:
    locs_allYears = re.findall(r'\(\d{4}', line)
    if len(locs_allYears) == 1:
        allYears.append(int(locs_allYears[0][1:]))
    elif len(locs_allYears) > 1:
        allYears.extend([int(year[1:]) for year in locs_allYears])

# For the four entries where two years are detected, keep only the plausible
# publication year; anything outside 2006-2023 is a spurious regex match
allYears = [year for year in allYears if 2006 <= year <= 2023]

# Extract article names
articleNames = []
for line in toParse:
    locQuotes = re.findall(r'"(.*?)"', line)
    if locQuotes:
        articleNames.append(locQuotes[0])

# Handle the case of incorrectly parsed articles
toRedo = [i for i, name in enumerate(articleNames) if len(name) < 10]
for loc in toRedo:
    currStr = toParse[loc]
    # the title follows the "(yyyy." pattern; take everything up to the next period
    match = re.search(r'\(\d{4}\.', currStr)
    if match:
        remStr = currStr[match.end():]
        endPos = remStr.index('.')
        articleNames[loc] = remStr[:endPos]

# Minor clean up - remove leading spaces in article names
articleNames = [re.sub(r'^\s+', '', name) for name in articleNames]

# Remove quotation marks at the end of the article name
articleNames = [name.strip('"') for name in articleNames]

# Turn everything to lower case
articleNamesLow = [name.lower() for name in articleNames]

# Filter out articles which are already present in PubMed
pubMed = pd.read_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/Search-PubMed-Filtered_lt2005.csv')
pubMedTitles = pubMed['Title'].str.lower()
isFound = []

for line in articleNamesLow:
    # match on the first 50 characters; slicing past the end of a short string
    # is safe in Python, so no try/except is needed
    locs = pubMedTitles.str.find(line[:50])

    locs = locs[locs != -1]
    if len(locs) == 1:
        title = articleNames[articleNamesLow.index(line)]
        pubMedTitle = pubMed['Title'].iloc[locs.index[0]]
        isFound.append([title, pubMedTitle, locs.index[0]])

pubMedTitleSet = set(pubMedTitles)
locs_notPubMed = [index for index, name in enumerate(articleNamesLow) if name not in pubMedTitleSet]
notPubMed = [[articleNames[i], str(allYears[i]), toParse[i]] for i in locs_notPubMed]

# Write out a text file for these papers which are not in PubMed
notPubMed_df = pd.DataFrame(notPubMed, columns=['ApproximateTitle', 'Year', 'TexttoParse'])
notPubMed_df.to_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/additionalPapers_HTML.csv', sep='\t', index=False)
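
Note that the script writes additionalPapers_HTML.csv with a tab separator despite the .csv extension, so it needs sep='\t' when read back in. A quick sanity check (same path as above):

check = pd.read_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/additionalPapers_HTML.csv', sep='\t')
print(check.shape)
print(check['Year'].value_counts().sort_index())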

parekhpravesh commented 1 year ago

Not sure if I shared them before, but the following are attached here:

parekhpravesh commented 1 year ago

Complete notes on sampling of papers: https://github.com/troettge/MoBa_Transparency/issues/5#issuecomment-1557646711

MaxKorbmacher commented 1 year ago

Sorry for pushing the complete notes entry up again, but here is also the Python version of the parsing script. We can link to both code versions later in the preregistration, e.g., by uploading them to our OSF repository.

import pandas as pd
import re

# Read data from Europe PMC
europmc = pd.read_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/Search-EuropePMC.csv')

# Flag papers published before 2005 (the > 0 guard skips missing years coded as 0)
toDelete_year = (europmc['PUBLICATION_YEAR'] > 0) & (europmc['PUBLICATION_YEAR'] < 2005)

# Flag patents (PAT), AGRICOLA (AGR), EThOS theses (ETH), NHS Evidence (HIR), and Europe PMC book metadata (NBK)
toDelete_type = europmc['SOURCE'].isin(['AGR', 'ETH', 'HIR', 'NBK', 'PAT'])

# Remove junk
junkCategories = ['Abstracts', 'Full Issue PDF.', 'Peer-reviewed Abstracts of Scientific Paper Presentation at The 58 Annual Conference of The West African College of Surgeons at Banjul, The Gambia 26th February To 2nd March, 2018.',
                  'Physicians Poster Sessions', 'Poster Presentations', 'Poster Sessions', 'Posters', 'Scientific Abstracts',
                  'Volume Contents', 'Congress', 'Conference', 'Symposium', 'Abstracts', 'Society', 'Meeting', 'Presentation',
                  '20th ECP', 'Seminar', 'Abstract', 'Session']

# Titles are lower-cased, so lower-case the junk patterns as well before matching;
# na=False treats records with missing titles as non-junk
isJunk = europmc['TITLE'].str.lower().str.contains(
    '|'.join(re.escape(j.lower()) for j in junkCategories), na=False)

# Remaining articles
europmc = europmc[~(toDelete_year | toDelete_type | isJunk)]

# Duplicate titles
dupTitles = europmc['TITLE'][europmc['TITLE'].duplicated()].unique()

# Check if preprint is available and delete it (Only ends up removing one)
for title in dupTitles:
    tempLoc = europmc.index[europmc['TITLE'] == title]
    idenPPR = europmc.loc[tempLoc, 'SOURCE'] == 'PPR'
    if idenPPR.sum() < len(tempLoc):
        europmc = europmc.drop(tempLoc[idenPPR])

# Check overlap with PubMed
pubMed = pd.read_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/Search-PubMed-Filtered_lt2005.csv')
europmcTitles = europmc['TITLE'].str.lower().str.replace(r'<.*?>', '', regex=True)
pubMedTitles = pubMed['Title'].str.lower()

isFound = []
foundIdx = []

for idx, line in europmcTitles.items():
    if pd.isna(line):  # skip records with missing titles
        continue
    # match on the first 50 characters; slicing past the end of a short string
    # is safe in Python, so no try/except is needed
    tmp = pubMedTitles.str.find(line[:50])
    locs = tmp[tmp != -1]

    if len(locs) == 1:
        isFound.append([europmc.loc[idx, 'TITLE'],
                        pubMed.loc[locs.index[0], 'Title'],
                        locs.index[0]])
        foundIdx.append(idx)

# Europe PMC rows whose titles were not matched in PubMed
notPubMed = europmc.drop(index=foundIdx)

# Do another round of filtering based on DOI
allEuroDOI = notPubMed['DOI']

# Remove empty DOI
allEuroDOI = allEuroDOI.dropna()

isSameDOI = allEuroDOI[allEuroDOI.isin(pubMed['DOI'])]

# Delete entries that have these DOIs
notPubMed = notPubMed[~notPubMed['DOI'].isin(isSameDOI)]

# Write out table as a new file
notPubMed.to_csv('/Applications/Projects/2023-05-22_MoBaTransparency/parsing/Search_EuropePMC_filtered_notPubMed.csv', index=False)