mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Assess tweaks to content extraction to remove headlines at end of article #86

Open rahulbot opened 6 months ago

rahulbot commented 6 months ago

After some digging on https://github.com/mediacloud/story-indexer/issues/278, it looks like changing our Trafilatura integration to pass favor_precision=True could help. In the sample code I provided, run against a few test cases from our researchers, it removed the unwanted text in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.

Test case (toggle the favor_precision argument to compare results):

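import requests
import trafilatura

MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'


def is_text_in_webpage_content(txt: str, url: str) -> bool:
    """Return True if txt appears in the article text Trafilatura extracts from url."""
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=True)  # set to False to compare
    if parsed is None:  # extraction failed, so there is no text to search
        return False
    content_text = parsed['text'] or ''
    return txt in content_text

# Each call prints True if the unwanted text leaked into the extracted article.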

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
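
For the wider vetting, a tally over many cases under both settings may be easier to read than individual True/False prints. The following is only a rough sketch, not the code on any branch: Case, trafilatura_text, and tally_leaks are hypothetical names, it assumes the page HTML has already been fetched, and it swaps in trafilatura.extract (which returns plain text or None) in place of bare_extraction.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Case(NamedTuple):
    needle: str  # boilerplate text that should NOT appear in the extracted article
    html: str    # raw page HTML, fetched ahead of time
    url: str


def trafilatura_text(html: str, url: str, favor_precision: bool) -> Optional[str]:
    """Extract the main text with Trafilatura; returns None when extraction fails."""
    import trafilatura  # imported lazily so the tally logic runs without it installed
    return trafilatura.extract(html, url=url, include_comments=False,
                               favor_precision=favor_precision)


def tally_leaks(cases: Iterable[Case],
                extract: Callable[[str, str, bool], Optional[str]] = trafilatura_text) -> dict:
    """Count how many cases leak the unwanted text under each setting."""
    counts = {"default": 0, "favor_precision": 0, "failed": 0}
    for case in cases:
        for label, flag in (("default", False), ("favor_precision", True)):
            text = extract(case.html, case.url, flag)
            if text is None:
                counts["failed"] += 1  # extraction produced nothing for this setting
            elif case.needle in text:
                counts[label] += 1     # unwanted text made it into the article body
    return counts
```

A lower "favor_precision" count than "default" would suggest the flag helps on that sample. It would not show whether real article text is being dropped, so that needs a separate check.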

rahulbot commented 6 months ago

(started testing work on the feature-favor-precision branch)

rahulbot commented 4 months ago

@pgulley can you queue this up to reassess with the test code, in light of the comment at https://github.com/adbar/trafilatura/issues/584#issuecomment-2233846387?