shahdevansh / author-extraction

Author extraction experiment for Media cloud

Media Cloud Assignment
Devansh Shah
IIIT Hyderabad.

Project: Experiment to assess candidate Python libraries for author extraction from online articles (Media Cloud assignment)

Aim: Given a dataset of extracted articles and labeled authors, assess the viability of two Python 3 libraries, Newspaper and Goose, and provide a recommendation with baseline results.
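One way to turn the aim into a baseline number is to score each library by how often its extracted authors match the labeled ones. The sketch below is only an illustration under assumptions: the function names `normalize` and `author_accuracy`, the exact-match criterion, and the case and whitespace normalization are hypothetical choices, not the method used in this experiment.

```python
def normalize(name):
    """Lowercase a name and collapse runs of whitespace (hypothetical normalization)."""
    return " ".join(name.lower().split())


def author_accuracy(predicted, labeled):
    """Return the fraction of articles whose predicted author set
    equals the labeled author set after normalization.

    predicted, labeled: lists of author-name lists, one entry per article.
    """
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        return 0.0
    hits = 0
    for pred_names, true_names in zip(predicted, labeled):
        # Compare as sets so author order and duplicates do not matter.
        if {normalize(n) for n in pred_names} == {normalize(n) for n in true_names}:
            hits += 1
    return hits / len(labeled)
```

For example, with made-up placeholder names, `author_accuracy([["Jane Doe"], []], [["jane  doe"], ["John Roe"]])` scores the first article as a match and the second as a miss. A stricter or looser criterion (partial matches, per-author precision and recall) may suit the labeled data better.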

Experiment - 1 (MVP): Install both Newspaper and Goose in a Python 3 virtualenv and parse a sample article to extract the title and author.
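A minimal sketch of what such an MVP script might look like is shown here. It assumes the PyPI packages `newspaper3k` (imported as `newspaper`) and `goose3` (the Python 3 port of Goose); the exact packages and versions used in the experiment are not stated. The helper names `extract_with_newspaper`, `extract_with_goose`, and `format_result` are hypothetical.

```python
def extract_with_newspaper(url):
    """Fetch and parse one article with Newspaper (assumed package: newspaper3k)."""
    # Imported lazily so this sketch loads even if the library is not installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, authors
    return {"title": article.title, "authors": list(article.authors)}


def extract_with_goose(url):
    """Fetch and parse one article with Goose (assumed package: goose3)."""
    from goose3 import Goose

    article = Goose().extract(url=url)
    return {"title": article.title, "authors": list(article.authors)}


def format_result(library, result):
    """Render one library's output as a single comparable line."""
    authors = ", ".join(result["authors"]) or "<none found>"
    return f"{library}: title={result['title']!r}, authors={authors}"
```

To compare the two libraries on a sample article, call both extractors with the same URL and print `format_result("Newspaper", ...)` and `format_result("Goose", ...)` side by side.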

Results (MVP):
Both libraries successfully fetch and parse the article.
Both libraries correctly parse the title and article text.
Goose is unable to parse the author.
Newspaper incorrectly parses the authors from the comments section.
Attached image from MVP

Experiment - 2 (Sample size: 186)



