mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Not capturing full article text #80

jaypinho closed 10 months ago

jaypinho commented 10 months ago

Example code:

from mcmetadata import extract

# Extract metadata for the story at this URL.
metadata = extract(url="https://www.nytimes.com/2024/02/11/world/europe/ukraine-soldier-draft.html")
# 'text_content' holds the extracted article text.
print(metadata['text_content'])

It returns article text up until "...which is fighting the largest war in Europe since World War II." But the article continues beyond that.

PS I know this is a very difficult / complex problem to solve in a generalizable and scalable way - kudos for all the hard work!

rahulbot commented 10 months ago

This is probably due to the NYTimes paywall, not an error in this code. We don't do anything to get around paywalls. Closing as out-of-scope. Please re-open if you find an example on a non-paywalled website.