mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Not capturing full article text #80

Closed jaypinho closed 9 months ago

jaypinho commented 9 months ago

Example code:

from mcmetadata import extract
metadata = extract(url="https://www.nytimes.com/2024/02/11/world/europe/ukraine-soldier-draft.html")
print(metadata['text_content'])

It returns the article text only up to "...which is fighting the largest war in Europe since World War II." But the article continues beyond that.

PS I know this is a very difficult / complex problem to solve in a generalizable and scalable way - kudos for all the hard work!

rahulbot commented 9 months ago

This is probably due to the NYTimes paywall, not any error in this code. We don't do anything to get around paywalls. Closing as out-of-scope. Please re-open if you find an example on a non-paywalled website.
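
For anyone who wants to check whether a truncated result is reproducible on a non-paywalled site before re-opening, here is a minimal sketch of one way to do it. It is not part of mcmetadata: the helper names `normalize` and `missing_snippets` are hypothetical, and the approach (comparing the extracted text against a few sentences copied by hand from the page) is only an assumption about what would make a useful report.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_snippets(extracted_text: str, expected_snippets: List[str]) -> List[str]:
    """Return the expected snippets that do not appear in the extracted text.

    Hypothetical helper: `expected_snippets` are sentences copied by hand
    from the article as it appears in a logged-out browser.
    """
    haystack = normalize(extracted_text)
    return [s for s in expected_snippets if normalize(s) not in haystack]


# Illustrative strings only; real input would be the extractor's output.
sample_text = "First paragraph of the story.\n\nSecond   paragraph of the story."
sample_expected = ["second paragraph of the story", "a sentence near the end"]
print(missing_snippets(sample_text, sample_expected))
```

To apply it to the example from this issue, pass `metadata['text_content']` as the first argument. If sentences that are visible without logging in come back as missing, that points to an extraction problem rather than a paywall.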