mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

CNN / bleacher report not parsing HTML well? #339

Open rahulbot opened 6 years ago

rahulbot commented 6 years ago

I did a search on explore and saw a few odd terms in the word cloud, like "media_width", "media_height", etc. So then I did a search for "media_width" and in the sampled stories saw that they are all from bleacherreport.com (media_id=24901) or CNN.

This doesn't feel critical, because I can easily ignore them, but I'm noting it in case it is an instance of some more important underlying problem.

Here is a CSV of the 115 stories that matched this in the last two weeks of the US Top Online collection: media_width-stories-20180319200754.csv.zip

rahulbot commented 6 years ago

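And now a user has noticed this as well. She was running this query https://explorer.mediacloud.org/#/queries/search?q=%5B%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%23004dcf%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B%5D,%22collections%22:%5B9139487%5D%7D,%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%23ff7f0e%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B1144%5D,%22collections%22:%5B%5D%7D,%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%232ca02c%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B%5D,%22collections%22:%5B34412405%5D%7D%5D and saw this word cloud: [image: unnamed-2] https://user-images.githubusercontent.com/673178/37690586-43728d56-2c82-11e8-9319-f77331020b98.png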

Any ideas?

hroberts commented 6 years ago

Spent ten minutes looking at this. bleacherreport.com has some insanely deeply nested JavaScript / HTML code that is confusing both the extractor and the HTML stripper. Will have to dig into this in more detail.
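To illustrate the kind of failure described here, the following is a minimal sketch, not Media Cloud's actual extractor or HTML stripper. It assumes the stray terms come from JSON-like configuration embedded in `<script>` blocks, which has not been confirmed for these pages. The names `naive_strip`, `TextOnlyParser`, and `strip_html` are hypothetical, and the sample markup is invented for the example. A regex that only deletes tags leaves script contents behind as "text", while a parser that tracks `<script>` and `<style>` elements can drop them.

```python
import re
from html.parser import HTMLParser


def naive_strip(html):
    """Delete anything that looks like a tag; script contents survive as text."""
    return re.sub(r"<[^>]+>", " ", html)


class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping everything inside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(html):
    """Return the visible text of an HTML fragment."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented sample: story text next to an embedded player configuration.
SAMPLE = (
    '<div><p>Final score: 3-1</p>'
    '<script>var cfg = {"media_width": 640, "media_height": 360};</script>'
    '</div>'
)

if __name__ == "__main__":
    print(naive_strip(SAMPLE))
    print(strip_html(SAMPLE))
```

If the real pages embed this data in attributes or in ordinary elements rather than in `<script>` blocks, this approach would not be enough on its own, and the extractor would need a closer look.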

rahulbot commented 6 years ago

For now I made a duplicate collection of the US Top 50 Online, minus Bleacher Report, for the user.

rahulbot commented 6 years ago

I don't think there is anything we can do about this, so I'm taking it off our schedule (but leaving it open).