Open rahulbot opened 6 years ago
And now a user has noticed this as well. She was running this query and saw this word cloud:
Any ideas?
Spent ten minutes looking at this. Bleacher Report has some insanely deeply nested JavaScript / HTML code that is confusing both the extractor and the HTML stripper. Will have to dig into this in more detail.
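For illustration only, here is a minimal sketch of the kind of failure described above and one way to guard against it. This is not Media Cloud's actual extractor or stripper; the class name `VisibleTextExtractor`, the helper `strip_html`, and the sample markup are all hypothetical. It assumes the stray terms such as "media_width" come from JavaScript embedded in the page, and it drops the contents of `<script>` and `<style>` elements before collecting text.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>.

    Hypothetical helper for illustration; not part of Media Cloud.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(html):
    """Return the visible text of an HTML string (sketch, see assumptions above)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample: story text with an inline script carrying embedded markup.
sample = (
    "<html><body><p>Trinidad wins.</p>"
    '<script>var cfg = {"media_width": 640, "media_height": 360, '
    '"html": "<div>x</div>"};</script>'
    "<p>Final score.</p></body></html>"
)
print(strip_html(sample))
```

A regex-based stripper that only removes tags would leave the script body, including "media_width", in the story text. Whether that is what is happening on the Bleacher Report pages still needs to be confirmed.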
On Tue, Mar 20, 2018 at 8:05 PM, rahulbot notifications@github.com wrote:
And now a user has noticed this as well. She was running this query https://explorer.mediacloud.org/#/queries/search?q=%5B%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%23004dcf%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B%5D,%22collections%22:%5B9139487%5D%7D,%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%23ff7f0e%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B1144%5D,%22collections%22:%5B%5D%7D,%7B%22label%22:%22Trinidad%22,%22q%22:%22Trinidad%22,%22color%22:%22%232ca02c%22,%22startDate%22:%222017-03-20%22,%22endDate%22:%222018-03-19%22,%22sources%22:%5B%5D,%22collections%22:%5B34412405%5D%7D%5D and saw this word cloud: [image: unnamed-2] https://user-images.githubusercontent.com/673178/37690586-43728d56-2c82-11e8-9319-f77331020b98.png
Any ideas?
-- Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University
For now I made a duplicate collection of the US Top 50 Online, minus Bleacher Report, for the user.
I don't think there is anything we can do about this, so I'm taking it off our schedule (but leaving it open).
I ran a search in Explorer and saw a few odd terms in the word cloud, such as "media_width" and "media_height". I then searched for "media_width" and saw that the sampled stories are all from bleacherreport.com (media_id=24901) or CNN.
This doesn't feel critical, because I can easily ignore them, but I'm noting it in case it is an instance of some more important underlying problem.
Here is a CSV of the 115 stories that matched this in the last two weeks of the US Top Online collection: media_width-stories-20180319200754.csv.zip