mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

assess adoption of sitemap-news file #683

Closed rahulbot closed 4 years ago

rahulbot commented 4 years ago

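One workaround to #605 is to see whether News sitemaps have been widely adopted. Google has a spec for one (https://support.google.com/news/publisher-center/answer/9606710?hl=en). A few questions:

  • How many sitemaps from a random global sample of media sources include this xmlns:news property?
  • How many sitemaps are hosted at the url /sitemap_news.xml or /sitemap-news.xml or some variant thereof? For instance, I found https://www.semana.com/sitemap_news.xml.

If this file is in use a lot (it feeds Google News), then we can avoid the question of deciding what is news and what isn't.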

hroberts commented 4 years ago

my guess is that this will not be definitive, but it should be a strong feature at the least.

-hal

pypt commented 4 years ago

I can come up with definite numbers if you'd like me to, but here is what I can say without running slow SQL queries:

> How many sitemaps from a random global sample of media sources include this xmlns:news property?

Not all of them - only the main news websites. To add to that, a lot of websites only list X of the most recent news articles in the Google News XML sitemap (similar to RSS) and publish their historical archive as a non-News sitemap hierarchy.
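To put a number on the first question, one could check each fetched sitemap for the Google News namespace declaration. The following is a minimal sketch, assuming the sitemap body has already been downloaded as bytes; the function name `declares_news_namespace` is made up for illustration and is not part of any Media Cloud code.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Namespace URI that the xmlns:news attribute points to in Google News sitemaps.
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def declares_news_namespace(sitemap_xml: bytes) -> bool:
    """Return True if the sitemap declares the Google News namespace.

    Hypothetical helper: it only looks at namespace declarations, so it
    also matches sitemaps that use a prefix other than "news".
    """
    try:
        # "start-ns" events yield (prefix, uri) pairs for every xmlns declaration.
        for _event, (_prefix, uri) in ET.iterparse(
            BytesIO(sitemap_xml), events=("start-ns",)
        ):
            if uri == NEWS_NS:
                return True
    except ET.ParseError:
        # Malformed XML is counted as "no News sitemap" in this sketch.
        return False
    return False
```

Because many sites list only their most recent articles in the News sitemap, a positive result here says the site publishes one, not that it covers the site's archive.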

> How many sitemaps are hosted at the url /sitemap_news.xml or /sitemap-news.xml or some variant thereof? For instance, I found https://www.semana.com/sitemap_news.xml.

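I'm not quite sure, because when we did the original crawl we didn't log the sitemap URL from which each potential story URL originated. Ultimate sitemap parser (https://github.com/berkmancenter/mediacloud-ultimate-sitemap-parser) first checks robots.txt for Sitemap: directives and then blindly tries a few URLs which might contain sitemaps too (https://github.com/berkmancenter/mediacloud-ultimate-sitemap-parser/blob/5c9c3bf424192b615cd9dba7c810b0c140c35a55/usp/tree.py#L14-L26). sitemap[_-]news.xml wasn't originally in this list but I've just added it.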
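The two-step discovery described above (Sitemap: directives first, then blind guesses) can be sketched roughly as follows. This is an illustration using only the standard library, not the parser's actual code: `GUESSED_PATHS` is a made-up short list, and both function names are hypothetical.

```python
import re
from typing import List
from urllib.parse import urljoin

# Hypothetical guess list for illustration; the parser keeps its own list in usp/tree.py.
GUESSED_PATHS = [
    "sitemap.xml",
    "sitemap_index.xml",
    "sitemap_news.xml",
    "sitemap-news.xml",
]

# Matches "Sitemap: <url>" at the start of a line, case-insensitively.
SITEMAP_DIRECTIVE = re.compile(r"^\s*sitemap\s*:\s*(\S+)", re.IGNORECASE)


def sitemaps_from_robots_txt(robots_txt: str) -> List[str]:
    """Collect URLs from Sitemap: directives; commented-out lines do not match."""
    urls = []
    for line in robots_txt.splitlines():
        match = SITEMAP_DIRECTIVE.match(line)
        if match:
            urls.append(match.group(1))
    return urls


def candidate_sitemap_urls(homepage: str, robots_txt: str) -> List[str]:
    """Declared sitemaps first, then guessed paths, without duplicates."""
    declared = sitemaps_from_robots_txt(robots_txt)
    guessed = [urljoin(homepage, "/" + path) for path in GUESSED_PATHS]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(declared + guessed))
```

Recording which candidate URL each story URL came from would make the second question answerable after the fact.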

hroberts commented 4 years ago

I wonder if the thing to do is just to create a simple single-medium moderation system that collects the potential stories for the site and shows basic stats along with a sample of 1,000 or so positive / negative hits for the media source. Linas can just walk through them one by one to confirm that each one makes sense, and adjust the heuristics (or re-model the ML) as he moderates each one. We could start with a handful of countries and then change the moderation plan (probably to something automated that only flags certain large or other problematic media sources for review).

This would let us quickly start using the sitemaps in production and also get good production experience with them while minimizing the chance of importing 100 million dating profiles. We would likely always have some sort of brake in place for a new media source with more than some threshold of potential new stories to avoid the dating profiles problem.
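As a rough illustration of the brake and the review sample, something like the following could sit in front of the import. It is only a sketch under assumptions: the threshold value, the `ModerationSummary` type, and `summarize_candidates` are all hypothetical, and it samples candidate URLs uniformly rather than splitting them into positive and negative hits.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold; a real value would come from production experience.
MAX_AUTO_IMPORT = 100_000


@dataclass
class ModerationSummary:
    media_id: int
    total_candidates: int
    needs_review: bool  # True when the brake should hold the import for a human
    sample: List[str]   # random URLs for the moderator to eyeball


def summarize_candidates(
    media_id: int,
    candidate_urls: List[str],
    sample_size: int = 1000,
    max_auto_import: int = MAX_AUTO_IMPORT,
    seed: Optional[int] = None,
) -> ModerationSummary:
    """Build basic stats and a review sample for one media source."""
    rng = random.Random(seed)
    # Never ask for more samples than there are candidates.
    k = min(sample_size, len(candidate_urls))
    return ModerationSummary(
        media_id=media_id,
        total_candidates=len(candidate_urls),
        needs_review=len(candidate_urls) > max_auto_import,
        sample=rng.sample(candidate_urls, k),
    )
```

A source whose candidate count exceeds the threshold would be held for review instead of imported, which is the case the dating-profiles example worries about.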

-hal

rahulbot commented 4 years ago

Moved this idea to #690