mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

do a quick dump from similar web #550

Closed rahulbot closed 5 years ago

rahulbot commented 5 years ago

It sounds like our SimilarWeb account is expiring soon, so we need to do a giant dump ASAP to get the data before it runs out. @hroberts it sounds like you have code to do this already. Is that correct, or should I spin up a quick script based on the early work Becky did for us a while back?

pypt commented 5 years ago

For reference, what is it that you'd like to dump again?

hroberts commented 5 years ago

we need to do a dump in which we request the result of mediawords.util.url.get_url_distinctive_domain() for each media source, rather than the raw URL.

-hal
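To make the intent concrete, here is a rough, self-contained stand-in for what `mediawords.util.url.get_url_distinctive_domain()` does conceptually: reduce a media source URL to its registrable domain so SimilarWeb is queried for `nytimes.com` rather than a full article URL. This is an illustrative sketch only, not the repo's actual implementation; the real helper handles many more cases (country-code TLDs, IP addresses, etc.).

```python
from urllib.parse import urlparse

def distinctive_domain(url: str) -> str:
    """Naive approximation: keep the last two hostname labels.

    The real get_url_distinctive_domain() in Media Cloud is smarter;
    this is just enough to illustrate the dump's input format.
    """
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

print(distinctive_domain("https://www.nytimes.com/section/world"))  # → nytimes.com
```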


-- Hal Roberts, Fellow, Berkman Klein Center for Internet & Society, Harvard University

rahulbot commented 5 years ago

The next task on this is to pull a set of 1000 random media URLs (via the call above) and run them against the SimilarWeb "traffic" API endpoint to get monthly visits over the last 6 months (save this JSON). The old idea was that we'd save that blob of data internally, but only return the average to display on our tool webpages.
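A minimal sketch of that test run might look like the following. The endpoint path, query parameter names, and response shape here are assumptions modeled on SimilarWeb's public API docs, not code from this repo; the date range is a placeholder for "the last 6 months".

```python
import json
import urllib.request
from statistics import mean

# Assumed SimilarWeb traffic endpoint; verify against their current API docs.
SW_API = "https://api.similarweb.com/v1/website/{domain}/total-traffic-and-engagement/visits"

def fetch_monthly_visits(domain: str, api_key: str,
                         start: str = "2018-09", end: str = "2019-02") -> dict:
    """Fetch monthly visits for one domain; store this raw JSON verbatim."""
    url = (SW_API.format(domain=domain)
           + f"?api_key={api_key}&start_date={start}&end_date={end}"
           + "&granularity=monthly&main_domain_only=false")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def average_visits(blob: dict) -> float:
    """Average the monthly visit counts from a saved JSON blob."""
    return mean(row["visits"] for row in blob["visits"])
```

Saving the whole blob keeps the month-by-month numbers around in case we later want more than the average, while the tool webpages only ever call `average_visits()`.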

Anissa reports that we have until 3/28, but this first task needs to happen this week, @pypt, so we can see whether it works well before starting a larger batch job. We only get 250k hits, so we'll still have to prioritize sources for this dump if this test round works.

pypt commented 5 years ago

Moved to -systems.