rahulbot closed this issue 5 years ago
For reference, what is it that you'd like to dump again?
We need to do a dump in which we request the result of mediawords.util.url.get_url_distinctive_domain() for each media source rather than the raw URL.
-hal
-- Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University
The next task on this is to pull a set of 1000 random media URLs (via the call above) and run them against the SimilarWeb (SW) API endpoint. We want to hit their "traffic" API so we get monthly visits over the last 6 months, and save that JSON. The old idea was that we'd save that blob of data internally but only return the average to display on our tool webpages.
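For what it's worth, here is a rough sketch of how that test round could be wired up. It is not existing Media Cloud code. It assumes the domains have already been produced by get_url_distinctive_domain(), that the actual SimilarWeb request lives in a caller-supplied fetch_traffic function (hypothetical, left unimplemented here so no endpoint details are guessed), and that the returned JSON looks like {"visits": [{"date": ..., "visits": ...}, ...]}, which should be checked against a real response. The names sample_domains, average_monthly_visits, and run_test_batch are made up for illustration.

```python
import json
import random
from pathlib import Path
from statistics import mean
from typing import Callable, Dict, Iterable, List, Optional


def sample_domains(domains: Iterable[str], k: int = 1000, seed: int = 0) -> List[str]:
    """Pick up to k distinct domains at random (seeded so the test round is repeatable)."""
    unique = sorted(set(domains))
    return random.Random(seed).sample(unique, min(k, len(unique)))


def average_monthly_visits(blob: dict) -> Optional[float]:
    """Average the monthly visit counts in one saved response.

    ASSUMPTION: the blob has a "visits" list of {"date", "visits"} entries.
    Returns None when there are no data points.
    """
    counts = [month["visits"] for month in blob.get("visits", [])]
    return mean(counts) if counts else None


def run_test_batch(
    domains: Iterable[str],
    fetch_traffic: Callable[[str], dict],  # hypothetical: one API hit per domain
    out_dir: str,
    k: int = 1000,
    seed: int = 0,
) -> Dict[str, Optional[float]]:
    """Fetch traffic for a random sample, save each raw JSON blob, return the averages."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    averages: Dict[str, Optional[float]] = {}
    for domain in sample_domains(domains, k, seed):
        blob = fetch_traffic(domain)
        # Keep the full response internally; only the average is meant for display.
        (out / f"{domain}.json").write_text(json.dumps(blob))
        averages[domain] = average_monthly_visits(blob)
    return averages
```

Since each sampled domain costs one hit in this sketch, a 1000-domain test round would use 1000 of the available hits. Passing the fetcher in as a function also means the sampling and averaging can be tried against a stub before spending any quota.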
Anissa reports that we have until 3/28, but this first task needs to happen this week, @pypt, so we can see whether it works well before starting a larger batch job. We only get 250k hits, so we'll still have to prioritize sources for this dump if this test round works.
Moved to -systems.
It sounds like our SimilarWeb account is expiring soon, so we need to do a giant dump ASAP to get data before it runs out. @hroberts, it sounds like you have code to do this already. Is that correct, or should I spin up a quick script based on the early work Becky did for us a while back?