tag media sources with country they are most about

rahulbot commented 8 years ago

To contribute to the metadata the research team wants on sources, we should automatically tag a source with the country it talks about most.

To me, this means writing a script that:

fetches sources from Media Cloud via the API
verifies if they have over some minimum number of stories
calls sentences/field_count to see what country our geotagging pipeline says their stories are about most often
calls some new API endpoint to add a "country-tag" to a media source (from a "source-about" tag set)

rahulbot commented 8 years ago

Here are two lines that can help:

1) This queries to fetch the top country a media source writes about:

mc.sentenceFieldCount('media_id:39872',field='tags_id_stories', tag_sets_id='1011', sample_size=10000)[0]

Returning results like this (for the Hindustan Times):

{"count": 3348,
 "label": "Republic of India",
 "tag": "geonames_1269750",
 "tag_sets_id": 1011,
 "tags_id": 8878677}

2) This line maps geonames ID to ISO 3166-1 alpha-3

COUNTRY_GEONAMES_ID_TO_APLHA3 = {3041565:"AND",290557:"ARE",1149361:"AFG",3576396:"ATG",3573511:"AIA",783754:"ALB",174982:"ARM",3351879:"AGO",6697173:"ATA",3865483:"ARG",5880801:"ASM",2782113:"AUT",2077456:"AUS",3577279:"ABW",661882:"ALA",587116:"AZE",3277605:"BIH",3374084:"BRB",1210997:"BGD",2802361:"BEL",2361809:"BFA",732800:"BGR",290291:"BHR",433561:"BDI",2395170:"BEN",3578476:"BLM",3573345:"BMU",1820814:"BRN",3923057:"BOL",7626844:"BES",3469034:"BRA",3572887:"BHS",1252634:"BTN",3371123:"BVT",933860:"BWA",630336:"BLR",3582678:"BLZ",6251999:"CAN",1547376:"CCK",203312:"COD",239880:"CAF",2260494:"COG",2658434:"CHE",2287781:"CIV",1899402:"COK",3895114:"CHL",2233387:"CMR",1814991:"CHN",3686110:"COL",3624060:"CRI",3562981:"CUB",3374766:"CPV",7626836:"CUW",2078138:"CXR",146669:"CYP",3077311:"CZE",2921044:"DEU",223816:"DJI",2623032:"DNK",3575830:"DMA",3508796:"DOM",2589581:"DZA",3658394:"ECU",453733:"EST",357994:"EGY",2461445:"ESH",338010:"ERI",2510769:"ESP",337996:"ETH",660013:"FIN",2205218:"FJI",3474414:"FLK",2081918:"FSM",2622320:"FRO",3017382:"FRA",2400553:"GAB",2635167:"GBR",3580239:"GRD",614540:"GEO",3381670:"GUF",3042362:"GGY",2300660:"GHA",2411586:"GIB",3425505:"GRL",2413451:"GMB",2420477:"GIN",3579143:"GLP",2309096:"GNQ",390903:"GRC",3474415:"SGS",3595528:"GTM",4043988:"GUM",2372248:"GNB",3378535:"GUY",1819730:"HKG",1547314:"HMD",3608932:"HND",3202326:"HRV",3723988:"HTI",719819:"HUN",1643084:"IDN",2963597:"IRL",294640:"ISR",3042225:"IMN",1269750:"IND",1282588:"IOT",99237:"IRQ",130758:"IRN",2629691:"ISL",3175395:"ITA",3042142:"JEY",3489940:"JAM",248816:"JOR",1861060:"JPN",192950:"KEN",1527747:"KGZ",1831722:"KHM",4030945:"KIR",921929:"COM",3575174:"KNA",1873107:"PRK",1835841:"KOR",831053:"XKX",285570:"KWT",3580718:"CYM",1522867:"KAZ",1655842:"LAO",272103:"LBN",3576468:"LCA",3042058:"LIE",1227603:"LKA",2275384:"LBR",932692:"LSO",597427:"LTU",2960313:"LUX",458258:"LVA",2215636:"LBY",2542007:"MAR",2993457:"MCO",617790:"MDA",3194884:"MNE",3578421:"MAF",1062947:"MDG",2080185:"MHL",718075:"MKD",2453866:"MLI",1327865:"MMR",2029969:"MNG",1821275:"MAC",4041468:"MNP",3570311:"MTQ",2378080:"MRT",3578097:"MSR",2562770:"MLT",934292:"MUS",1282028:"MDV",927384:"MWI",3996063:"MEX",1733045:"MYS",1036973:"MOZ",3355338:"NAM",2139685:"NCL",2440476:"NER",2155115:"NFK",2328926:"NGA",3617476:"NIC",2750405:"NLD",3144096:"NOR",1282988:"NPL",2110425:"NRU",4036232:"NIU",2186224:"NZL",286963:"OMN",3703430:"PAN",3932488:"PER",4030656:"PYF",2088628:"PNG",1694008:"PHL",1168579:"PAK",798544:"POL",3424932:"SPM",4030699:"PCN",4566966:"PRI",6254930:"PSE",2264397:"PRT",1559582:"PLW",3437598:"PRY",289688:"QAT",935317:"REU",798549:"ROU",6290252:"SRB",2017370:"RUS",49518:"RWA",102358:"SAU",2103350:"SLB",241170:"SYC",366755:"SDN",7909807:"SSD",2661886:"SWE",1880251:"SGP",3370751:"SHN",3190538:"SVN",607072:"SJM",3057568:"SVK",2403846:"SLE",3168068:"SMR",2245662:"SEN",51537:"SOM",3382998:"SUR",2410758:"STP",3585968:"SLV",7609695:"SXM",163843:"SYR",934841:"SWZ",3576916:"TCA",2434508:"TCD",1546748:"ATF",2363686:"TGO",1605651:"THA",1220409:"TJK",4031074:"TKL",1966436:"TLS",1218197:"TKM",2464461:"TUN",4032283:"TON",298795:"TUR",3573591:"TTO",2110297:"TUV",1668284:"TWN",149590:"TZA",690791:"UKR",226074:"UGA",5854968:"UMI",6252001:"USA",3439705:"URY",1512440:"UZB",3164670:"VAT",3577815:"VCT",3625428:"VEN",3577718:"VGB",4796775:"VIR",1562822:"VNM",2134431:"VUT",4034749:"WLF",4034894:"WSM",69543:"YEM",1024031:"MYT",953987:"ZAF",895949:"ZMB",878675:"ZWE"}

hroberts commented 8 years ago

I think it makes more sense just to do this on the perl side. I think I can actually do this entirely from a single postgres query, and it will be less work just to write a script that runs that query will be quicker than writing the hooks to let you write a python client.

How do I identify the geoname tags that are countries?

On Wed, Oct 19, 2016 at 11:03 AM, rahulbot notifications@github.com wrote:

Here are two lines that can help:

1) This queries to fetch the top country a media source writes about:

mc.sentenceFieldCount('media_id:39872',field='tags_id_stories', tag_sets_id='1011', sample_size=10000)[0]

Returning results like this (for the Hindustan Times):

{"count": 3348, "label": "Republic of India", "tag": "geonames_1269750", "tag_sets_id": 1011, "tags_id": 8878677}

2) This line maps geonames ID to ISO 3166-1 alpha-3 https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3

COUNTRY_GEONAMES_ID_TO_APLHA3 = {3041565:"AND",290557:"ARE",1149361:"AFG",3576396:"ATG",3573511:"AIA",783754:"ALB",174982:"ARM",3351879:"AGO",6697173:"ATA",3865483:"ARG",5880801:"ASM",2782113:"AUT",2077456:"AUS",3577279:"ABW",661882:"ALA",587116:"AZE",3277605:"BIH",3374084:"BRB",1210997:"BGD",2802361:"BEL",2361809:"BFA",732800:"BGR",290291:"BHR",433561:"BDI",2395170:"BEN",3578476:"BLM",3573345:"BMU",1820814:"BRN",3923057:"BOL",7626844:"BES",3469034:"BRA",3572887:"BHS",1252634:"BTN",3371123:"BVT",933860:"BWA",630336:"BLR",3582678:"BLZ",6251999:"CAN",1547376:"CCK",203312:"COD",239880:"CAF",2260494:"COG",2658434:"CHE",2287781:"CIV",1899402:"COK",3895114:"CHL",2233387:"CMR",1814991:"CHN",3686110:"COL",3624060:"CRI",3562981:"CUB",3374766:"CPV",7626836:"CUW",2078138:"CXR",146669:"CYP",3077311:"CZE",2921044:"DEU",223816:"DJI",2623032:"DNK",3575830:"DMA",3508796:"DOM",2589581:"DZA",3658394:"ECU",453733:"EST",357994:"EGY",2461445:"ESH",338010:"ERI",2510769:"ESP",337996:"ETH",660013:"FIN",2205218:"FJI",3474414:"FLK",2081918:"FSM",2622320:"FRO",3017382:"FRA",2400553:"GAB",2635167:"GBR",3580239:"GRD",614540:"GEO",3381670:"GUF",3042362:"GGY",2300660:"GHA",2411586:"GIB",3425505:"GRL",2413451:"GMB",2420477:"GIN",3579143:"GLP",2309096:"GNQ",390903:"GRC",3474415:"SGS",3595528:"GTM",4043988:"GUM",2372248:"GNB",3378535:"GUY",1819730:"HKG",1547314:"HMD",3608932:"HND",3202326:"HRV",3723988:"HTI",719819:"HUN",1643084:"IDN",2963597:"IRL",294640:"ISR",3042225:"IMN",1269750:"IND",1282588:"IOT",99237:"IRQ",130758:"IRN",2629691:"ISL",3175395:"ITA",3042142:"JEY",3489940:"JAM",248816:"JOR",1861060:"JPN",192950:"KEN",1527747:"KGZ",1831722:"KHM",4030945:"KIR",921929:"COM",3575174:"KNA",1873107:"PRK",1835841:"KOR",831053:"XKX",285570:"KWT",3580718:"CYM",1522867:"KAZ",1655842:"LAO",272103:"LBN",3576468:"LCA",3042058:"LIE",1227603:"LKA",2275384:"LBR",932692:"LSO",597427:"LTU",2960313:"LUX",458258:"LVA",2215636:"LBY",2542007:"MAR",2993457:"MCO",617790:"MDA",3194884:"MNE",3578421:"MAF",1062947:"MDG",2080185:"MHL",718075:"MKD",2453866:"MLI",1327865:"MMR",2029969:"MNG",1821275:"MAC",4041468:"MNP",3570311:"MTQ",2378080:"MRT",3578097:"MSR",2562770:"MLT",934292:"MUS",1282028:"MDV",927384:"MWI",3996063:"MEX",1733045:"MYS",1036973:"MOZ",3355338:"NAM",2139685:"NCL",2440476:"NER",2155115:"NFK",2328926:"NGA",3617476:"NIC",2750405:"NLD",3144096:"NOR",1282988:"NPL",2110425:"NRU",4036232:"NIU",2186224:"NZL",286963:"OMN",3703430:"PAN",3932488:"PER",4030656:"PYF",2088628:"PNG",1694008:"PHL",1168579:"PAK",798544:"POL",3424932:"SPM",4030699:"PCN",4566966:"PRI",6254930:"PSE",2264397:"PRT",1559582:"PLW",3437598:"PRY",289688:"QAT",935317:"REU",798549:"ROU",6290252:"SRB",2017370:"RUS",49518:"RWA",102358:"SAU",2103350:"SLB",241170:"SYC",366755:"SDN",7909807:"SSD",2661886:"SWE",1880251:"SGP",3370751:"SHN",3190538:"SVN",607072:"SJM",3057568:"SVK",2403846:"SLE",3168068:"SMR",2245662:"SEN",51537:"SOM",3382998:"SUR",2410758:"STP",3585968:"SLV",7609695:"SXM",163843:"SYR",934841:"SWZ",3576916:"TCA",2434508:"TCD",1546748:"ATF",2363686:"TGO",1605651:"THA",1220409:"TJK",4031074:"TKL",1966436:"TLS",1218197:"TKM",2464461:"TUN",4032283:"TON",298795:"TUR",3573591:"TTO",2110297:"TUV",1668284:"TWN",149590:"TZA",690791:"UKR",226074:"UGA",5854968:"UMI",6252001:"USA",3439705:"URY",1512440:"UZB",3164670:"VAT",3577815:"VCT",3625428:"VEN",3577718:"VGB",4796775:"VIR",1562822:"VNM",2134431:"VUT",4034749:"WLF",4034894:"WSM",69543:"YEM",1024031:"MYT",953987:"ZAF",895949:"ZMB",878675:"ZWE"}

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/berkmancenter/mediacloud/issues/67#issuecomment-254859355, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvvT6C2u44YPVXA0JzUt71B8vYhsFfFks5q1j9cgaJpZM4KbKDc .

Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University

rahulbot commented 8 years ago

Nice. That sounds easier.

The answer to your question is a little nuanced. a) All the country geoname ids are in the white-list of ids in the COUNTRY_GEONAMES_ID_TO_APLHA3 variable in solution (2) above. So if the tag name is "geonames_" plus one of those ids, it is a country tag. b) The tags_id_stories are the countries and states a story is about. If you aggregate those on a per-source basis, the top one is guaranteed to be a country, because we always tag with both the state and country.

I've thought for a while that, retrospectively, it would have been smart to have a country-geo tag-set, a state-geo tag-set, and an other-geo tag set.... would have made querying for top country and/or top-state easier.

hroberts commented 7 years ago

The subject country tagger is running in production now. It will add tags under the 'subject_country' tag set. It will likely take a couple of weeks to tag all existing media sources.

-hal

rahulbot commented 6 years ago

This has been done for a while.

mediacloud / backend

tag media sources with country they are most about #67