Those .co.za's and such are called second-level domains. There must be a list of those somewhere.
I had a very similar process in mind:
- write a new canonical_domain(media_url:str) -> str function that accepts an existing media domain or a full story URL and returns the canonical “cleaned up” domain we will use for deciding which media source it is
- run said function on all the media sources to generate a list with media_id, existing_url, canonical_url
- process said list to group media_id’s by canonical_url and then decide which one of them should be the one to keep… then add a merge_to column to the list with that anointed media_id we will keep for each set
This (similar) high-level process I had in mind leaves me with two questions about your proposal:
- How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?
- I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?
Well yeah, to accomplish the task (create a list of (media_id, decision) pairs) you need to come up with a way to represent each media source's URL in a canonical way. Our current attempt to do it is using the domain of the media source URL (e.g. https://www.nytimes.com/ gets canonicalized into nytimes.com), but I think this is a good opportunity to call this canonical form something else, e.g. media UID (not the best name though). I suspect that the canonical media identifier will be the source's domain name in 95% of cases, but let's leave some leeway for that 5% when we'll want to do something different for some sources.
How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?
We can absolutely get rid of a bunch of spam websites (e.g. *.us.com ones) which I'm sure don't serve any purpose anywhere. As for the implementation, the workflow doing the merging will have to delete the row with the old media source anyway (independently from whether it's being merged into something or deleted).
I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?
Have a look at #189: sometimes similar URLs "hide" in the feed URLs and not media URLs. In this particular case, media URLs are different:
# select media_id, url from media where media_id in (select media_id from feeds where url ilike '%pamplinmedia%') order by media_id;
media_id | url
----------+----------------------------------------------------------
1391 | http://beavertonvalleytimes.com/news/index.php
1392 | http://canbyherald.com/
1394 | http://estacadanews.com/news/index.php
1396 | http://lakeoswegoreview.com/news/index.php
1397 | http://fgnewstimes.com/news/index.php
1399 | http://sandypost.com/news/index.php
1400 | http://tigardtimes.com/news/index.php
1401 | http://westlinntidings.com/news/index.php
30593 | http://www.newberggraphic.com#spider
32581 | http://portlandtribune.com#spider
86434 | http://www.pamplinmedia.com/#spider
265617 | http://pamplinmediagroup.com/
653250 | http://portlandtribune.com/the-times-news/
655213 | http://portlandtribune.com/gresham-outlook-news/
659712 | http://portlandtribune.com/forest-grove-news-times-news/
662740 | http://portlandtribune.com/beaverton-valley-times-news/
662752 | http://portlandtribune.com/estacada-news-news/
662762 | http://portlandtribune.com/hillsboro-tribune-news/
662765 | http://portlandtribune.com/regal-courier-news/
662771 | http://portlandtribune.com/lake-oswego-review-news/
662776 | http://www.madraspioneer.com/
662783 | https://portlandtribune.com/clackamas-review-home
662784 | http://www.molallapioneer.com/
662789 | http://portlandtribune.com/oregon-city-news-news/
(24 rows)
but they all redirect to pamplinmedia.com, for which we have a bunch (254) of identical feeds:
# select feeds_id, media_id, url from feeds where url ilike '%pamplinmedia%' order by media_id, url;
feeds_id | media_id | url
----------+----------+--------------------------------------------------------------------------------
<...>
349104 | 1391 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
349083 | 1391 | http://www.pamplinmedia.com/component/obrss/canby-herald
349099 | 1391 | http://www.pamplinmedia.com/component/obrss/central-oregonian
<...>
349073 | 1392 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
349052 | 1392 | http://www.pamplinmedia.com/component/obrss/canby-herald
349068 | 1392 | http://www.pamplinmedia.com/component/obrss/central-oregonian
<...>
349133 | 1394 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
349112 | 1394 | http://www.pamplinmedia.com/component/obrss/canby-herald
349128 | 1394 | http://www.pamplinmedia.com/component/obrss/central-oregonian
(254 rows)
The end result is that the crawler hits the poor pamplinmedia.com server too often, so they got us blocked and we don't get anything from them anymore. There are issues like that with quite a few other media sources too. Instead of disabling feeds for those media sources or something like that, we can use this opportunity to just merge them.
I don't think it's hard to implement - you compare the sets of feed URLs between two media sources, and if they're something like 70% identical, it means that one media source has to be merged into the other.
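A rough sketch of what I mean, assuming the feed URLs are already loaded per media source (the names here are just illustrative, not from the actual codebase):

```python
# Sketch: flag media source pairs whose feed URL sets mostly overlap (threshold is arbitrary).
from itertools import combinations
from typing import Dict, List, Set, Tuple

def likely_duplicates(
    feeds_by_media: Dict[int, Set[str]], threshold: float = 0.7
) -> List[Tuple[int, int, float]]:
    """Return (media_id_a, media_id_b, overlap) pairs whose feed URL sets mostly match."""
    candidates = []
    # O(n^2) over media sources; fine for a sketch, bucket by domain first for the real thing.
    for a, b in combinations(sorted(feeds_by_media), 2):
        feeds_a, feeds_b = feeds_by_media[a], feeds_by_media[b]
        if not feeds_a or not feeds_b:
            continue
        # Overlap relative to the smaller set, so a small source nested in a big one still matches.
        overlap = len(feeds_a & feeds_b) / min(len(feeds_a), len(feeds_b))
        if overlap >= threshold:
            candidates.append((a, b, overlap))
    return candidates
```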
James, did Hal reply?
On Thu, Jul 29, 2021, 04:50 rahulbot @.***> wrote:
I had a very similar process in mind:
- write a new canonical_domain(media_url:str) -> str function that accepts an existing media domain or a full story URL and returns the canonical “cleaned up” domain we will use for deciding which media source it is
- run said function on all the media sources to generate a list with media_id, existing_url, canonical_url
- process said list to group media_id’s by canonical_url and then decide which one of them should be the one to keep… then add a merge_to column to the list with that anointed media_id we will keep for each set
This (similar) high-level process I had in mind leaves me with two questions about your proposal:
- How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?
- I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?
Not yet, will update here when I hear from him.
Here's from Hal:
the find_domain_media_dups.py is the script that prints what you want -- a list of media that would be merged if you switch to unique media domains, at a glance, it treats as merge candidates any pair of media sources that either shares a normalized domain or are already connected by a dup_media_id reference in the media table.
the dedup_topic_media.pl script is the thing that generates the dup_media_id fields. it also works by comparing normalized media domains, but its purpose is just to set the work around dup_media_id field in the media table. that field is mostly just used to tell the topic spider to merge stories from media joined by dup_media_id (so if the topic spider thinks a story should belong to media_id 10, but media_id 10 has a dup_media_id of 1, then all stories that would have gone into media_id 10 instead go into media_id 1). I used to have to run dedup_topic_media all the time to clean up duplicate media in topics, but I only had to run it maybe a couple times in the last couple of years just because we already have the vast majority of dup media marked in big US media.
the vast majority of those dup_media_id values were set by running the dedup_topic_media script, but every once in a while I would just set the dup_media_id manually when I found a duplicate (mostly someone would report a duplicate media source within a topic that had run).
I ran these sorts of scripts by firing up a temporary custom container. you can see a super handy script in my home directory called 'bin/start-container.sh' that will start a container of a given image with all of the necessary env vars set for the container config. I think I also had another script to dump new config variables into start-container.sh, but in practice the config variables almost never changed. that start-container.sh script was probably my handiest and most frequently used productivity script. I'm sure I ran the find_domain_media_dups script by running start-container.sh in a screen session and just running the script in the screened container.
the big things off the top of my head as you jump into this are:
- how to handle merging the media set memberships of duplicate media
- how to handle merging the stories from duplicate media, keeping in mind that stories updates in postgresql are super slow because of all the triggers and references in the stories table
- how to handle updating old topics that reference media and stories that will no longer exist once you start merging media
I've been working on a script that builds on Hal's find_domain_media_dups.py. I'll post a link on this thread once it's in a semi-presentable state.
On the subject of how to determine which source among duplicates becomes the parent (i.e., the one to which all others are merged):
in general, we choose the media source with the best content. it's often hard to figure that out. the dedup_topic_media script is my best attempt at trying to figure out when I can just automatically decide which media source to merge into the other.
otherwise, you just have to stare at them and make a decision manually, which is why that script runs in interactive mode.
I have an idea for attempting to automate this further which I'll demonstrate in my script; I'll post that here when ready
Could you post what you have so far, no matter how unfinished / deep in the trenches / a bunch of hacks it is? I.e. would you be able to:
- Post a link to a script / notebook that does the deduplication; and
- Post a link to some intermediate results that that script / notebook has generated?
in general, we choose the media source with the best content. it's often hard to figure that out. the dedup_topic_media script is my best attempt at trying to figure out when I can just automatically decide which media source to merge into the other.
I'd lean towards merging into a media source with the most "presentable" / canonical URL (domain), i.e. with
media_id,domain
1,subsection1.nytimes.com
2,subsection2.nytimes.com
3,www.nytimes.com
4,nytimes.com
we should make it easy for us and merge everything into 3,www.nytimes.com or 4,nytimes.com (even if that media ID has no stories / feeds and was empty before).
Notebook here. I'm still working on finding a way to run the dedupe_media function that doesn't take many hours, so I don't have a CSV to share at this point; any optimization advice you've got on that would be helpful.
Can you run with buckets of media_ids so that you can chunk it into jobs that parallelize?
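A minimal sketch of that kind of bucketing, assuming a hypothetical dedupe_bucket() helper that runs the per-source logic on a subset of media_ids:

```python
# Sketch: split media_ids into buckets and run them through a process pool.
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List

def chunked(ids: List[int], size: int) -> Iterable[List[int]]:
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def dedupe_bucket(media_ids: List[int]) -> List[dict]:
    # Placeholder for the per-source part of the dedupe logic, run on this bucket only.
    return []

def dedupe_in_parallel(all_media_ids: List[int], bucket_size: int = 50_000, workers: int = 8) -> List[dict]:
    # Note: buckets should be cut along canonical-domain groups so one group
    # isn't spread across workers.
    decisions: List[dict] = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for bucket_result in pool.map(dedupe_bucket, chunked(all_media_ids, bucket_size)):
            decisions.extend(bucket_result)
    return decisions
```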
<subjective_opinions>
I think it would be way easier if you loaded a simple CSV instead of reading data from PostgreSQL. Once loaded, you can speed up subsequent loads by storing the media table pickled, but if that ends up loading slowly, you can also look into shelve, another standard library module which lets you use on-disk dictionaries as if they were residing in RAM. You can even store the input CSV in the GitHub repository itself - I think our media source list isn't confidential, so we can just make that repo public together with the full list of media sources.
Same with pandas - why bother with it? You can have a list of dicts, or a dict of dicts (mapped by media_id, domain or something like that), or a dict with media_id keys and dict values for the actual rows, complemented with a bunch of lists / sets / dicts / whatever that reference those media_ids:
media = {
# media_id
1: {
'media_id': 1,
'name': '...',
},
2: {
# ...
},
}
# Set of media IDs to be deleted
media_to_be_deleted = {
1,
2,
3,
}
# Map of what gets merged into what
media_to_be_merged = {
# 2 gets merged into 1,
2: 1,
# ...
}
or some other native stuff which would be very fast and easy to use.
</subjective_opinions>
But again, how exactly you come up with this merge list is up to you; I just thought I'd post some ideas.
Which part of that script is slow?
Only people on LinkedIn quote themselves, but I'll go ahead and do it:
Could you post what you have so far, no matter how unfinished / deep in the trenches / a bunch of hacks it is? I.e. would you be able to:
- Post a link to a script / notebook that does the deduplication; and
- Post a link to some intermediate results that that script / notebook has generated?
You've posted the notebook that you're working on, but we need the intermediate results too. If the notebook is currently only able to filter out a bunch of spam domains and nothing else, that's great too, but still post the notebook that does nothing but spam domains + the intermediate media_id, name, url, decision CSV results here (and if the notebook does spam domain filtering and almost does other stuff but not quite yet, just comment out that other stuff for now, post the link to both the notebook that generates the decision CSV and the CSV itself, and continue working on that stuff while we all stare at the notebook and the CSV). We really need to iterate on this list together, and isolating yourself in the trenches is how this particular task gives people PTSD :)
Alrighty, I've managed to whittle it down to 118K sources among the 1.86M that have yet to be classified, and are therefore tagged "review." The others have all been tagged either as parent domains, children of parent domains, delete-able, or as blogging subdomains (and, consequently, new entries). You should be able to download the work-in-progress list here if you're signed into the Media Cloud Google org:
https://drive.google.com/file/d/1JYScr3b9AmEYLHH6x6RReVNLAeVrjnRf/view?usp=sharing
Notebook repo also updated. The part that was really slowing me down was taking the non-parent domains and searching among the parents to find their appropriate relations. I ended up using the pandasql package to do so via a LEFT JOIN on the parents and maybe-children, as opposed to loops with hundreds of thousands of dict-search operations, which proved quite time-consuming.
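For reference, a stripped-down sketch of that pandasql approach (the DataFrame and column names here are illustrative, not the notebook's exact ones):

```python
# Sketch: one LEFT JOIN via pandasql instead of per-row dict lookups.
import pandas as pd
from pandasql import sqldf

parents = pd.DataFrame({"media_id": [4], "media_uuid": ["nytimes.com"]})
maybe_children = pd.DataFrame(
    {"media_id": [1, 2, 3], "media_uuid": ["nytimes.com", "nytimes.com", "example.com"]}
)

query = """
SELECT c.media_id, c.media_uuid, p.media_id AS parent_id
FROM maybe_children c
LEFT JOIN parents p ON p.media_uuid = c.media_uuid
"""
# Rows with a NULL parent_id have no parent sharing the same media_uuid.
matched = sqldf(query, locals())
```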
@pypt, to your question, I originally loaded the media table into Postgres locally because it was handy for my workflow; I was exploring the data and querying it in Postico while I was iterating on the notebook. The multiplicity of data structures in the notebook maybe isn't the most comprehensible (postgres tables, DataFrames, pandasql queries, dict lists), but it helped to have options when thinking through the different operations required.
My next step is to spend more time Monday looking at the media sources tagged "review" and figuring out how to handle them. I'll update here with progress.
Can you make that Drive link shareable to "Anyone with a link"? Rahul doesn't have mediacloud.org account.
Google Sheets / LibreOffice / Excel all gave up on that CSV :)
Could you make it into a SQLite3 database and post it to the Google Drive? That way we could import it into something like https://sqliteonline.com/
Also, could you get rid of the following columns in the output:
Okay, I now have (at least a first pass at) all 1.86M rows in the media database flagged for either deletion, merger to a parent source, or no action (i.e. they're the parent source). Just shared Dropbox links with you guys for a CSV + sqlite dump—let me know if you have any trouble accessing. Some notes:
- There are roughly 1.6M unique media_uuids for the 1.86M media_ids in the dataset. That means roughly 260k duplicate sources to be merged.
- The parent in each group is the source with the shortest url associated with a given media_uuid. That means, for example, that http://traveltips.usatoday.com/, http://www.usatoday.com/#spider, and http://onpolitics.usatoday.com/ are to be merged with http://usatoday.com.
- Picking the shortest url for a media_uuid tracks with that, since what you end up getting is homepages as parents that subsume subdomains and section paths.
When you update it after some further changes, would you be able to both upload it to the shared Google Drive folder and include dup_media_id in the output?
I've imported the dump into a database and uploaded it here, to the "Media deduplication immediate results" folder in the "Shared drives" / "Backend":
https://drive.google.com/drive/u/0/folders/1I77rqmbwtiqgoh3c7epFwt49WiTxwUtH
So then, as of Aug 24, 2021, the permalink to the script (notebook) that generates the results is:
and the intermediate results themselves are here:
https://drive.google.com/file/d/1V1WX3jZEvC0wGCgIfm4QP2DMBmYmB5Id/view?usp=sharing
A pretty decent app for browsing those SQLite3 databases is https://sqlitebrowser.org/dl/, e.g. it lets you search for arbitrary strings in all the columns, but there are many options out there.
Some observations:
Aside from URLs that are plain invalid (and that you have filtered out already (I think)), there are a lot of URLs that don't make sense, e.g. the ones with http://host:443/ - port 443 is for SSL (https://), and this URL basically says "connect to host's 443 (secure) port but try an unencrypted connection". I think the intention of whoever added those URLs here was to add https://host instead of http://host:443. Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with fewer hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.
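A minimal sketch of such a rewrite step (standalone and illustrative, not taken from your notebook):

```python
# Sketch: rewrite "http://host:443/..." as "https://host/..." before computing media_uuid.
from urllib.parse import urlsplit, urlunsplit

def rewrite_443(url: str) -> str:
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:
        # Malformed port; leave it for the invalid-URL handling instead.
        return url
    if parts.scheme == "http" and port == 443:
        return urlunsplit(("https", parts.hostname or "", parts.path, parts.query, parts.fragment))
    return url

assert rewrite_443("http://example.com:443/") == "https://example.com/"
assert rewrite_443("http://example.com/") == "http://example.com/"
```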
Then maybe we should use this as an opportunity to update media.url (to be able to get rid of the normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?
The "merge everything into the first media source (as sorted by media_id
) with the representative-looking URL" didn't quite work for latimes.com
:
select * from deduped_media where media_id in (6, 18444);
media_id | url | media_uuid | name | normalized_url | action | is_parent | parent_id | parent_url
-- | -- | -- | -- | -- | -- | -- | -- | --
6 | http://www.latimes.com/ | latimes.com | LA Times | http://latimes.com/ | merge(18444) | no | 18444 | http://www.latimes.com
18444 | http://www.latimes.com | latimes.com | latimes.com | http://latimes.com/ | skip | yes | 18444 | http://www.latimes.com
i.e. it considers the URL with media_id = 6 as non-"representative" (due to the slash suffix?) and decides to merge everything into media_id = 18444.
Be wary of potential duplicates of those media URLs with the #spider fragment - seemingly they sometimes include a single slash for the path (http://host/#spider) and other times they don't (http://host#spider).
Where do those #spider media URLs come from? It looks like they get added in the topic mapper. Here it tries to find a URL that does not yet exist in the media table, and if it fails at doing that, it adds the #spider fragment to the URL; and it supposedly adds those media sources for stories discovered through the topic mapper.
It does try to reuse an existing media source, but is quick to give up and just slap on a new one because:
Terse!
We need to get rid of these #spider media sources before we start merging all of them, because otherwise the topic mapper will just fill up the table with a bunch of new duplicates in a day or so. Could you email Hal and ask him what that comment above means?
In the notebook itself, you have a list of domains that "bear no relation to media":
# sites that bear no relation to media
GARBAGE_DOMAINS = ['splashthat.com', 'lin.is', 'mail.ru', 'readthedocs.io', 'myshopify.com',
'hotels-pt.net', 'conferenceseries.com', 'hotelescotedazur.com', 'deviantart.com',
'top-hotels-gr.com', 'sourceforge.net', 'podbean.com', 'webstatsdomain.org', 'hotels-sicily.net',
'hotels-istria.net', 'zoom.us', 'schoolloop.com', 'oursite.com', 'translate.goog', 'arcgis.com',
'bandcamp.com', '', 'square.site', 'liveinternet.ru', 'netdna-cdn.com', 'lnk.to', 'convey.pro',
'photobucket.com', 'netdna-ssl.com', 'sched.com', 'itch.io', 'apache.org', 'newhookupsite.biz',
'as.me', 'freeadsaustralia.com', 'lpages.co', 'freeadsinuk.co.uk', 'app.link', 'custhelp.com',
'memberclicks.net', 'webex.com', 'dropboxusercontent.com', 'debian.org', 'googleusercontent.com',
'sourceforge.io', 'hyatt.com', 'marriot.com', 'eventbrite.com', 'seamlessdocs.com',
'ticketspice.com', 'us.com', 'newoffers.info', 'craigslist.org', 'craigslist.co.th',
'askdefine.com', 'webstatsdomain.org', 'translate.goog', 'qualtrics.com', 'zendesk.com',
'business.site']
Unfortunately, counterintuitively, and (again) thanks to the topic mapper, the rows in media don't have to have much to do with news media. The topic mapper spiders the web looking for links from news articles to other news articles, and it is unable to determine whether the newly found URL (story) comes from a news media source or not. Every story has to have a media source as per the schema design, so to accommodate a new story, it has to find / create a media source for every story that was found through spidering.
Therefore, GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites. Every other media "source" is legit there.
Do you look into feed URLs when looking for duplicate media sources? Here's an example: https://github.com/mediacloud/backend/issues/189. Is there a way to deduplicate those too?
Thanks for your work on this, really taking shape! I'll stare at it a bit more later.
- Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with less hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.
- Then maybe we should use this as an opportunity to update media.url (to be able to get rid of normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?
This bears discussion over Hangouts, but I think that instead of modifying the existing url values in the table, we should add a new media_uuid column to the schema and parse that value on the backend for incoming stories/sources (a la @rahulbot's canonical_domain(media_url:str) -> str suggestion). In that case, tldextract would dedupe http://james.com, https://james.com, and http://james.com:443 into the single media_uuid james.com, and we wouldn't care about whether :443 is in a url.
In any case, this should be a separate task IMO, although one deployed before the deduping.
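A rough sketch of what that tldextract-based media_uuid derivation could look like (the blogging-platform special-casing mirrors the "blogging subdomains" handling mentioned above, but the exact platform list here is just an assumption):

```python
# Sketch: collapse scheme/port/subdomain variants into one registered domain, keeping
# the subdomain for blogging platforms where each subdomain is a separate publication.
import tldextract

BLOG_PLATFORMS = {"blogspot.com", "wordpress.com", "tumblr.com"}  # illustrative list

def canonical_domain(media_url: str) -> str:
    ext = tldextract.extract(media_url)
    registered = f"{ext.domain}.{ext.suffix}" if ext.suffix else ext.domain
    if registered in BLOG_PLATFORMS and ext.subdomain and ext.subdomain != "www":
        return f"{ext.subdomain}.{registered}"
    return registered

assert canonical_domain("http://james.com") == "james.com"
assert canonical_domain("https://james.com") == "james.com"
assert canonical_domain("http://james.com:443") == "james.com"
```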
...i.e. it considers the URL with media_id = 6 as non-"representative" (due to the slash suffix?)
Correct, technically the notebook is just looking for the shortest url string once www has been removed. I can add a step to also remove the trailing / from the shortest-url calculation.
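For what it's worth, a minimal sketch of that tie-breaking rule (illustrative, not the notebook's exact code):

```python
# Sketch: within one media_uuid group, prefer the shortest URL once "www." and a
# trailing "/" are ignored; break remaining ties on the lowest media_id.
from typing import List

def pick_parent(group: List[dict]) -> int:
    """group is a list of rows like {'media_id': 6, 'url': 'http://www.latimes.com/'}."""
    def key(row: dict):
        url = row["url"].rstrip("/").replace("www.", "", 1)
        return (len(url), row["media_id"])
    return min(group, key=key)["media_id"]

rows = [
    {"media_id": 6, "url": "http://www.latimes.com/"},
    {"media_id": 18444, "url": "http://www.latimes.com"},
]
# Both normalize to the same length, so the lower media_id (6) wins here.
assert pick_parent(rows) == 6
```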
Be wary of potential duplicates of those media URLs with #spider fragment...
Emailed Hal. I suspect the /#spider vs. #spider issue is the result of the general fact that the presence of trailing slashes on urls is inconsistent—e.g., both http://latimes.com and http://latimes.com/ are in the media table. This is perhaps another argument for introducing media_uuid to the schema and using it as the distinguishing field for deduplication; we can then leave url, with all its goofiness, aside.
GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites...
We should talk in the Wednesday group meeting about what the heuristic for GARBAGE_DOMAIN status should be. For example, if we say a V*agra-shilling website falls into this category, what's the conceptual distinction between that and a site shilling discount hotel rooms, such as hotels-sicily.net, or a site built for RSVPs to corporate events? (Sidenote: there are ~8,000 hotel booking-related sites in the media table.) In the future, perhaps we have the data curator hire spot-check new media sources on a weekly basis or something.
Do you look into feed URLs when looking for duplicate media sources?
At present, no, but certainly those could be deduplicated in similar fashion. I think this also should be a separate task if that makes sense to you.
Thanks for this thoughtful guidance! Updated database + CSV export to come in Google Drive later today.
- Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with less hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.
- Then maybe we should use this as an opportunity to update media.url (to be able to get rid of normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?
This bears discussion over Hangouts, but I think that instead of modifying the existing url values in the table, we should add a new media_uuid column to the schema and parse that value on the backend for incoming stories/sources (a la @rahulbot https://github.com/rahulbot's canonical_domain(media_url:str) -> str suggestion) . In that case, tldextract would dedupe http://james.com, https://james.com, and http://james.com:443 into the single media_uuid james.com, and we wouldn't care about whether :443 is in a url.
In any case, this should be a separate task IMO, although one deployed before the deduping.
Those would be two related columns that have different purposes:
- media.url - representative URL that gets shown to the user and that someone can click on to go to that website;
- media.media_uuid - unique identifier (almost always a hostname) used by us internally to prevent duplicates ending up in the media table.
So we'll definitely have to add the media_uuid column at some point later on, set media UUIDs for all existing media, and add some code somewhere to check for duplicates before a media source gets added, but given that you'll be normalizing urls anyway at this point for this specific task (coming up with a list of what should get merged into what), e.g. fixing the http://host:443 entries, I figured that we might as well update them in the media table.
But sure, we can leave them for later too.
GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites...
We should talk in the Wednesday group meeting about what the heuristic for GARBAGE_DOMAIN status should be. For example, if we say a V*agra-shilling website falls into this category, what's the conceptual distinction between that and a site shilling discount hotel rooms, such as hotels-sicily.net, or a site built for RSVPs to corporate events? (Sidenote: there are ~8000 hotel booking-related sites in the media table) In the future, perhaps we have the data curator hire spot-check new media sources on a weekly basis or something.
This is unfortunately how it's currently architected - all stories need a parent media ID, and the topic mapper ends up with stories that have no parent media source - and rearchitecting it to work some other way is six months of work. It's not a superb architecture, but this is how things currently work, and it would be super hard to change that.
I think we don't particularly care that much about the Xanax websites in the media table - if it's easy to spot (which it is), and easy to delete (we'll have to have a media source removal workflow step in any case), and it's absolutely positively useless to us (e.g. it's a spam website), then we should get rid of it; otherwise leave them be - they're just rows in a table. Hotel media sources and such are potentially useful - what if a bunch of news websites link to a specific funny / problematic Airbnb ad? :) (They never do, but that's at least the idea behind the topic mapper.)
What is an actual pain point for users (and the crawler), and the reason why we do this deduplication in the first place, is that there are X copies of nytimes.com, Y copies of latimes.com and Z copies of usatoday.com - actual media sources with actual news that belong to collections.
Do you look into feed URLs when looking for duplicate media sources?
At present, no, but certainly those could be deduplicated in similar fashion. I think this also should be a separate task if that makes sense to you.
Unfortunately, you'll have to look into feed URLs as part of this task too, as they also cause problems - in the linked task, we have something like 20 copies of pamplinmedia.com, which is a legit news website, so the crawler hits that website a gazillion times a day and they got us blocked.
Thanks for this thoughtful guidance! Updated database + CSV export to come in Google Drive later today.
No need for the CSV, just the pre-imported database + permalink to the current version of the notebook!
I don't have permission to upload to that Google Drive folder at present, but here's the latest deduping:
https://drive.google.com/file/d/1BYePH2w6RhXHPU-eivYF1ZJDhlFBAFLB/view?usp=sharing
Notebook permalink here:
https://github.com/mediacloud/media-dedupe-notebook/blob/master/media_dedupe.ipynb
I see what you mean now about the URLs—in all this database spelunking I'd forgotten about how this stuff actually gets displayed to users 😛. I've added in the notebook a step to address the :443 issue.
Re: GARBAGE_DOMAINS, I still don't really get the distinction; you could have a spam site listing discount hotel rooms, or conversely you could have blogs linking to a particularly comical Xanax broker. In any case, I think it would be valuable to hear from the researchers on this.
Here's from Hal on the #spider issue:
the topic spider has to autogenerate media sources for stories for which it cannot find a matching media source (it tries to find a matching media source based on the hostname in the url of a story that itself did not match an existing story in the database by url).
for some reason that is long lost to the fog of war, it used to be possible for some reason for the topic spider to try to create a media source with a url that already existed in the database. my infinitely wise answer was just to stick '#spider' on the end of the spider generated media source url, so you could end up with a 'http://nytimes.com/' media source and a 'http://nytimes.com/#spider' media source. I think one issue was that some media source are intentionally ignored by the topic spider for the purposes of discovery (such as media sources with foreign_rss_links set to true, which indicates that the media source should be ignored because it includes stories with urls that belong to foreign domains).
in general, these #spider and non-#spider media sources should be merged, but there's probably not a hard rule about which one you should use as the master.
Fortunately I think our shortest-url approach should work well enough as such a hard rule.
I don't have permission to upload to that Google Drive folder at present, but here's the latest deduping:
https://drive.google.com/file/d/1BYePH2w6RhXHPU-eivYF1ZJDhlFBAFLB/view?usp=sharing
My bad, could you try again?
https://drive.google.com/drive/u/0/folders/1I77rqmbwtiqgoh3c7epFwt49WiTxwUtH
(I've uploaded your most recent SQLite3 database to it already.)
Also, could you add a "-yyyymmdd" suffix to the database filenames?
I haven't looked at the actual database yet, I'll do it a bit later.
Notebook permalink here:
https://github.com/mediacloud/media-dedupe-notebook/blob/master/media_dedupe.ipynb
This link points to the latest version of the file (notebook) in the "master" branch, so once you update the notebook, this link will no longer point to this particular version of it. We'll need that to be able to do comparisons between different iterations of the notebook.
To get a permalink, one can click on three dots at the top right of the GitHub page and select "Copy permalink".
I see what you mean now about the URLs—in all this database spelunking I'd forgotten about how this stuff actually gets displayed to users 😛. I've added in the notebook a step to address the :443 issue.
Thanks James!
Re: GARBAGE_DOMAINS, I still don't really get the distinction; you could have a spam site listing discount hotel rooms, or conversely you could have blogs linking to a particularly comical Xanax broker. In any case, I think it would be valuable to hear from the researchers on this.
Let me know if it's still unclear to you and I'll try to rephrase it.
Here's from Hal on the #spider issue:
the topic spider has to autogenerate media sources for stories for which it cannot find a matching media source (it tries to find a matching media source based on the hostname in the url of a story that itself did not match an existing story in the database by url).
for some reason that is long lost to the fog of war, it used to be possible for some reason for the topic spider to try to create a media source with a url that already existed in the database. my infinitely wise answer was just to stick '#spider' on the end of the spider generated media source url, so you could end up with a 'http://nytimes.com/' media source and a 'http://nytimes.com/#spider' media source. I think one issue was that some media source are intentionally ignored by the topic spider for the purposes of discovery (such as media sources with foreign_rss_links set to true, which indicates that the media source should be ignored because it includes stories with urls that belong to foreign domains).
in general, these #spider and non-#spider media sources should be merged, but there's probably not a hard rule about which one you should use as the master.
This is one of the reasons why this task resembles 4chan's "1000 US Marines vs a stick" problem:
How this could be a problem is that if we merge medium X (url = 'http://host/#spider' and foreign_rss_links = 'f') into medium Y (url = 'http://host/' and foreign_rss_links = 't') and make the topic mapper no longer create the #spider media sources, it might break the topic spider's discovery (I wonder what that means).
Could you re-add the "foreign_rss_links" column to your output so that we'll get to see how many of media sources (especially the ones to be merged) will have it set to true? I'll try to find out how and why media with foreign_rss_links is ignored.
GitHub stripped my meme:
Dumped the latest db version to Google Drive, notebook permalink here: https://github.com/mediacloud/media-dedupe-notebook/blob/38f77c06c9087780f27ab6ea3a7a2b3f4f79216d/media_dedupe.ipynb
Looks like 1,830 urls in the table have foreign_rss_links; of those, 108 are to be merged with parent sources.
Thank you again James! Give me a day or two to stare at them for a little bit and get my shower thoughts together.
Sure thing, we can huddle about it during our 1:1
Meanwhile, could you look at the "feeds" table (if you haven't already)? Read through the linked issue, and using just the "feeds" table, try to find media_ids that have at least one identical feed URL between them, and post that list somewhere (a new table in the SQLite3 database?).
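Something along these lines should be enough to get started (a rough sketch, assuming the feeds table is exported as (feeds_id, media_id, url) rows):

```python
# Sketch: group media_ids by feed URL and keep only the URLs shared by 2+ media sources.
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def media_sharing_feeds(rows: Iterable[Tuple[int, int, str]]) -> Dict[str, List[int]]:
    by_url: Dict[str, Set[int]] = defaultdict(set)
    for _feeds_id, media_id, url in rows:
        by_url[url.strip().lower()].add(media_id)
    return {url: sorted(ids) for url, ids in by_url.items() if len(ids) > 1}

shared = media_sharing_feeds([
    (349104, 1391, "http://www.pamplinmedia.com/component/obrss/beaverton-valley-times"),
    (349073, 1392, "http://www.pamplinmedia.com/component/obrss/beaverton-valley-times"),
])
# {'http://www.pamplinmedia.com/component/obrss/beaverton-valley-times': [1391, 1392]}
```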
I did some more tinkering to identify the duplicate sources hidden within the feeds table. There we find 3,635 media_ids that can be deduped to 1,353. I added a new table to the database (updated in Google Drive) where one can browse these records; it has the unwieldy name feeds_with_multiple_parent_media.
The tricky part here is that I'm not sure there's an automate-able heuristic we can use to identify the logical parent in these groups. For example, newstalk770.com and cknw.com share the feed URL https://globalnews.ca/feed/; it's only by actually visiting the sites that you can make a judgment about which one should be preeminent (in that case the first 404s and the second redirects to a globalnews.ca relative path).
Additionally, in some of these cases we don't actually want to do the deduplication. For instance, a group of 17 regional newspapers owned by the same conglomerate have a common feed URL https://www.digitalfirstmedia.com/feed/, but the content on their sites (e.g. sentinelandenterprise.com, lowellsun.com) is different. That's because while they share that particular feed url, they also have separate, per-publication feed URLs (e.g. sentinelandenterprise.com/feed, lowellsun.com/feed), the media/feeds relationship being one-to-many.
My suggestion is that, given that this is a relatively small number of additional duplicates (< 1% of the total discovered via the media table), we leave these aside for now, and later I or the incoming data curator can go through the list manually to make merge decisions.
I added a new table to the database (updated in Google Drive) where one can browse these records; it has the unwieldy name feeds_with_multiple_parent_media.
Can you, now and in the future, post a URL to the specific database file that you're referring to?
# it turns out there are lots of disparate sources whose feed URL points to rdfs.org/xmlns.com/w3.org,
# which host sample specs for such feeds; also lots of sites with feed URLs pointing to wordpress template feeds
# side note: maybe we should delete these sources from the media table?
What's a "wordpress template feed"? Could you give an example?
Do those feeds point to "topic" media sources (the ones with #spider)?
Dunno about deleting them, I think deduplicating the useful ones (i.e. the ones with working feeds and / or in collections) is more important, and these ones are just rows in a table.
In [157]: feeds_with_multiple_parent_media_ids
Out[157]:
[{'http://www.kp.ru/rss/allsections.xml': [1760, 47226, 561409]},
 {'http://www.mk.ru/rss/index.xml': [1775, 385506]},
 <...>
 {'http://rss.cbc.ca/lineup/canada-ottawa.xml': [7333, 83917]},
 ...]
What's what in this output? Is there a way to present it more nicely, e.g. a table?
It's also truncated at the end. If it's too many rows for the notebook to handle, could you "publish" it as a SQLite3 table?
The tricky part here is that I'm not sure there's an automate-able heuristic we can use to identify the logical parent in these groups. For example, newstalk770.com and cknw.com share the feed URL https://globalnews.ca/feed/; it's only by actually visiting the sites that you can make a judgment about which one should be preeminent (in that case the first 404s and the second redirects to a globalnews.ca relative path).
Don't we have a media source for globalnews.ca (if we don't, that's not great)?
How many of those un-automatable potential merges do we have?
Additionally, in some of these cases we don't actually want to do the deduplication. For instance, a group of 17 regional newspapers owned by the same conglomerate have a common feed URL https://www.digitalfirstmedia.com/feed/, but the content on their sites (e.g. sentinelandenterprise.com, lowellsun.com) is different. That's because while they share that particular feed url, they also have separate, per-publication feed URLs (e.g. sentinelandenterprise.com/feed, lowellsun.com/feed), the media/feeds relationship being one-to-many.
Thank you for spotting this. We still have to get rid of that shared feed somehow. What would be your proposed heuristics for that?
My suggestion is that, given that this is a relatively small number of additional duplicates ( < 1% of the total discovered via the media table) that we leave these aside for now, and later I or the incoming data curator can go through the list manually to make merge decisions.
Sorry, but no:
- this is part of the current task (deduplicating the media table) - even though you seem to disagree
- duplicates in media alone aren't that impactful, i.e. it's technically great that we merge longdeaddomain.com and longdeaddomain.com/#spider together, but the ones that do have feeds from which we actually fetch content are the most important, and we really want to get those in order if we can
https://drive.google.com/file/d/1PQBTXHGzHjcvVr-A2R-4AeblTNekrdYx/view?usp=sharing
^That latest SQLite database has the table feeds_with_multiple_parent_media, meaning feed URLs that are associated with at least two different media sources flagged as parents based on the deduping of the media table. There's one entry per feed_id, even though there are many duplicate feed urls in the table.
Here's an example of what I mean by "WordPress template" feed:
You can see that full list of results by running the query
SELECT m.url AS media_url, f.feeds_id, f.media_id, f.url AS feed_url, f.occurences, f.type, f.active
FROM duplicate_feeds f
JOIN deduped_media m
ON m.media_id = f.media_id
WHERE f.url = "https://wordpress.com/blog/feed/"
There are some #spider media URLs in that result set, but most are not.
Yes, globalnews.ca also has an entry in the media table. It looks like at some point the URLs for their local affiliates (e.g. cknw.com) started redirecting to globalnews.ca section pages (https://globalnews.ca/radio/cknw). The old affiliate URLs remain in the media table, however.
Based on my investigation there are 3,635 parent media sources associated with 1,353 shared feed URLs (so media sources A, B, and C share the feed URL foo.com, D and E share bar.com, etc.). As with the case of Digital First Media, the conclusion upon examination of some groups will be that no media merging needs to happen, but that a non-substantive shared feed (https://www.digitalfirstmedia.com/feed/) should be deleted. I'm not sure of a way to make decisions for each of those 1,353 groups besides manually checking each of the media URLs and feed URLs, but of course there may be a method I haven't thought of. Unless you've got another idea I'll find a two-hour trance mix and start going through that list.
Thanks James, give me a day to stare at it for a little while.
👍
Updated database with a top_sources table comprising the ~15k sources from our most important collections (per Emily):
https://drive.google.com/file/d/1lO8BaSyylbpsYDY57KFDw7tIZDFXEFNa/view?usp=sharing
James, could you address Emily's notes? I've sent a reply with my proposed solutions.
Have you decided on a course of action regarding the duplicate feeds or do you want me to take a look?
I can address Emily's notes—I'll post an updated permalink + sqlite db with the changes reflected. In the meantime, I've been going through the duplicate feeds and associated sites manually to make merge determinations. I've spent a lot of time poking at it and haven't come up with a way to automate. Imagine the case of foo.com and bar.com sharing the RSS feed baz.com/rss: in some instances foo.com is defunct and redirects to bar.com, in some cases the other way round, and in some cases both foo.com and bar.com are valid sites associated with the nonfunctional feed baz.com/rss in addition to their own distinct feeds. This is the situation we face. I've got a spreadsheet with merge decisions and delete-this-feed decisions here:
https://docs.google.com/spreadsheets/d/179zq_SfTTaLoQ8abIvgt4nv2pOt7Yq40IsrA2XIRC4M/edit?usp=sharing
It's very tedious work, but I'm through about 1,000 of the 2723 sites that need to be reviewed, so hopefully I'll be done by the end of the week.
Okay, I think we may be ready for phase 2. I've incorporated the feedback from Emily+Linas and added a couple more clean-up steps of my own in the notebook:
New sqlite database here: https://drive.google.com/file/d/1lO8BaSyylbpsYDY57KFDw7tIZDFXEFNa/view?usp=sharing
Feeds-based deduplication CSV here (to be run after the initial deduplication based on the media table): https://github.com/mediacloud/media-dedupe-notebook/blob/ad836184f57ae6429d11b5301e42f6fed783095e/media_dedupe.ipynb
https://drive.google.com/file/d/1Y4UBDbZVcV3_Xmz1MOHnZpnb2F8tfGAN/view?usp=sharing
I think the link to the feeds CSV is wrong, can't find it in the repo at least.
Could you merge the decision lists on duplicate media + feeds into one? Those two lists are here to serve the same purpose, and merging / deleting feeds is a subset of merging / deleting media sources, so we should do it in one go.
I think the link to the feeds CSV is wrong
I'm so dumb, sorry—corrected above.
Here's a final media merge list, comprising duplicates found in both feeds and media:
https://drive.google.com/file/d/1LKjyOWiCTsrstNADMwQmwUmVm8Sw4hRR/view?usp=sharing
Feeds to be deleted here:
https://drive.google.com/file/d/1KvKqkNlbXJWtgj5_X-3RkJkKWLvT6tfp/view?usp=sharing
Could you please:
- Name the files with a YYYYMMDD prefix instead of MMDDYYYY? It's consistent with how the rest of the files are named, and YYYYMMDD just gets things sorted more nicely.
Lastly, I still think that we need to work out a way to merge the media source decision list and the feed decision list into one, to make it into a single "list of actions that have to be done to get everything deduplicated nicely" list that would be easy to act upon in the so-called stage two of the project.
To do that, we have to "sneak peek" into what doing the actual media / feed merging will entail in a technical sense.
For example, the workflow to merge one feed (src_feeds_id) into another (dst_feeds_id) will look as follows:
1. Merge rows that reference the feeds table with <src_feeds_id>:
   - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
   - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
   - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
   - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
   - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
2. Delete leftover rows with feeds_id = <src_feeds_id> from tables that reference the feeds table:
   - delete rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
   - delete rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
   - delete rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
3. Delete the row with feeds_id = <src_feeds_id> from feeds.
And the workflow to merge one media source (src_media_id) into another (dst_media_id) will look something like this:
1. Unset (set to NULL) the media.dup_media_id column as it's not useful anymore.
2. Merge rows that reference the media table with <src_media_id>:
   - set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the media_rescraping table, taking into account that there could be duplicates
   - set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the media_stats table, taking into account that there could be duplicates
   - set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the feeds table, taking into account that there could be duplicates; those duplicate feeds need to be merged and not simply deleted because there could be stories and such associated with those feeds; so essentially at this point we have to - again - merge a bunch of <src_feeds_id> into <dst_feeds_id> for every feed in <src_media_id>:
     - merge rows that reference the feeds table with <src_feeds_id>:
       - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
       - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
       - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
       - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
       - set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
     - delete leftover rows with feeds_id = <src_feeds_id> from tables that reference the feeds table:
       - delete rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
       - delete rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
       - delete rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
     - delete the row with feeds_id = <src_feeds_id> from feeds.
   - (do similar media_id merging magic to a bunch of other tables)
   - (do similar media_id magic to Solr as media_ids get stored there too)
3. Delete leftover rows with media_id = <src_media_id> from tables that reference the media table:
   - delete rows with media_id = <src_media_id> from the media_rescraping table - there shouldn't be any left as we've just merged them
See the overlap? Merging one feed into another is an integral part of the media merging, just like deleting the media source is, so we can use the same exact workflow to merge / delete both media sources and the individual feeds. The only issue is to think of a convenient way to define inputs for such a workflow.
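To make the overlap concrete, here's a rough sketch of just the feed-merge step, assuming psycopg2 and a uniqueness constraint on feeds_stories_map (feeds_id, stories_id); this is not the actual workflow code, and the remaining tables would follow the same pattern:

```python
# Sketch only: merge one feed into another, tolerating rows that already exist under the target.
import psycopg2  # conn below is assumed to be a psycopg2 connection

def merge_feed(conn, src_feeds_id: int, dst_feeds_id: int) -> None:
    params = {"src": src_feeds_id, "dst": dst_feeds_id}
    with conn.cursor() as cur:
        # downloads: no duplicate concerns flagged above, so a plain UPDATE is enough.
        cur.execute("UPDATE downloads SET feeds_id = %(dst)s WHERE feeds_id = %(src)s", params)
        # feeds_stories_map: only move rows that wouldn't collide with an existing
        # (dst_feeds_id, stories_id) pair, then drop whatever is left under the source.
        cur.execute(
            """
            UPDATE feeds_stories_map AS fsm
               SET feeds_id = %(dst)s
             WHERE feeds_id = %(src)s
               AND NOT EXISTS (
                   SELECT 1
                     FROM feeds_stories_map AS existing
                    WHERE existing.feeds_id = %(dst)s
                      AND existing.stories_id = fsm.stories_id
               )
            """,
            params,
        )
        cur.execute("DELETE FROM feeds_stories_map WHERE feeds_id = %(src)s", params)
        # ...repeat the same pattern for scraped_feeds, feeds_from_yesterday, feeds_tags_map...
        # Finally, drop the now-orphaned source feed row.
        cur.execute("DELETE FROM feeds WHERE feeds_id = %(src)s", params)
    conn.commit()
```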
As always, let me know if you have any questions or need clarification.
Latest DB here:
https://drive.google.com/file/d/1miYPjEGS7DNBHLsoj6oJBk0-sW3fxO0a/view?usp=sharing
The two tables relevant for the merging are deduped_media_final, which contains action steps for media merging, and feed_actions, which lists feeds that are to be merged with other feeds or, if the column value parent_feed_id is NULL, should be deleted. If we decide we actually don't want to delete any feeds, we can ignore those rows. Per our discussion on Friday, I decided to leave aside the task of deleting spam media sources.
Some weirdness here and there:
deduped_media_final:
Get rid of duplicates.
For example, the table has a bunch of duplicates with different decisions about them, e.g. there are two entries with media_id = 43 (media_uuid = 'newsok.com', media_url = 'http://newsok.com/') - the decision about one of them is to skip it, the other is destined to be merged into the media source with media_id = 279980.
Make sure media sources with a decision to merge(...) point only to target media sources with the action being skip.
Given that we'll do the media merging in parallel, we can't decide to merge media source A into media source B and then media source B into media source C, or media source A into media source B and then delete media source B, as we don't know in which order the media sources will be merged.
For example:
Media source with media_id = 0 (media_uuid = NULL, media_url = 'http://dev.null') is set to be merged into media source with media_id = 21097 (media_uuid = NULL, media_url = 'http://http:'); however, media source with media_id = 21097 is set to be deleted. If one workflow attempts to merge media source A into media source B while another workflow is busy deleting media source B, we might run into operational issues. Could you look into and address cases like that?
A bunch of newsok.com duplicates (i.e. media sources with media_id IN (20653, 164137, 174357, 370924, 489444, 645972, 888425, 1438178)) are set to be merged into the media source with media_id = 43 (media_uuid = 'newsok.com', media_url = 'http://newsok.com/'), but this media source itself is set to be merged into the media source with media_id = 279980. Similarly to the issue with media_id = 0, this would force us to do the merging and deleting in a particular order, which we won't be doing.
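A small sketch of how such chains could be flattened automatically before the workflows run, assuming the decision list gets loaded as a media_id-to-target mapping (names are illustrative):

```python
# Sketch: follow merge(...) chains to their final target and flag merges into
# sources that are themselves set to be deleted.
from typing import Dict, Set

def resolve_target(media_id: int, merge_into: Dict[int, int], deleted: Set[int]) -> int:
    """Walk merge chains (A -> B -> C) until we hit a source that isn't merged further."""
    seen = set()
    current = media_id
    while current in merge_into:
        if current in seen:
            raise ValueError(f"merge cycle involving media_id {media_id}")
        seen.add(current)
        current = merge_into[current]
    if current in deleted:
        raise ValueError(f"media_id {media_id} would be merged into deleted source {current}")
    return current

# e.g. 20653 -> 43 -> 279980 flattens to 20653 -> 279980
merge_into = {20653: 43, 43: 279980}
assert resolve_target(20653, merge_into, deleted=set()) == 279980
```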
Address what seems to be a cruft catch-all media source with media_id = 21097.
Weirdly enough, quite a few media sources (2,670 to be exact) have a decision to be merged into the media source with media_id = 21097 (media_uuid = NULL, media_url = 'http://http:'); one can find them by searching for merge(21097) in the action column.
Most of those media sources have invalid URLs, e.g.:
http://81.79/
http://cran.r-projectorg/
http://lizard.home.inr.net)./
http://google.comdirgharajprasai/
arles-munger-gives-nearly-2-million-to-influence-assembly-races.html#spider
view-source:https://www.univision.com/dallas/kuvn
but some of these URLs are valid too, e.g.:
http://gov.mo/
http://kharkov.ua/
http://pi.gov.br/
http://gov.bn/
http://64.233.169.104/#spider
(If Brunei's government websites have domains of DEPARTMENT.gov.bn, e.g. http://moh.gov.bn/, then it's totally valid for Brunei to have a http://gov.bn/ website too.)
I'm guessing that you need to improve your URL validator somewhere. The valid URLs from the list need to stay (skip or merge(...)), and the invalid ones should be deleted (or skipped).
(It's not the same as deleting spam media sources, as one has to judge what's spam and what's not; here you can just automatically pick invalid URLs for deletion. However, not deleting those cruft media sources is an option too, you can just skip them, but in any case they shouldn't get merged into 21097.)
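A rough sketch of the kind of validity check I have in mind, which would keep http://gov.bn/ and http://64.233.169.104/#spider but reject http://81.79/ (assuming tldextract plus the standard library's ipaddress module):

```python
# Sketch: treat a URL as valid if it parses and its host is either an IP address
# or a hostname whose suffix tldextract recognizes.
import ipaddress
from urllib.parse import urlsplit
import tldextract

def looks_valid(url: str) -> bool:
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass  # not an IP address; fall through to the domain check
    ext = tldextract.extract(url)
    return bool(ext.domain) and bool(ext.suffix)

assert looks_valid("http://gov.bn/")
assert looks_valid("http://64.233.169.104/#spider")
assert not looks_valid("http://81.79/")
```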
stuff.co.nz from Emily's email:
Stuff.co.nz -- shouldn't be grouped, different publications. We ran into this on a paper involving New Zealand media a couple of years ago.
James, could you add an exception for those? https://www.stuff.co.nz/manawatu-standard/ and https://www.stuff.co.nz/marlborough-express/ do in fact seem to be two different media sources :)
feedburner.com from Emily's email:
feeds.feedburner.com -- looks like an amalgamation of a lot of unrelated things, including some of our Robert Wood Johnson work?
Interesting catch! feedburner.com is an external RSS service, those media.url fields point to RSS feeds (e.g. https://feeds.feedburner.com/birthmoviesdeath) instead of the media source URL (e.g. https://birthmoviesdeath.com/) as they should.
James, could you either:
- For every feeds.feedburner.com URL, figure out which media URL it points to (e.g. https://birthmoviesdeath.com/ in the example above) and add a fourth option in the "action" column to simply update the URL? We could use this action to update a bunch of other URLs too, e.g. fix the "http://example.com:443" entries that we have discussed before (one way to do the lookup is sketched right after this list); OR
- Add another exception to your notebook to skip feeds.feedburner.com URLs and leave this problem for future generations? :)
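A sketch of the lookup step for the first option, assuming the FeedBurner feeds are still reachable and publish a usable website <link> element (feedparser here is a suggestion, not necessarily what the notebook would use):

```python
from typing import Optional

import feedparser  # pip install feedparser


def media_url_for_feed(feed_url: str) -> Optional[str]:
    """Fetch the feed and return the website URL it declares, if any."""
    parsed = feedparser.parse(feed_url)
    if parsed.bozo and not parsed.feed:
        return None
    return parsed.feed.get("link")


# e.g. media_url_for_feed("https://feeds.feedburner.com/birthmoviesdeath")
# would be expected to return something like "https://birthmoviesdeath.com/",
# which could then drive the proposed "update URL" action.
```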
bizjournals.com from Emily's email:
Bizjournals.com -- shouldn't be grouped, different publications
James, could you add an exception for those?
buzzsprout.com from Emily's email:
buzzsprout.com -- looks like an aggregator of different sites
James, could you add an exception?
feed_actions:
http://mont.ro/alt/rss/index_rss.xml existed in the past, so we could have gotten some stories from it. If we were to delete it, we'd have to delete referencing rows from the feeds_stories_map table, and thus from the stories table too. In the feeds table, just like in media, we should delete either 1) broken URLs that would never have worked, or 2) nothing.
(By the way, DB Browser for SQLite makes it really easy to search for arbitrary strings in the SQLite3 database.)
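For reference, seeing how many stories a given feed actually contributed is a single query; a sketch, assuming direct access to the production Postgres database, the usual feeds_stories_map(feeds_id, stories_id) columns, and an illustrative feed ID:

```python
import pandas as pd
import sqlalchemy  # plus a Postgres driver, e.g. psycopg2

# Placeholder connection string.
pg = sqlalchemy.create_engine("postgresql://user:password@db-host/mediacloud")

# 12345 is an illustrative feed ID; column names are assumed.
referencing_stories = pd.read_sql_query(
    "SELECT COUNT(*) AS story_count FROM feeds_stories_map WHERE feeds_id = 12345",
    pg,
)
print(referencing_stories["story_count"].iloc[0])
```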
Thanks for the close look!
For example, table has a bunch of duplicates with different decisions about them, e.g. there are two entries with media_id = 43 (media_uuid = 43, media_url = 'http://newsok.com/') - the decision about one of them is to skip it, the other is destined to be merged into media source with media_id = 279980.
Fixed now
Address what appears to be a cruft catch-all media source with media_id = 21097.
Adding a manual step to make sure we skip all of these.
buzzsprout, bizjournals, stuff.co.nz, feedburner
Handling these appropriately now.
What's the reason for deleting some (177) feeds again?
We decided not to delete them; refer to the feed_actions table for a list of just the feeds that are to be merged into other feeds.
Here's the latest iteration of the database:
https://drive.google.com/file/d/1Z_ACwR8GeA_RDwKJjV8irAPTw9Io-rhP/view?usp=sharing
CSV with actions for the 15k top sources (also emailed to Emily + Fernando):
https://drive.google.com/file/d/1QH1DrIFrTW4fBNjlqTaeLtbXDPBtoLTz/view?usp=sharing
Permalink for latest version of the notebook: https://github.com/mediacloud/media-dedupe-notebook/blob/6e331c0c432ae2bedbbed398ea298572a47b726b/media_dedupe.ipynb
What happened to the pamplinmedia.com feeds from #189? For example, feed URL http://www.pamplinmedia.com/component/obrss/portland-tribune belongs to a few sources:
select media_id
from feeds
where url = 'http://www.pamplinmedia.com/component/obrss/portland-tribune'
media_id
--
1391
1392
1394
1396
1397
1399
1400
1401
which were being merged into each other in the 20210926 database (at least that was one of the duplicate actions):
select *
from deduped_media_final
where media_id in (1391, 1392, 1394, 1396, 1397, 1399, 1400, 1401)
media_id | media_uuid | media_url | action
-- | -- | -- | --
1391 | beavertonvalleytimes.com | http://beavertonvalleytimes.com/news/index.php | merge(86434)
1391 | beavertonvalleytimes.com | http://beavertonvalleytimes.com/news/index.php | skip
1392 | canbyherald.com | http://canbyherald.com/ | merge(86434)
1392 | canbyherald.com | http://canbyherald.com/ | skip
1394 | estacadanews.com | http://estacadanews.com/news/index.php | merge(86434)
1394 | estacadanews.com | http://estacadanews.com/news/index.php | skip
1396 | lakeoswegoreview.com | http://lakeoswegoreview.com/news/index.php | merge(86434)
1396 | lakeoswegoreview.com | http://lakeoswegoreview.com/news/index.php | skip
1397 | fgnewstimes.com | http://fgnewstimes.com/news/index.php | merge(86434)
1397 | fgnewstimes.com | http://fgnewstimes.com/news/index.php | skip
1399 | sandypost.com | http://sandypost.com/news/index.php | merge(86434)
1399 | sandypost.com | http://sandypost.com/news/index.php | skip
1400 | tigardtimes.com | http://tigardtimes.com/news/index.php | merge(86434)
1400 | tigardtimes.com | http://tigardtimes.com/news/index.php | skip
1401 | westlinntidings.com | http://westlinntidings.com/news/index.php | merge(86434)
1401 | westlinntidings.com | http://westlinntidings.com/news/index.php | skip
but in the 20211008 database they're all being skipped.
(The 20210926 database had a deduped_media_final table and the 20211008 one doesn't. If you're doing some manual tailoring or something for the "final" table, it has to be reflected in the notebook as well.)
What happened to pamplinmedia.com feeds?
Ugh, sorry: in this most recent effort I forgot to pull in the media duplicates found via the feeds table to override the initial decisions based just on the media table. I've rectified that now by updating the notebook to read the manually created CSV which identifies duplicate media sources based on data from the feeds table. That manually created CSV has been committed to the media-dedupe-notebook repo, as has the CSV listing feed merges (which appears in the SQLite DB as the feed_actions table).
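The override boils down to something like this; a sketch with assumed file and column names, not the notebook's literal code:

```python
import pandas as pd

decisions = pd.read_csv("initial_media_decisions.csv")  # hypothetical: media_id, action (from the media-table pass)
overrides = pd.read_csv("media_dups_from_feeds.csv")    # manual, feeds-based decisions: media_id, action

override_by_id = overrides.set_index("media_id")["action"]

# Manual feeds-based decisions win over the automatic media-based ones.
decisions["action"] = decisions["media_id"].map(override_by_id).fillna(decisions["action"])
```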
If you're doing some manual tailoring or something for the "final" table, it has to be reflected in the notebook as well.
Sorry, you are correct; added those steps to the notebook.
Latest db: https://drive.google.com/file/d/16mKG-g-YJyAX-iArXrEr13OBZL65KHU1/view?usp=sharing
Notebook permalink: https://github.com/mediacloud/media-dedupe-notebook/blob/b5bda58229712fb215fc3cf116caa1c48d30007e/media_dedupe.ipynb
Latest DB file: https://drive.google.com/file/d/1AtHMp9kNlSavD6tOwHCzeyfSZ9jlDZHE/view?usp=sharing
Notebook permalink: https://github.com/mediacloud/media-dedupe-notebook/blob/68e84b8072d0ac93aef4d157e81fa2faa5a68cbd/media_dedupe.ipynb
Feedback from Emily + Ryan + Fernando related mainly to the names of sources; there was no indication of sources being incorrectly merged.
Where is dup_feed_urls_merges.csv coming from in feeds_dedupe.ipynb? Is it the same file as media_dups_from_feeds.csv?
(If by any chance you no longer find running the Python notebook convenient and instead went for having a separate set of scripts for generating the actual merge list, then consider getting rid of the notebook in the repo and committing the Python scripts that you're actually using instead. Just guessing here.)
Also, can those x's (of varying case, x and X) in delete_feed be booleans (SQLite3 kinda supports those)?
Lastly, how do I see which feed gets merged into what? parent_feed_id is missing from the feed_actions table.
Where is dup_feed_urls_merges.csv coming from in feeds_dedupe.ipynb? Is it the same file as media_dups_from_feeds.csv?
Sorry, yes. What's happening there is that feeds_dedupe.ipynb generates the file dup_feed_urls.csv, which comprises media sources that share a common feed URL but haven't been flagged for merging via the media table. From there, I used that CSV for the manual process of identifying duplicate media sources based on shared feed URLs. That manually created file is dup_feed_urls_merges.csv, which I then pulled into media_dedupe.ipynb as media_dups_from_feeds.csv (brain fart in not making those filenames the same; I've updated the feeds_dedupe notebook to fix that). feed_actions.csv similarly comes via the manual process of identifying feeds that should be merged even though the media sources sharing the feed URL don't need to be merged.
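For the record, the shared-feed-URL grouping amounts to roughly this; a sketch with assumed file and column names, not the notebook's literal code:

```python
import pandas as pd

feeds = pd.read_csv("feeds.csv")                # assumed columns: feeds_id, media_id, url
decisions = pd.read_csv("media_decisions.csv")  # hypothetical: media_id, action

# Media sources that are already being merged or deleted don't need flagging again.
already_flagged = set(decisions.loc[decisions["action"] != "skip", "media_id"])

shared = (
    feeds.groupby("url")["media_id"]
    .apply(lambda ids: sorted(set(ids)))
    .loc[lambda s: s.map(len) > 1]              # feed URL used by more than one media source
    .reset_index()
)

# Keep only groups with at least one member that isn't already flagged.
shared = shared[shared["media_id"].map(lambda ids: any(m not in already_flagged for m in ids))]
shared.to_csv("dup_feed_urls.csv", index=False)
```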
Also, can those x's (of varying case, x and X) in delete_feed be booleans?
Given that we decided to skip feed deletion for the time being, I just removed entries from the table.
How do I see which feed gets merged into what? parent_feed_id is missing from the feed_actions table.
Updated that table.
Latest DB here: https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing
Permalink to latest iteration of media_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/media_dedupe.ipynb
Permalink to latest iteration of feeds_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/feeds_dedupe.ipynb
Looks good to me now!
So, the final (hopefully) list is:
Latest DB here: https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing
Permalink to latest iteration of media_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/media_dedupe.ipynb
Permalink to latest iteration of feeds_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/feeds_dedupe.ipynb
Thanks James!
Hey James,
Problem: we somehow ended up with lots of duplicate media sources in the media / feeds tables, which causes various problems both for us and the users, so we need to clean them up and make sure we don't end up with duplicate media again.
To reduce scope creep, let's do first things first and find out / decide what should get merged into what (or deleted altogether).
Hal tried this three years ago in https://github.com/mediacloud/backend/issues/380 and came up with a draft list. Here's Hal's list of duplicates taken from https://github.com/mediacloud/backend/issues/380#issuecomment-395773905 (slightly formatted by me):
https://docs.google.com/spreadsheets/d/1YG-zfu-1J3_ZmHwi-d9XPILm8628Ypnn-9RYb2N61Ks/edit#gid=809541032
(https://github.com/mediacloud/backend/issues/380#issuecomment-395773905 has a comment on what each column means.)
Hal mentions that the list needs a review. Some observations of mine about the list:
- domains with multi-part suffixes (e.g. .co.uk) don't seem to be handled, e.g. .edu.au sources all get merged into a single edu.au "source";
- subdomains of blogspot.com get treated as separate media sources (e.g. donpesci.blogspot.com) but blogspot.com.au ones not so much;
- there are a bunch of spammy sources on us.com or newoffers.info domains; those probably got in through topic mapper? Maybe use this as an opportunity to get rid of them?
- the list should also take the feeds table into account, as some duplicates reside in feeds and not media (as I've complained about in https://github.com/mediacloud/backend/issues/189);
- *.tripod.com media sources get merged into a single tripod.com, same for some other blogging platforms, e.g. github.io;
- num_stories and num_sentences don't really resemble the truth; I've tried a few media sources with a reported num_stories = 0 and num_sentences = 0, and all of them had at least a single story; I think this is because that information was derived from the media_health table, and updating media_health is probably broken;
- there are a bunch of rows in media that point to various static file stores (files.wordpress.com, redditstatic.s3.amazonaws.com) - those are funny because they typically have a few "stories" that are JPEGs that we have treated as HTML and tried to extract.
Given that we need to update the list anyway (or, rather, generate the final definitive "merge media ID A into media ID B" list somehow), and that three years have passed since the list got generated (i.e. new duplicate media sources got added), we need to come up with a new list of what gets merged into what.
Deliverable: a list of all media IDs (with names and other useful reference information perhaps) and decisions of what should be done with each: a) skip - media source stays in place; b) merge(media_id) - media source's feeds / stories get merged into media_id and the media source gets deleted; or c) delete - media source and all of its feeds / stories get deleted.
Proposed steps:
Find out how this original list got generated.
I don't know how exactly (i.e. using which script) the list above got generated, and we didn't have an agreement on whether we're going to post PRs to each other back then. Some grepping points to https://github.com/mediacloud/backend/blob/master/apps/cron-print-long-running-job-states/bin/find_domain_media_dups.py (no idea why it's in the cron-print-long-running-job-states app), which does something similar, and https://github.com/mediacloud/backend/blob/master/apps/tools/bin/dev/dedup_topic_media.pl, which seems to deduplicate topic_media (I think). I'm also not sure how much human involvement there was in generating the list, i.e. did someone manually eyeball each media source? Also, try asking Hal; if he replies, post the response here so that we have something to reference later.
Dump media and feeds tables.
It might be easier to write a script that finds duplicates when the data that's to be deduplicated is on your laptop.
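A sketch of the dump step, assuming direct access to the production Postgres database (the connection string and output file name are placeholders):

```python
import pandas as pd
import sqlalchemy  # plus a Postgres driver, e.g. psycopg2

pg = sqlalchemy.create_engine("postgresql://user:password@db-host/mediacloud")
local = sqlalchemy.create_engine("sqlite:///media_dedupe.sqlite3")

# Copy both tables verbatim into a local SQLite file, chunked to keep memory bounded.
for table in ("media", "feeds"):
    first_chunk = True
    for chunk in pd.read_sql_table(table, pg, chunksize=100_000):
        chunk.to_sql(table, local, if_exists="replace" if first_chunk else "append", index=False)
        first_chunk = False
```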
(Probably) Get story / sentence counts for each media source from stories / story_sentences tables.
You might need this to evaluate which media sources to keep.
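A sketch of the counts query (media_health appears to be stale, so counting straight from stories is safer; the connection string is a placeholder, and story_sentences could be counted the same way):

```python
import pandas as pd
import sqlalchemy  # plus a Postgres driver, e.g. psycopg2

pg = sqlalchemy.create_engine("postgresql://user:password@db-host/mediacloud")

story_counts = pd.read_sql_query(
    """
    SELECT media_id, COUNT(*) AS num_stories
    FROM stories
    GROUP BY media_id
    """,
    pg,
)
story_counts.to_csv("story_counts_per_media.csv", index=False)
```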
Modify the script to print out solutions for each media source.
Update the script to clean up the list, e.g.:
- handle multi-part suffixes such as .co.za properly;
- get rid of spam sources on domains like us.com;
- decide what to do with blogging / hosting platforms such as github.io or blogspot.<ANY_TLD>.
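For the cleanup, a Public Suffix List based canonicalizer handles multi-part suffixes like .co.za for free, and its "private domains" mode keeps individual blogs on platforms like blogspot.* and github.io separate; a sketch, assuming tldextract is an acceptable dependency:

```python
import tldextract  # pip install tldextract

# include_psl_private_domains=True treats blogspot.com, github.io, etc. as
# suffixes, so individual blogs stay separate media sources.
extract = tldextract.TLDExtract(include_psl_private_domains=True)


def canonical_domain(media_url: str) -> str:
    # "http://www.foo.co.za/news"     -> "foo.co.za"
    # "http://donpesci.blogspot.com/" -> "donpesci.blogspot.com"
    # Returns "" when no recognizable suffix is found (cruft candidates).
    return extract(media_url).registered_domain.lower()
```

Anything whose canonical domain comes out empty would then be a natural candidate for the delete / skip bucket rather than for merging.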
Duplicates / related: https://github.com/mediacloud/backend/issues/390, https://github.com/mediacloud/backend/issues/380, https://github.com/mediacloud/backend/issues/456, https://github.com/mediacloud/backend/issues/396, https://github.com/mediacloud/backend/issues/189