Come up with a final "what media source gets merged into what / deleted" list

mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.

http://www.mediacloud.org

GNU Affero General Public License v3.0

281 stars 87 forks source link

Come up with a final "what media source gets merged into what / deleted" list #799

Closed pypt closed 3 years ago

pypt commented 3 years ago

Hey James,

Problem: we somehow ended up with lots of duplicate media sources in the media / feeds tables which causes various problems both for us and the users, so we need to:

Find out / decide what should get merged into what (or deleted);
Merge duplicate sources into one; and
(Try to) prevent the duplicates from getting added to media again.

To reduce scope creep, let's do first things first and find out / decide what should get merged into what (or deleted altogether).

Hal tried this three years ago in https://github.com/mediacloud/backend/issues/380 and came up with a draft list. Here's Hal's list of duplicates taken from https://github.com/mediacloud/backend/issues/380#issuecomment-395773905 (slightly formatted by me):

https://docs.google.com/spreadsheets/d/1YG-zfu-1J3_ZmHwi-d9XPILm8628Ypnn-9RYb2N61Ks/edit#gid=809541032

(https://github.com/mediacloud/backend/issues/380#issuecomment-395773905 has a comment on what column means what.)

Hal mentions that the list needs a review. Some observations of mine about the list:

Second level TLDs are not well supported (except for .co.uk.), e.g. .edu.au all get merged into a single edu.au "source";
Blogs in blogspot.com get treated as separate media sources (e.g. donpesci.blogspot.com) but blogspot.com.au not so much;
A bunch of spam, e.g. for us.com or newoffers.info domains; those probably got in through topic mapper? Maybe use this as an opportunity to get rid of them?
The script that generated the list didn't take into an account the feeds table as some duplicates reside in feeds and not media (as I've complained about in https://github.com/mediacloud/backend/issues/189);
All *.tripod.com media sources get merged into a single tripod.com, same for some other blogging platforms, e.g. github.io;
num_stories and num_sentences don't really resemble the truth; I've tried a few media sources with a reported num_stories = 0 and num_sentences = 0, and all of them had at least a single story; I think this is because that information was derived from media_health table, and updating media_health is probably broken;

There are a bunch of rows in media that point to various static file stores (files.wordpress.com, redditstatic.s3.amazonaws.com) - those are funny because they typically have a few "stories" that are JPEG that we have treated as HTML and tried to extract:

```
mediacloud=# select * from story_sentences where media_id = 597342;
 story_sentences_id | stories_id | sentence_number |                sentence                | media_id |    publish_date     | language | is_dup 
--------------------+------------+-----------------+----------------------------------------+----------+---------------------+----------+--------
        15675221894 |  706979484 |               0 | ÿØÿá \Exif II* \u0098\u0082 7 ÿì Ducky |   597342 | 2014-02-06 07:00:00 | cs       | 
        15676666766 |  706991160 |               0 | ÿØÿà JFIF d d ÿì Ducky                 |   597342 | 2017-03-17 07:00:00 | cs       | 
(2 rows)
```

Given that we need to update the list anyway (or, rather, generate the final definitive "merge media ID A into media ID B" list somehow), and that three years have passed since the list got generated (i.e. new duplicate media sources got added), we need to come up with a new list of what gets merged into what.

Deliverable: a list of all media IDs (with names and other useful reference information perhaps) and decisions of what should be done with each: a) skip - media source stays in place; b) merge(media_id) - media source's feeds / stories get merged into media_id and the media source gets deleted; or c) delete - media source and all of its feeds / stories get deleted.

Proposed steps:

Find out how this original list got generated.

I don't know how exactly (using which script) did the list above get generated, and we didn't have an agreement on whether we're going to post PRs to each other back then. Some grepping points to https://github.com/mediacloud/backend/blob/master/apps/cron-print-long-running-job-states/bin/find_domain_media_dups.py (no idea why it's in cron-print-long-running-job-states app) which does something similar, and https://github.com/mediacloud/backend/blob/master/apps/tools/bin/dev/dedup_topic_media.pl which seems to deduplicate topic_media (I think). I'm also not sure how much human involvement was there in generating the list, i.e. did someone manually eyeballed each media source? Also, try asking Hal; if he replies, post the response here so that we have something to reference later.
Dump media and feeds tables.

It might be easier to write a script that finds duplicates when the data that's to be deduplicated is in your laptop.
(Probably) Get story / sentence counts for each media source from stories / story_sentences tables.

You might need this to evaluate which media sources to keep.
Modify the script to print out solutions for each media source.
Update the script to clean up the list, e.g.
- Fix treatment of second-level TLDs like .co.za;
- Get rid of spam websites, e.g. us.com;
- Add support for some more blogging platforms, e.g. github.io or blogspot.<ANY_TLD>
- ...

pypt commented 3 years ago

Those .co.zas and such are called second-level domains. There must be a list of those somewhere.

rahulbot commented 3 years ago

I had a very similar process in mind:

write a new canonical_domain(media_url:str) -> str function that accepts an existing media domain or a full story URL and returns the canonical “cleaned up” domain we will use for deciding which media source it is
run said function on all the media sources to generate a list with media_id, existing_url, canonical_url
process said list to group media_id’s by canonical_url and then decide which one of the should be the one to keep… then add a merge_to column to the list with that anointed media_id we will keep for each set

This (similar) high-level process I had in mind leaves me with two questions about your proposal:

How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?
I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?

pypt commented 3 years ago

Well yeah, to accomplish the task (create a list of (media_id, decision) pairs) you need to come up with a way to represent each media source's URL in a canonical way. Our current attempt to do it is using the domain of the media source URL (e.g. https://www.nytimes.com/ gets canonicalized into nytimes.com), but I think this is a good opportunity to call this canonical form something else, e.g. media UID (not the best name though). I suspect that the canonical media identifier will be the source's domain name in 95% cases but let's leave some leeway for that 5% when we'll want to do something different for some sources.

How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?

We can absolutely get rid of a bunch of spam websites (e.g. *.us.com ones) which I'm sure don't serve any purpose anywhere. As for the implementation, the workflow doing the merging will have to delete the row with the old media source anyway (independently from whether it's being merged into something or deleted).

I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?

Have a look at #189: sometimes similar URLs "hide" in the feed URLs and not media URLs. In this particular case, media URLs are different:

# select media_id, url from media where media_id in (select media_id from feeds where url ilike '%pamplinmedia%') order by media_id;

 media_id |                           url                            
----------+----------------------------------------------------------
     1391 | http://beavertonvalleytimes.com/news/index.php
     1392 | http://canbyherald.com/
     1394 | http://estacadanews.com/news/index.php
     1396 | http://lakeoswegoreview.com/news/index.php
     1397 | http://fgnewstimes.com/news/index.php
     1399 | http://sandypost.com/news/index.php
     1400 | http://tigardtimes.com/news/index.php
     1401 | http://westlinntidings.com/news/index.php
    30593 | http://www.newberggraphic.com#spider
    32581 | http://portlandtribune.com#spider
    86434 | http://www.pamplinmedia.com/#spider
   265617 | http://pamplinmediagroup.com/
   653250 | http://portlandtribune.com/the-times-news/
   655213 | http://portlandtribune.com/gresham-outlook-news/
   659712 | http://portlandtribune.com/forest-grove-news-times-news/
   662740 | http://portlandtribune.com/beaverton-valley-times-news/
   662752 | http://portlandtribune.com/estacada-news-news/
   662762 | http://portlandtribune.com/hillsboro-tribune-news/
   662765 | http://portlandtribune.com/regal-courier-news/
   662771 | http://portlandtribune.com/lake-oswego-review-news/
   662776 | http://www.madraspioneer.com/
   662783 | https://portlandtribune.com/clackamas-review-home
   662784 | http://www.molallapioneer.com/
   662789 | http://portlandtribune.com/oregon-city-news-news/
(24 rows)

but they all redirect to pamplinmedia.com for which we have a bunch (254) of identical feeds:

# select feeds_id, media_id, url from feeds where url ilike '%pamplinmedia%' order by media_id, url;

 feeds_id | media_id |                                      url                                       
----------+----------+--------------------------------------------------------------------------------
<...>
   349104 |     1391 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
   349083 |     1391 | http://www.pamplinmedia.com/component/obrss/canby-herald
   349099 |     1391 | http://www.pamplinmedia.com/component/obrss/central-oregonian
<...>
   349073 |     1392 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
   349052 |     1392 | http://www.pamplinmedia.com/component/obrss/canby-herald
   349068 |     1392 | http://www.pamplinmedia.com/component/obrss/central-oregonian
<...>
   349133 |     1394 | http://www.pamplinmedia.com/component/obrss/beaverton-valley-times
   349112 |     1394 | http://www.pamplinmedia.com/component/obrss/canby-herald
   349128 |     1394 | http://www.pamplinmedia.com/component/obrss/central-oregonian
(254 rows)

End result is that the crawler hits the poor pamplinmedia.com server too often so they got us blocked and we don't get anything from them anymore. There are issues like that with quite a few of other media sources too. Instead of disabling feeds for those media sources or something like that, we can use this opportunity to just merge them.

I don't think it's hard to implement - you compare the sets of feed URLs between two media sources, and if they're something like 70% identical, it means that one media source has to be merged into the other.

pypt commented 3 years ago

James, did Hal reply?

On Thu, Jul 29, 2021, 04:50 rahulbot @.***> wrote:

I had a very similar process in mind:

write a new canonical_domain(media_url:str) -> str function that accepts an existing media domain or a full story URL and returns the canonical “cleaned up” domain we will use for deciding which media source it is

run said function on all the media sources to generate a list with media_id, existing_url, canonical_url

process said list to group media_id’s by canonical_url and then decide which one of the should be the one to keep… then add a merge_to column to the list with that anointed media_id we will keep for each set

This (similar) high-level process I had in mind leaves me with two questions about your proposal:

How do you decide whether to delete a source? This seems pretty problematic and hard... how do we know that source wasn't part of some paper or study or something. If there is a good way to decide one we want to delete, could we instead just make sure it isn't pulling in new stories?

I'm not sure how feeds fit into this... I see your note that we sometimes have the same feed in multiple sources. If the same feed is in multiple sources that are not intended to be merged, then how do we decide what to do with them? Manual review? Can we just ignore the feeds issue, or separate it as a separate task?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mediacloud/backend/issues/799#issuecomment-888736146, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABMNPNRJ5Z6V3AUFLTVO43T2CXWRANCNFSM5BAYFTXA .

jtotoole commented 3 years ago

not yet, will update here when i hear from him

jtotoole commented 3 years ago

Here's from Hal:

the find_domain_media_dups.py is the script that prints what you want -- a list of media that would be merged if you switch to unique media domains, at a glance, it treats as merge candidates any pair of media sources that either shares a normalized domain or are already connected by a dup_media_id reference in the media table.

the dedup_topic_media.pl script is the thing that generates the dup_media_id fields. it also works by comparing normalized media domains, but its purpose is just to set the work around dup_media_id field in the media table. that field is mostly just used to tell the topic spider to merge stories from media joined by dup_media_id (so if the topic spider thinks a story should belong to media_id 10, but media_id 10 has a dup_media_id of 1, then all stories that would have gone into media_id 10 instead go into media_id 1). I used to have to run dedup_topic_media all the time to clean up duplicate media in topics, but I only had to run it maybe a couple times in the last couple of years just because we already have the vast majority of dup media marked in big US media.

the vast majority of those dup_media_id values were set by running the dedup_topic_media script, but every once in a while I would just set the dup_media_id manually when I found a duplicate (mostly someone would report a duplicate media source within a topic that had run).

I ran these sorts of scripts by firing up a temporary custom container. you can see a super handy script in my home directory called 'bin/start-container.sh' that will start a container of a given image with all of the necessary env vars set for the container config. I think I also had another script to dump new config variables into start-container.sh, but in practice the config variables almost never changed. that start-container.sh script was probably my handiest and most frequently used productivity script. I'm sure I ran the find_domain_media_dups script by running start-container.sh in a screen session and just running the script in the screened container.

the big things off the top of my head as you jump into this are:

how to handle merging the media set memberships of duplicate media

how to handle merging the stories from duplicate media, keeping in mind that stories updates in postgresql are super slow because of all the triggers and references in the stories table

how to handle updating old topics that reference media and stories that will no longer exist once you start merging media

I've been working on a script that builds on Hal's find_domain_media_dups.py. I'll post a link on this thread once it's in a semi-presentable state.

jtotoole commented 3 years ago

On the subject of how to determine which source among duplicates becomes the parent (i.e., the one to which all others are merged):

in general, we choose the media source with the best content. it's often hard to figure that out. the dedup_topic_media script is my best attempt at trying to figure out when I can just automatically decide which media source to merge into the other.

otherwise, you just have to stare at them and make a decision manually, which is why that script runs in interactive mode.

I have an idea for attempting to automate this further which I'll demonstrate in my script; I'll post that here when ready

pypt commented 3 years ago

Could you post what you have so far, no matter how unfinished / deep in the trenches / a bunch of hacks it is? I.e. would you be able to:

Post a link to a script / notebook that does the deduplication; and
Post a link to some intermediate results that that script / notebook has generated?

pypt commented 3 years ago

in general, we choose the media source with the best content. it's often hard to figure that out. the dedup_topic_media script is my best attempt at trying to figure out when I can just automatically decide which media source to merge into the other.

I'd lean towards merging into a media source with the most "presentable" / canonical URL (domain), i.e. with

media_id,domain
1,subsection1.nytimes.com
2,subsection2.nytimes.com
3,www.nytimes.com
4,nytimes.com

we should make it easy for us and merge everything into 3,www.nytimes.com or 4,nytimes.com (even if that media ID has no stories / feeds and was empty before).

jtotoole commented 3 years ago

Notebook here. I'm still working on finding a way to run the dedupe_media function that doesn't take many hours, so I don't have a CSV to share at this point; any optimization advice you've got on that would be helpful.

rahulbot commented 3 years ago

Can you run with buckets of media_ids so that you can chunk it into jobs that parallelize?

pypt commented 3 years ago

<subjective_opinions>

I think it would be way easier if you loaded a simple CSV instead of reading data from PostgreSQL. Once loaded, you can speed up subsequent loads by storing the media table pickled, but if that ends up loading slowly, you can also look into shelve, another standard library which lets you use on-disk dictionaries as if they were residing in RAM. You can even store the input CSV in the GitHub repository itself - I think our media source list isn't confidential so we can just make that repo public together with the full list of media sources.

Same with pandas - why bother with it? You can have a list of dicts, or a dict of dicts (mapped by media_id, domain or something like that), a dict with media_id keys and dict values for actual rows complemented with a bunch of lists / sets / dicts / whatever that point to that reference media_id:

media = {
    # media_id
    1: {
        'media_id': 1,
        'name': '...',
    },
    2: {
        # ...
    },
}

# Set of media IDs to be deleted
media_to_be_deleted = {
    1,
    2,
    3,
}

# Map of what gets merged into what
media_to_be_merged = {
    # 2 gets merged into 1,
    2: 1,
    # ...
}

or some other native stuff which would be very fast and easy to use.

</subjective_opinions>

But again, how exactly do you come up with this merge list is up to you, I just thought I'd post some ideas.

Which part of that script is slow?

Only people on LinkedIn quote themselves, but I'll go ahead and do it:

Could you post what you have so far, no matter how unfinished / deep in the trenches / a bunch of hacks it is? I.e. would you be able to:

Post a link to a script / notebook that does the deduplication; and

Post a link to some intermediate results that that script / notebook has generated?

You've posted the notebook that you're working on, but we need the immediate results too. If the notebook is currently only able to filter out a bunch of spam domains and nothing else, that's great too, but still post the notebook that does nothing but spam domains + immediate media_id, name, url, decision CSV results here (and if the notebook does spam domain filtering and almost does other stuff but not quite yet, just comment out that other stuff for now, post the link to both the notebook that generates the decision CSV and the CSV itself, and continue working on that stuff while we all stare at the notebook and the CSV). We really need to iterate on this list together, and isolating yourself in the trenches is how this particular task gives people PTSD :)

jtotoole commented 3 years ago

Alrighty, I've managed to whittle it down to 118K sources among the 1.86M that have yet to be classified, and are therefore tagged "review." The others have all been tagged either as parent domains, children of parent domains, delete-able, or as blogging subdomains (and, consequently, new entries). You should be able to download the work-in-progress list here if you're signed into the Media Cloud Google org:

https://drive.google.com/file/d/1JYScr3b9AmEYLHH6x6RReVNLAeVrjnRf/view?usp=sharing

Notebook repo also updated. The part that was really slowing me down was taking the non-parent domains and searching among the parents to find their appropriate relations. I ended up using the pandasql package to do so via a LEFT JOIN on the parents and maybe-children as opposed to loops with hundreds of thousands of dict-search operations, which proved quite time-consuming.

@pypt, to your question, I originally loaded the media table into Postgres locally because it was handy for my workflow; I was exploring the data and querying it in Postico while I was iterating on the notebook. The multiplicity of data structures in the notebook maybe isn't the most comprehensible (postgres tables, DataFrames, pandasql queries, dict lists), but it helped to have options when thinking through the different operations required.

My next step is to spend more time Monday looking at the media sources tagged "review" and figuring out how to handle them. I'll update here with progress.

pypt commented 3 years ago

Can you make that Drive link shareable to "Anyone with a link"? Rahul doesn't have mediacloud.org account.

pypt commented 3 years ago

Google Sheets / LibreOffice / Excel all gave up on that CSV :)

Could you make it into a SQLite3 database and post it to the Google Drive? That way we could import it into something like https://sqliteonline.com/

Also, could you get rid of the following columns in the output:

that first unnamed column which probably is pandas's row number of something but it's confusing to have it near the media ID
full_text_rss
foreign_rss_links
content_delay
editor_notes
public_notes
is_monitored

jtotoole commented 3 years ago

Okay, I now have (at least a first pass at) all 1.86M rows in the media database flagged for either deletion, merger to a parent source, or no action (i.e. they're the parent source). Just shared Dropbox links with you guys for a CSV + sqlite dump—let me know if you have any trouble accessing. Some notes:

60,956 entries are flagged for deletion.
There are 1.54M unique media_uuids for the 1.86M media_ids in the dataset. That means roughly 260k duplicate sources to be merged.
You'll see in the latest version of the notebook that to decide what gets merged to what, I used the shortest url associated with a given media_uuid. That means, for example, that http://traveltips.usatoday.com/, http://www.usatoday.com/#spider, and http://onpolitics.usatoday.com/ are to be merged with http://usatoday.com.
I used the "shortest url" heuristic even for sources where Hal's old list had already made a determination. That was mainly to simplify things for this first pass, but it's something we can incorporate if we so choose. Hal told me that he just went with, "in general, the media source with the best content. it's often hard to figure that out... otherwise, you just have to stare at them and make a decision manually." Based on my spot-checking so far, using the shortest url for a media_uuid tracks with that, since what you end up getting is homepages as parents that subsume subdomains and section paths.

pypt commented 3 years ago

When you update it after some further changes, would you be able to both:

Could you re-include dup_media_id into the output?
Post the database itself and not the dump - SQLite3 databases are just simple files, that's one of their selling points, so one can just send a database to someone
Post it to Media Cloud's Google Drive with the "Anyone with a link" sharing permission - downloading the dump from Dropbox is a bit awkward (I had to sign in and such), plus if you delete it on your end, no one in the future would be able to access it

I've imported the dump into a database and uploaded it here, to the "Media deduplication immediate results" folder in the "Shared drives" / "Backend":

https://drive.google.com/drive/u/0/folders/1I77rqmbwtiqgoh3c7epFwt49WiTxwUtH

So then, as of Aug 24, 2021, the permalink to the script (notebook) that generates the results is:

https://github.com/mediacloud/media-dedupe-notebook/blob/560cfa63fdcd30b0f89e3654ef1f3a8fcd9dc31f/media_dedupe.ipynb

and the immediate results themselves are here:

https://drive.google.com/file/d/1V1WX3jZEvC0wGCgIfm4QP2DMBmYmB5Id/view?usp=sharing

A pretty decent app for browsing those SQLite3 databases is https://sqlitebrowser.org/dl/, e.g. it lets you search for arbitrary strings in all the columns, but there are many options out there.

Some observations:

Aside from URLs that are plain invalid (and that you have filtered out already (I think)), there are a lot of URLs that don't make sense, e.g. the ones with http://host:443/ - port 443 is for SSL (https://), and this URL basically says "connect to host's 443 (secure) port but try an unencrypted connection". I think the intention of whoever added those URLs here was to add https://host instead of http://host:443. Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with less hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.
Then maybe we should use this as an opportunity to update media.url (to be able to get rid of normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?

The "merge everything into the first media source (as sorted by media_id) with the representative-looking URL" didn't quite work for latimes.com:

select * from deduped_media where media_id in (6, 18444);

media_id | url | media_uuid | name | normalized_url | action | is_parent | parent_id | parent_url
-- | -- | -- | -- | -- | -- | -- | -- | --
6 | http://www.latimes.com/ | latimes.com | LA Times | http://latimes.com/ | merge(18444) | no | 18444 | http://www.latimes.com
18444 | http://www.latimes.com | latimes.com | latimes.com | http://latimes.com/ | skip | yes | 18444 | http://www.latimes.com

i.e. it considers the URL with media_id = 6 as non-"representative" (due to the slash suffix?) and decides to merge everything into media_id = 18444.

Be wary of potential duplicates of those media URLs with #spider fragment - seemingly they sometimes include a single slash for path (http://host/#spider) and other times they don't (http://host#spider)
- Where do these #spider media URLs come from? It looks like they get added in the topic mapper. Here it tries to find a URL that does not yet exist in the media table, and if it fails at doing that, it adds the #spider fragment to the URL:
https://github.com/mediacloud/backend/blob/019e608ffccc7a62ee458fdaa4e53d5ea98c8e21/apps/topics-base/src/python/topics_base/media.py#L195-L209

and it supposedly adds those media sources for stories discovered through the topic mapper:

https://github.com/mediacloud/backend/blob/019e608ffccc7a62ee458fdaa4e53d5ea98c8e21/apps/topics-base/src/python/topics_base/media.py#L232-L277

It does try to reuse an existing media source, but is quick to give up and just slap on a new one because:

https://github.com/mediacloud/backend/blob/019e608ffccc7a62ee458fdaa4e53d5ea98c8e21/apps/topics-base/src/python/topics_base/media.py#L251-L252

Terse!

We need to get rid of these #spider media sources before we start merging all of them because otherwise the topic mapper will just fill up the table with a bunch of new duplicates in a day or so. Could you email Hal and ask him what does that comment above mean?

In the notebook itself, you have a list of domains that "have no relation of media":

# sites that bear no relation to media
GARBAGE_DOMAINS = ['splashthat.com', 'lin.is', 'mail.ru', 'readthedocs.io', 'myshopify.com', 
               'hotels-pt.net', 'conferenceseries.com', 'hotelescotedazur.com', 'deviantart.com', 
               'top-hotels-gr.com', 'sourceforge.net', 'podbean.com', 'webstatsdomain.org', 'hotels-sicily.net',
               'hotels-istria.net', 'zoom.us', 'schoolloop.com', 'oursite.com', 'translate.goog', 'arcgis.com',
               'bandcamp.com', '', 'square.site', 'liveinternet.ru', 'netdna-cdn.com', 'lnk.to', 'convey.pro',
               'photobucket.com', 'netdna-ssl.com', 'sched.com', 'itch.io', 'apache.org', 'newhookupsite.biz'
               'as.me', 'freeadsaustralia.com', 'lpages.co', 'freeadsinuk.co.uk', 'app.link', 'custhelp.com',
               'memberclicks.net', 'webex.com', 'dropboxusercontent.com', 'debian.org', 'googleusercontent.com',
               'sourceforge.io', 'hyatt.com', 'marriot.com', 'eventbrite.com', 'seamlessdocs.com', 
               'ticketspice.com', 'us.com', 'newoffers.info', 'craigslist.org', 'craigslist.co.th', 
               'askdefine.com', 'webstatsdomain.org', 'translate.goog', 'qualtrics.com', 'zendesk.com',
               'business.site']

Unfortunately, counterintuitively, and (again) thanks to the topic mapper, the rows in media don't have to have much to do with news media. Topic mapper spiders the web looking for links from news articles to other news articles, and it is unable to determine whether the newly found URL (story) comes from a news media or not. Every story has to have a media source as per the schema design, so to accommodate a new story, it has to find / create a media source for every story that was found through spidering.

Therefore, GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites. Every other media "source" is legit there.

Do you look into feed URLs when looking for duplicate media sources? Here's an example: https://github.com/mediacloud/backend/issues/189. Is there a way to deduplicate those too?

Thanks for your work on this, really taking shape! I'll stare at it a bit more later.

jtotoole commented 3 years ago

Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with less hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.

Then maybe we should use this as an opportunity to update media.url (to be able to get rid of normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?

This bears discussion over Hangouts, but I think that instead of modifying the existing url values in the table, we should add a new media_uuid column to the schema and parse that value on the backend for incoming stories/sources (a la @rahulbot's canonical_domain(media_url:str) -> str suggestion) . In that case, tldextract would dedupe http://james.com, https://james.com, and http://james.com:443 into the single media_uuid james.com, and we wouldn't care about whether :443 is in a url.

In any case, this should be a separate task IMO, although one deployed before the deduping.

...i.e. it considers the URL with media_id = 6 as non-"representative" (due to the slash suffix?)

Correct, technically the notebook just looking for the shortest url string once www has been removed. I can add a step to also remove / from the shortest-url calculation.

Be wary of potential duplicates of those media URLs with #spider fragment...

Emailed Hal. I suspect the /#spider vs. spider issue is the result of the general fact that the presence of trailing slashes on urls is inconsistent—e.g., both http://latimes.com and http://latimes.com/ are in the media table. This is perhaps another argument for introducing media_uuid to the schema and using it as the distinguishing field for deduplication; we can then leave url, with all its goofiness, aside.

GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites...

We should talk in the Wednesday group meeting about what the heuristic for GARBAGE_DOMAIN status should be. For example, if we say a V*agra-shilling website falls into this category, what's the conceptual distinction between that and a site shilling discount hotel rooms, such as hotels-sicily.net, or a site built for RSVPs to corporate events? (Sidenote: there are ~8000 hotel booking-related sites in the media table) In the future, perhaps we have the data curator hire spot-check new media sources on a weekly basis or something.

Do you look into feed URLs when looking for duplicate media sources?

At present, no, but certainly those could be deduplicated in similar fashion. I think this also should be a separate task if that makes sense to you.

Thanks for this thoughtful guidance! Updated database + CSV export to come in Google Drive later today.

pypt commented 3 years ago

Could you add a step somewhere early in your notebook to rewrite all of these :443 URLs to plain https://host? That way we'll end up with less hosts to merge into, e.g. there won't have to be media_uuid entries with both host and host:443.

Then maybe we should use this as an opportunity to update media.url (to be able to get rid of normalized_url column later), within the same workflow that we'll later run? What do you think? Or should we do this as a separate task?

This bears discussion over Hangouts, but I think that instead of modifying the existing url values in the table, we should add a new media_uuid column to the schema and parse that value on the backend for incoming stories/sources (a la @rahulbot https://github.com/rahulbot's canonical_domain(media_url:str) -> str suggestion) . In that case, tldextract would dedupe http://james.com, https://james.com, and http://james.com:443 into the single media_uuid james.com, and we wouldn't care about whether :443 is in a url.

In any case, this should be a separate task IMO, although one deployed before the deduping.

Those would be two related columns that have different purpose:

media.url - representative URL that gets shown to the user and that someone can click on and go to that website;
media.media_uuid - unique identifier (almost always a hostname) used by us internally to prevent duplicates ending up in the media table.

So we'll definitely have to add the media_uuid column at some point later on, set media UUIDs for all existing media, and add some code somewhere to check for duplicates before a media source gets added, but given that you'll be normalizing urls anyway at this point for this specific task (coming up with a list of what should get merged into what), e.g. fixing the http://host:443 entries, I figured that we might as well update them in the media table.

But sure, we can leave them for later too.

GARBAGE_DOMAINS should only include domains that are truly just garbage, e.g. V*agra shilling websites...

We should talk in the Wednesday group meeting about what the heuristic for GARBAGE_DOMAIN status should be. For example, if we say a V*agra-shilling website falls into this category, what's the conceptual distinction between that and a site shilling discount hotel rooms, such as hotels-sicily.net, or a site built for RSVPs to corporate events? (Sidenote: there are ~8000 hotel booking-related sites in the media table) In the future, perhaps we have the data curator hire spot-check new media sources on a weekly basis or something.

This is unfortunately how this is currently architected - all stories need a parent media ID, and topic mapper ends up with stories with no parent media source - and rearchitecting it for it to be some other way is six months of work. It's not a superb architecture but this is how things currently work, and it would be super hard to change that.

I think we don't particularly care that much about the Xanax websites in the media table - if it's easy to spot (which it is), and easy to delete (we'll have to have a media source removal workflow step in any case), and it's absolutely positively useless to us (e.g. it's a spam website), then we should get rid of it, otherwise leave them be - they're just rows in a table. Hotel media sources and such are potentially useful - what if a bunch of news websites link to a specific funny / problematic Airbnb ad? :) (They never do, but that's at least the idea behind the topic mapper.)

What is an actual pain point for users (and the crawler), and the reason why we do this deduplication in the first place, is that there are X copies of nytimes.com, Y copies of latimes.com and Z copies of usatoday.com - actual media sources with actual news that belong to collections.

Do you look into feed URLs when looking for duplicate media sources?

At present, no, but certainly those could be deduplicated in similar fashion. I think this also should be a separate task if that makes sense to you.

You'll have to look into feed URLs as part of this task too unfortunately as they cause problems too - in the linked task, we have something like 20 copies of pamplinmedia.com which is a legit news website, therefore crawler hits that website a gazillion times a day so they got us blocked.

Thanks for this thoughtful guidance! Updated database + CSV export to come in Google Drive later today.

No need for the CSV, just the pre-imported database + permalink to the current version of the notebook!

jtotoole commented 3 years ago

I don't have permission to upload to that Google Drive folder at present, but here's the latest deduping:

https://drive.google.com/file/d/1BYePH2w6RhXHPU-eivYF1ZJDhlFBAFLB/view?usp=sharing

Notebook permalink here:

https://github.com/mediacloud/media-dedupe-notebook/blob/master/media_dedupe.ipynb

I see what you mean now about the URLs—in all this database spelunking I'd forgotten about how this stuff actually gets displayed to users 😛 . I've added in the notebook a step to address the :443 issue.

Re: GARBAGE_DOMAINS, I still don't really get the distinction; you could have a spam site listing discount hotel rooms, or conversely you could have blogs linking to a particularly comical Xanax broker. In any case, I think it would be valuable to hear from the researchers on this.

Here's from Hal on the #spider issue:

the topic spider has to autogenerate media sources for stories for which it cannot find a matching media source (it tries to find a matching media source based on the hostname in the url of a story that itself did not match an existing story in the database by url).

for some reason that is long lost to the fog of war, it used to be possible for some reason for the topic spider to try to create a media source with a url that already existed in the database. my infinitely wise answer was just to stick '#spider' on the end of the spider generated media source url, so you could end up with a 'http://nytimes.com/' media source and a 'http://nytimes.com/#spider' media source. I think one issue was that some media source are intentionally ignored by the topic spider for the purposes of discovery (such as media sources with foreign_rss_links set to true, which indicates that the media source should be ignored because it includes stories with urls that belong to foreign domains).

in general, these #spider and non-#spider media sources should be merged, but there's probably not a hard rule about which one you should use as the master.

Fortunately I think our shortest-url approach should work well enough as such a hard rule.

pypt commented 3 years ago

I don't have permission to upload to that Google Drive folder at present, but here's the latest deduping:

https://drive.google.com/file/d/1BYePH2w6RhXHPU-eivYF1ZJDhlFBAFLB/view?usp=sharing

My bad, could you try again?

https://drive.google.com/drive/u/0/folders/1I77rqmbwtiqgoh3c7epFwt49WiTxwUtH

(I've uploaded your most recent SQLite3 database to it already.)

Also, could you add a "-yyyymmdd" suffix to the database filenames?

I haven't looked at the actual database yet, I'll do it a bit later.

Notebook permalink here:

https://github.com/mediacloud/media-dedupe-notebook/blob/master/media_dedupe.ipynb

This link points to the latest version of the file (notebook) in the "master" branch, so when you'll update the notebook, this link will no longer point to the previous version of it. We'll need that to be able to do comparisons between different iterations of the notebook.

To get a permalink, one can click on three dots at the top right of the GitHub page and select "Copy permalink".

I see what you mean now about the URLs—in all this database spelunking I'd

forgotten about how this stuff actually gets displayed to users 😛 . I've added in the notebook a step to address the :443 issue.

Thanks James!

Re: GARBAGE_DOMAINS, I still don't really get the distinction; you could have a spam site listing discount hotel rooms, or conversely you could have blogs linking to a particularly comical Xanax broker. In any case, I think it would be valuable to hear from the researchers on this.

Let me know if it's still unclear to you and I'll try to rephrase it.

Here's from Hal on the #spider issue:

the topic spider has to autogenerate media sources for stories for which it cannot find a matching media source (it tries to find a matching media source based on the hostname in the url of a story that itself did not match an existing story in the database by url).

for some reason that is long lost to the fog of war, it used to be possible for some reason for the topic spider to try to create a media source with a url that already existed in the database. my infinitely wise answer was just to stick '#spider' on the end of the spider generated media source url, so you could end up with a 'http://nytimes.com/' media source and a 'http://nytimes.com/#spider' media source. I think one issue was that some media source are intentionally ignored by the topic spider for the purposes of discovery (such as media sources with foreign_rss_links set to true, which indicates that the media source should be ignored because it includes stories with urls that belong to foreign domains).

in general, these #spider and non-#spider media sources should be merged, but there's probably not a hard rule about which one you should use as the master.

This is one of the reasons why this task resembles 4chan's "1000 US Marines vs a stick" problem:

2dd

How this could be a problem is that if we merge medium X (url = ' http://host/#spider' and foreign_rss_links = 'f') into medium Y (url = ' http://host/' and foreign_rss_links = 't') and make the topic mapper no longer create the #spider media sources, it might break the topic spider's discovery (I wonder what that means).

Could you re-add the "foreign_rss_links" column to your output so that we'll get to see how many of media sources (especially the ones to be merged) will have it set to true? I'll try to find out how and why media with foreign_rss_links is ignored.

pypt commented 3 years ago

GitHub stripped my meme:

2dd

jtotoole commented 3 years ago

Dumped the latest db version to Google Drive, notebook permalink here: https://github.com/mediacloud/media-dedupe-notebook/blob/38f77c06c9087780f27ab6ea3a7a2b3f4f79216d/media_dedupe.ipynb

Looks like 1830 urls in the table have foreign_rss_links; of those 108 are to be merged with parent sources

pypt commented 3 years ago

Thank you again James! Give me a day or two to stare at them for a little bit and get my shower thoughts together.

On Thu, Aug 26, 2021 at 12:12 AM jtotoole @.***> wrote:

Dumped the latest db version to Google Drive, notebook permalink here: https://github.com/mediacloud/media-dedupe-notebook/blob/38f77c06c9087780f27ab6ea3a7a2b3f4f79216d/media_dedupe.ipynb

Looks like 1830 urls in the table have foreign_rss_links; of those 108 are to be merged with parent sources

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mediacloud/backend/issues/799#issuecomment-905878535, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABMNPNSGDXSBHOZGBQVMQDT6VMCZANCNFSM5BAYFTXA .

-- Linas Valiukas www.mediacloud.org

jtotoole commented 3 years ago

Sure thing, we can huddle about it during our 1:1

pypt commented 3 years ago

Meanwhile could you look at the "feeds" table (if you haven't already) - read through the linked issue, using just the "feeds" table try to find media_ids that have at least one identical feed URL between them, and post that list somewhere (a new table in the SQLite3 database?).

On Thu, Aug 26, 2021, 19:26 jtotoole @.***> wrote:

Sure thing, we can huddle about it during our 1:1

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mediacloud/backend/issues/799#issuecomment-906559877, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABMNPM5KOQYGXMG2EFHSPLT6ZTKHANCNFSM5BAYFTXA .

jtotoole commented 3 years ago

I did some more tinkering to identify the duplicate sources hidden within the feeds table:

https://github.com/mediacloud/media-dedupe-notebook/blob/4ba3525f86804dbbe44f569f3a7cd07625e5b9e5/feeds_dedupe.ipynb

There we find 3,635 media_ids that can be deduped to 1,353. I added a new table to the database (updated in Google Drive) where one can browse these records; it has the unwieldy name feeds_with_multiple_parent_media.

The tricky part here is that I'm not sure there's an automate-able heuristic we can use to identify the logical parent in these groups. For example, newstalk770.com and cknw.com share the feed URL https://globalnews.ca/feed/; it's only by actually visiting the sites that you can make a judgment about which one should be preeminent (in that case the first 404s and the second redirects to a globalnews.ca relative path).

Additionally, in some of these cases we don't actually want to do the deduplication. For instance, a group of 17 regional newspapers owned by the same conglomerate have a common feed URL https://www.digitalfirstmedia.com/feed/, but the content on their sites (e.g. sentinelandenterprise.com, lowellsun.com) is different. That's because while they share that particular feed url, they also have separate, per-publication feed URLs (e.g. sentinelandenterprise.com/feed, lowellsun.com/feed), the media/feeds relationship being one-to-many.

My suggestion is that, given that this is a relatively small number of additional duplicates ( < 1% of the total discovered via the media table) that we leave these aside for now, and later I or the incoming data curator can go through the list manually to make merge decisions.

pypt commented 3 years ago

I added a new table to the database (updated in Google Drive) where one can browse these records; it has the unwieldy name feeds_with_multiple_parent_media.

Can you now and in the future post an URL to the specific database file that you're referring to?

# it turns out there are lots of disparate sources whose feed URL points to rdfs.org/xmlns.com/w3.org, # which host sample specs for such feeds; also lots of sites with feed URLs pointing to wordpress template feeds # side note: maybe we should delete these sources from the media table?

What's a "wordpress template feed"? Could you give an example?

Do those feeds point to "topic" media sources (the ones with #spider)?

Dunno about deleting them, I think deduplicating the useful ones (i.e. the ones with working feeds and / or in collections) is more important, and these ones are just rows in a table.

In [157]: feeds_with_multiple_parent_media_ids Out[157]: [{'http://www.kp.ru/rss/allsections.xml': [1760, 47226, 561409]}, {'http://www.mk.ru/rss/index.xml': [1775, 385506]}, <...> {'http://rss.cbc.ca/lineup/canada-ottawa.xml': [7333, 83917]}, ...]

What's what in this output? Is there a way to present it more nicely, e.g. a table?

It's also truncated at the end. If it's too many rows for the notebook to handle, could you "publish" it as a SQLite3 table?

The tricky part here is that I'm not sure there's an automate-able heuristic we can use to identify the logical parent in these groups. For example, newstalk770.com and cknw.com share the feed URL https://globalnews.ca/feed/; it's only by actually visiting the sites that you can make a judgment about which one should be preeminent (in that case the first 404s and the second redirects to a globalnews.ca relative path).

Don't we have a media source for globalnews.ca (if we don't, that's not great)?

How many of those un-automatable potential merges do we have?

Additionally, in some of these cases we don't actually want to do the deduplication. For instance, a group of 17 regional newspapers owned by the same conglomerate have a common feed URL https://www.digitalfirstmedia.com/feed/, but the content on their sites (e.g. sentinelandenterprise.com, lowellsun.com) is different. That's because while they share that particular feed url, they also have separate, per-publication feed URLs (e.g. sentinelandenterprise.com/feed, lowellsun.com/feed), the media/feeds relationship being one-to-many.

Thank you for spotting this. We still have to get rid of that shared feed somehow. What would be your proposed heuristics for that?

My suggestion is that, given that this is a relatively small number of additional duplicates ( < 1% of the total discovered via the media table) that we leave these aside for now, and later I or the incoming data curator can go through the list manually to make merge decisions.

Sorry, but no:

this is still the same task - deduplicating media sources (not just rows in the media table) - even though you seem to disagree
those duplicate feeds do cause problems with the crawler (as detailed in #189)
a lot of duplicates that one can derive from media alone aren't that impactful, i.e. it's technically great that we merge longdeaddomain.com and longdeaddomain.com/#spider together, but the ones that do have feeds from which we actually fetch content are the most important, and we really want to get those in order if we can

jtotoole commented 3 years ago

https://drive.google.com/file/d/1PQBTXHGzHjcvVr-A2R-4AeblTNekrdYx/view?usp=sharing

^That latest SQLite database has the table feeds_with_multiple_parent_media, meaning feed urls that are associated with at least two different media sources flagged as parents based on the deduping of the media table. There's one entry per feed_id, even though there are many duplicate feed urls in the table.

Here's an example of what I mean by "WordPress template" feed:

You can see that full list of results by running the query

SELECT m.url AS media_url, f.feeds_id, f.media_id, f.url AS feed_url, f.occurences, f.type, f.active 
FROM duplicate_feeds f 
JOIN deduped_media m
ON m.media_id = f.media_id
WHERE f.url = "https://wordpress.com/blog/feed/"

There are some #spider media URLs in that result set, but most are not.

Yes, globalnews.ca also has an entry in the media table. It looks like at some point the URLs for their local affiliates (e.g. cknw.com) started redirecting to globalnews.ca section pages (https://globalnews.ca/radio/cknw). The old affiliate URLs remain in the media table, however.

Based on my investigation there are 3,635 parent media sources associated with 1,353 shared feed URLs (so media sources A, B, and C share the feed URL foo.com, D and E share bar.com, etc.). As with the case of Digital First Media, the conclusion upon examination of some groups will be that no media merging needs to happen, but that a non-substantive shared feed (https://www.digitalfirstmedia.com/feed/) should be deleted. I'm not sure of a way to make decisions for each of those 1,353 groups besides manually checking each of the media URLs and feed URLs, but of course there may be a method I haven't thought of. Unless you've got another idea I'll find a two-hour trance mix and start going through that list.

pypt commented 3 years ago

Thanks James, give me a day to stare at it for a little while.

On Tue, Aug 31, 2021 at 11:14 PM jtotoole @.***> wrote:

https://drive.google.com/file/d/1PQBTXHGzHjcvVr-A2R-4AeblTNekrdYx/view?usp=sharing

^That latest SQLite database has the table feeds_with_multiple_parent_media, meaning feed urls that are associated with at least two different media sources flagged as parents based on the deduping of the media table. There's one entry per feed_id, even though there are many duplicate feed urls in the table.

Here's an example of what I mean by "WordPress template" feed:

[image: Screen Shot 2021-08-31 at 3 45 57 PM] https://user-images.githubusercontent.com/6501236/131566269-c9fd33a1-80e1-4649-b864-2970274b0888.png

You can see that full list of results by running the query

SELECT m.url AS media_url, f.feeds_id, f.media_id, f.url AS feed_url, f.occurences, f.type, f.active FROM duplicate_feeds f JOIN deduped_media m ON m.media_id = f.media_id WHERE f.url = "https://wordpress.com/blog/feed/"

There are some #spider media URLs in that result set, but most are not.

Yes, globalnews.ca also has an entry in the media table. It looks like at some point the URLs for their local affiliates (e.g. cknw.com) started redirecting to globalnews.ca section pages ( https://globalnews.ca/radio/cknw). The old affiliate URLs remain in the media table, however.

Based on my investigation there are 3,635 parent media sources associated with 1,353 shared feed URLs (so media sources A, B, and C share the feed URL foo.com, D and E share bar.com, etc.). As with the case of Digital First Media, the conclusion upon examination of some groups will be that no media merging needs to happen, but that a non-substantive shared feed ( https://www.digitalfirstmedia.com/feed/) should be deleted. I'm not sure of a way to make decisions for each of those 1,353 groups besides manually checking each of the media URLs and feed URLs, but of course there may be a method I haven't thought of. Unless you've got another idea I'll find a two-hour trance mix and start going through that list.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mediacloud/backend/issues/799#issuecomment-909597005, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABMNPOJI6QFZN7GQVU4X43T7UZ4DANCNFSM5BAYFTXA .

-- Linas Valiukas www.mediacloud.org

jtotoole commented 3 years ago

👍

jtotoole commented 3 years ago

Updated database with a top_sources table comprising the ~15k sources from our most important collections (per Emily):

https://drive.google.com/file/d/1lO8BaSyylbpsYDY57KFDw7tIZDFXEFNa/view?usp=sharing

pypt commented 3 years ago

James, could you address the Emily's notes? I've sent a reply with my proposed solutions.

Have you decided on a course of action regarding the duplicate feeds or do you want me to take a look?

jtotoole commented 3 years ago

I can address Emily's notes—I'll post an updated permalink + sqlite db with the changes reflected. In the meantime, I've been going through the duplicate feeds and associated sites manually to make merge determinations. I've spent a lot of time poking at it and haven't come up with a way to automate. Imagine the case of foo.com and bar.com sharing the RSS feed baz.com/rss: in some instances foo.com is defunct and redirects to bar.com, in some cases the other way round, and in some cases both foo.com and bar.com are valid sites associated with the nonfunctional feed baz.com/rss in addition to their own distinct feeds. This is the situation we face. I've got a spreadsheet with merge decisions and delete-this-feed decisions here:

https://docs.google.com/spreadsheets/d/179zq_SfTTaLoQ8abIvgt4nv2pOt7Yq40IsrA2XIRC4M/edit?usp=sharing

It's very tedious work, but I'm through about 1,000 of the 2723 sites that need to be reviewed, so hopefully I'll be done by the end of the week.

jtotoole commented 3 years ago

Okay, I think we may be ready for phase 2. I've incorporated the feedback from Emily+Linas and added a couple more clean-up steps of my own in the notebook:

https://github.com/mediacloud/media-dedupe-notebook/blob/ad836184f57ae6429d11b5301e42f6fed783095e/media_dedupe.ipynb

New sqlite database here: https://drive.google.com/file/d/1lO8BaSyylbpsYDY57KFDw7tIZDFXEFNa/view?usp=sharing

Feeds-based deduplication CSV here (to be run after the initial deduplication based on the media table): ~~https://github.com/mediacloud/media-dedupe-notebook/blob/ad836184f57ae6429d11b5301e42f6fed783095e/media_dedupe.ipynb~~

https://drive.google.com/file/d/1Y4UBDbZVcV3_Xmz1MOHnZpnb2F8tfGAN/view?usp=sharing

pypt commented 3 years ago

I think the link to the feeds CSV is wrong, can't find it in the repo at least.

Could you merge the decision lists on duplicate media + feeds into one? Those two lists are here to serve the same purpose, and merging / deleting feeds is a subset of merging / deleting media sources, so we should do it in one go.

jtotoole commented 3 years ago

I think the link to the feeds CSV is wrong

I'm so dumb, sorry—corrected above.

Here's a final media merge list, comprising duplicates found in both feeds and media:

https://drive.google.com/file/d/1LKjyOWiCTsrstNADMwQmwUmVm8Sw4hRR/view?usp=sharing

Feeds to be deleted here:

https://drive.google.com/file/d/1KvKqkNlbXJWtgj5_X-3RkJkKWLvT6tfp/view?usp=sharing

pypt commented 3 years ago

Could you please:

Import those CSVs into a single SQLite3 database just like you did with the rest of the huge datasets? That 100+ MB CSV file is quite unusable to me right now.
Add a date suffix to all of the newly added files just like we did with all the other files? While at that, could you use the YYYYMMDD prefix instead of MMDDYYYY? It's consistent with how the rest of the files are named, and YYYYMMDD just gets things sorted more nicely.
Instead of deleting feeds, you probably want to "merge" a majority (all?) of them into other feeds, i.e. deleting that one row from the "feeds" table is easy but there could be stories associated with the feed that gets deleted so you want to move those stories so some other feed first.

Lastly, I still think that we need to work out a way to merge the media source decision list and feed decision list into one to make it into a single "list of actions that have to be done to get everything deduplicated nicely" list that would be easy to act upon in the so-called stage two of the project.

To do that, we have to "sneak peek" into what doing the actual media / feed merging will entail in a technical sense.

For example, the workflow to merge one feed (src_feeds_id) into another (dst_feeds_id) will look as follows:

Move all rows that reference the feeds table with <src_feeds_id>:
1. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
2. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
3. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
4. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
5. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
Remove the row with feeds_id = <src_feeds_id> from the feeds table:
1. Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
2. Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
3. <...>
4. Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
5. Remove the actual row from feeds.

And the workflow to merge one media source (src_media_id) into another (dst_media_id) will look something like this:

Drop (or at least set to NULL) media.dup_media_id column as it's not useful anymore
Move all rows that reference the media table with <src_media_id>:
1. Set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the media_rescraping table, taking into account that there could be duplicates
2. Set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the media_stats table, taking into account that there could be duplicates
3. Set media_id = <dst_media_id> on rows with media_id = <src_media_id> in the feeds table, taking into account that there could be duplicates; those duplicate feeds need to be merged and not simply deleted because there could be stories and such associated with those feeds; so essentially at his point we have to - again - merge a bunch of <src_feeds_id> into <dst_feeds_id> for every feed in <src_media_id>:
  1. Move all rows that reference the feeds table with <src_feeds_id>:
    1. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
    2. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
    3. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
    4. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
    5. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
  2. Remove the row with feeds_id = <src_feeds_id> from the feeds table.
    1. Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
    2. Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
    3. <...>
    4. Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
    5. Remove the actual row from feeds.
4. (do media_id merging magic to a bunch of other tables)
5. (do some more media_id magic to Solr as media_ids get stored there too)
Remove the row with media_id = <src_media_id> from the media table:
1. Remove rows with media_id = <src_media_id> from the media_rescraping table - there shouldn't be any left as we've just merged them
2. <...>

See the overlap? Merging one feed into another is an integral part of the media merging, just like deleting the media source is, so we can use the same exact workflow to merge / delete both media sources and the individual feeds. The only issue is to think of a convenient way to define inputs for such a workflow.

As always, let me know if you have any questions or need clarification.

jtotoole commented 3 years ago

Latest DB here:

https://drive.google.com/file/d/1miYPjEGS7DNBHLsoj6oJBk0-sW3fxO0a/view?usp=sharing

The two tables relevant for the merging are deduped_media_final, which contains action steps for media merging, and feed_actions, which lists feeds that are to be merged with other feeds or, if the column value parent_feed_id is NULL, should be deleted. If we decide we actually don't want to delete any feeds, we can ignore those rows. Per our discussion on Friday, I decided to leave aside the task of deleting spam media sources.

pypt commented 3 years ago

Some weirdness here and there:

deduped_media_final:
- Get rid of duplicates.
  
  For example, table has a bunch of duplicates with different decisions about them, e.g. there are two entries with media_id = 43 (media_uuid = 43, media_url = 'http://newsok.com/') - the decision about one of them is to skip it, the other is destined to be merged into media source with media_id = 279980.
- Make sure media sources with a decision to merge(...) point only to target media sources with action being skip.
  
  Given that we'll do the media merging in parallel, we can't decide to merge media source A into media source B and then media source B into media source C, or media source A into media source B and then delete media source B, as we don't know in which order the media sources will be merged.
  
  For example:
  - Media source with media_id = 0 (media_uuid = NULL, media_url = 'http://dev.null') is set to be merged into media source with media_id = 21097 (media_uuid = NULL, media_url = 'http://http:'), however media source with media_id = 21097 is set to be deleted. If one workflow will attempt to merge media source A into media source B and then another workflow will be busy deleting media source B, we might run into operational issues. Could you look and address cases like that?
  - A bunch of newsok.com duplicates (i.e. media sources with media_id IN (20653, 164137, 174357, 370924, 489444, 645972, 888425, 1438178)) are set to be merged into media source with media_id = 43 (media_uuid = 'newsok.com', media_url = 'http://newsok.com/'), but this media source itself is set to be merged into media source with media_id = 279980. Similarly to the issue with media_id = 0, this would force us do the merging and deleting in a particular order which we won't.
- Address the what seems to be cruft catch-all media source with media_id = 21097.
  
  Weirdly enough, quite a few media sources (2670 to be exact) have a decision to be merged into media source with media_id = 21097 (media_uuid = NULL, media_url = 'http://http:') about them; one can find them for searching for merge(21097) in the action column:
  
  Most of those media sources have invalid URLs, e.g.:
  - http://81.79/
  - http://cran.r-projectorg/
  - http://lizard.home.inr.net)./
  - http://google.comdirgharajprasai/
  - arles-munger-gives-nearly-2-million-to-influence-assembly-races.html#spider
  - view-source:https://www.univision.com/dallas/kuvn
  but some of these URLs are valid too, e.g.:
  - http://gov.mo/
  - http://kharkov.ua/
  - http://pi.gov.br/
  - http://gov.bn/
  - http://64.233.169.104/#spider
  (If Brunei's government websites have domains of DEPARTMENT.gov.bn, e.g. http://moh.gov.bn/, then it's totally valid for Brunei to have a http://gov.bn/ website too.)
  
  I'm guessing that you need to improve your URL validator somewhere. The valid URLs from the list need to stay (skip or merge(...)), and the invalid ones should be deleted (or skipped).
  
  (It's not the same as deleting spam media sources as one has to judge what's spam and what's not; here you can just automatically pick invalid URLs for deletion. However, not deleting those cruft media sources is an option too, you can just skip them, but in any case they shouldn't get merged into 21097.)
- stuff.co.nz from Emily's email:
  
  Stuff.co.nz -- shouldn't be grouped, different publications. We ran into this on a paper involving New Zealand media a couple of years ago.
  
  James, could you add an exception for those? https://www.stuff.co.nz/manawatu-standard/ and https://www.stuff.co.nz/marlborough-express/ do in fact seem to be two different media sources :)
- feedburner.com from Emily's email:
  feeds.feedburner.com -- looks like an amalgamation of a lot of unrelated things, including some of our Robert Wood Johnson work?
  
  Interesting catch! feedburner.com is an external RSS service, those media.url fields point to RSS feeds (e.g. https://feeds.feedburner.com/birthmoviesdeath) instead of the media source URL (e.g. https://birthmoviesdeath.com/) as they should.
  
  James, could you either:
  - For every feeds.feedburner.com URL, figure out which media URL does it point to (e.g. https://birthmoviesdeath.com/ in the example above) and add a fourth option in the "action" to simply update the URL? We could use this action to update a bunch of other URLs too, e.g. fix the "http://example.com:443" entries that we have discussed before; OR
  - Add another exception to your notebook to skip feeds.feedburner.com URLs and leave this problem for future generations? :)
- bizjournals.com from Emily's email:
  
  Bizjournals.com -- shouldn't be grouped, different publications
  
  James, could you add an exception for those?
- buzzsprout.com from Emily's email:
  
  buzzsprout.com -- looks like an aggregator of different sites
  
  James, could you add an exception?
feed_actions:
- What's the reason for deleting some (177) feeds again? Deleting them will lead to stories from those feeds getting deleted too. For example, if a feed like http://mont.ro/alt/rss/index_rss.xml existed in the past, we could have gotten some stories from it. If we were to delete it, we'd have to delete referencing rows from feeds_stories_map table, and thus from stories table too. In the feeds table, just like in media, we should delete either 1) broken URLs that would never have worked, or just 2) nothing.

DB Browser for SQLite makes it really easy to search for arbitrary strings in the SQLite3 database:

Screenshot from 2021-10-06 17-47-12

jtotoole commented 3 years ago

Thanks for the close look!

For example, table has a bunch of duplicates with different decisions about them, e.g. there are two entries with media_id = 43 (media_uuid = 43, media_url = 'http://newsok.com/') - the decision about one of them is to skip it, the other is destined to be merged into media source with media_id = 279980.

Fixed now

Address the what seems to be cruft catch-all media source with media_id = 21097.

Adding a manual step to make sure we skip all these.

buzzsprout, bizjournals, stuff.co.nz, feedburner

Handling these appropriately now.

What's the reason for deleting some (177) feeds again?

We decided not to delete them—refer to table feed_actions for a list just of feeds to be merged into other feeds.

Here's the latest iteration of the database:

https://drive.google.com/file/d/1Z_ACwR8GeA_RDwKJjV8irAPTw9Io-rhP/view?usp=sharing

CSV with actions for the 15k top sources (also emailed to Emily + Fernando):

https://drive.google.com/file/d/1QH1DrIFrTW4fBNjlqTaeLtbXDPBtoLTz/view?usp=sharing

Permalink for latest version of the notebook: https://github.com/mediacloud/media-dedupe-notebook/blob/6e331c0c432ae2bedbbed398ea298572a47b726b/media_dedupe.ipynb

pypt commented 3 years ago

What happened to pamplinmedia.com feeds from #189? For example, feed URL http://www.pamplinmedia.com/component/obrss/portland-tribune belongs to a few sources:

select media_id
from feeds
where url = 'http://www.pamplinmedia.com/component/obrss/portland-tribune'

media_id
--
1391
1392
1394
1396
1397
1399
1400
1401

which were being merged into each other in 20210926 database (at least that was one of the duplicate actions):

select *
from deduped_media_final
where media_id in (1391, 1392, 1394, 1396, 1397, 1399, 1400, 1401)

media_id | media_uuid | media_url | action
-- | -- | -- | --
1391 | beavertonvalleytimes.com | http://beavertonvalleytimes.com/news/index.php | merge(86434)
1391 | beavertonvalleytimes.com | http://beavertonvalleytimes.com/news/index.php | skip
1392 | canbyherald.com | http://canbyherald.com/ | merge(86434)
1392 | canbyherald.com | http://canbyherald.com/ | skip
1394 | estacadanews.com | http://estacadanews.com/news/index.php | merge(86434)
1394 | estacadanews.com | http://estacadanews.com/news/index.php | skip
1396 | lakeoswegoreview.com | http://lakeoswegoreview.com/news/index.php | merge(86434)
1396 | lakeoswegoreview.com | http://lakeoswegoreview.com/news/index.php | skip
1397 | fgnewstimes.com | http://fgnewstimes.com/news/index.php | merge(86434)
1397 | fgnewstimes.com | http://fgnewstimes.com/news/index.php | skip
1399 | sandypost.com | http://sandypost.com/news/index.php | merge(86434)
1399 | sandypost.com | http://sandypost.com/news/index.php | skip
1400 | tigardtimes.com | http://tigardtimes.com/news/index.php | merge(86434)
1400 | tigardtimes.com | http://tigardtimes.com/news/index.php | skip
1401 | westlinntidings.com | http://westlinntidings.com/news/index.php | merge(86434)
1401 | westlinntidings.com | http://westlinntidings.com/news/index.php | skip

but in the 20211008 database they're all being skipped.

(The 20210926 had a deduped_media_final table and 20211008 doesn't. If you're doing some manual tailoring or something for the "final" table, it has to be reflected in the notebook as well.)

jtotoole commented 3 years ago

What happened to pamplinmedia.com feeds?

Ugh sorry, in this most recent effort I forgot to pull in the media duplicates found via the feeds table to override the initial decisions based just on the media table. I've rectified that now by updating the notebook to read the manually created CSV I created manually which identifies duplicate media sources based on data from the feeds table. That manually created CSV has been committed to the media-dedupe-notebook repo, as has the CSV listing feed merges (which appears in the sqlite db as the feed_actions table).

If you're doing some manual tailoring or something for the "final" table, it has to be reflected in the notebook as well.

Sorry, you are correct; added those steps to the notebook.

Latest db: https://drive.google.com/file/d/16mKG-g-YJyAX-iArXrEr13OBZL65KHU1/view?usp=sharing

Notebook permalink: https://github.com/mediacloud/media-dedupe-notebook/blob/b5bda58229712fb215fc3cf116caa1c48d30007e/media_dedupe.ipynb

jtotoole commented 3 years ago

Latest DB file: https://drive.google.com/file/d/1AtHMp9kNlSavD6tOwHCzeyfSZ9jlDZHE/view?usp=sharing

Notebook permalink: https://github.com/mediacloud/media-dedupe-notebook/blob/68e84b8072d0ac93aef4d157e81fa2faa5a68cbd/media_dedupe.ipynb

Main feedback from Emily+Ryan+Fernando related mainly to the names of sources; no indication of sources being incorrectly merged.

pypt commented 3 years ago

Where is dup_feed_urls_merges.csv coming from in feeds_dedupe.ipynb? Is it the same file as media_dups_from_feeds.csv?

(If by any chance you no longer find running the Python notebook convenient and instead went for having a separate set of scripts for generating the actual merge list, then consider getting rid of the notebook in the repo and commit the Python scripts that you're actually using instead. Just guessing here.)

Also, can those xes (of different case - x and X) in delete_feed be booleans (SQLite3 kinda supports those)?

Lastly, how do I see which feed gets merged into what? parent_feed_id is missing from feed_actions table.

jtotoole commented 3 years ago

Where is dup_feed_urls_merges.csv coming from in feeds_dedupe.ipynb? Is it the same file as media_dups_from_feeds.csv?

Sorry, yes—so what's happening there is that feeds_dedupe.ipynb generates the file dup_feed_urls.csv, which comprises media sources that share a common feed URL but haven't been flagged for merging via the media table. From there, I used that CSV for the manual process of identifying duplicate media sources based on share feed URLs. That manually created file is dup_feed_urls_merges.csv, which I then pulled into media_dedupe.ipynb as media_dups_from_feeds.csv (brain fart in not making those filenames the same; I've updated the feeds_dedupe notebook to fix). feed_actions.csv similarly comes via the manual process of identifying feeds that should be merged despite the media sources sharing the feed URL not needing to be merged.

Also, can those xes (of different case - x and X) in delete_feed be booleans?

Given that we decided to skip feed deletion for the time being, I just removed entries from the table.

How do I see which feed gets merged into what? parent_feed_id is missing from feed_actions table.

Updated that table.

Latest DB here: https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing

Permalink to latest iteration of media_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/media_dedupe.ipynb

Permalink to latest iteration of feeds_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/feeds_dedupe.ipynb

pypt commented 3 years ago

Looks good to me now!

So, the final (hopefully) list is:

Latest DB here: https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing Permalink to latest iteration of media_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/media_dedupe.ipynb Permalink to latest iteration of feeds_dedupe.ipynb: https://github.com/mediacloud/media-dedupe-notebook/blob/1c4d42e7832571c5bff97ecbf2056df5324f8ef9/feeds_dedupe.ipynb

Thanks James!