mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Clean up Epoch D URLs submitted with Epoch B page contents #321

Closed · philbudne closed this 3 months ago

philbudne commented 3 months ago

BLEH! My over-the-weekend fix to hist-fetcher's epoch handling, made so that Dec 26, 2021 (the dawn of Epoch D) could be processed, removed the exception-raising that quarantined the Stories. The result is that for about 50K stories, the URL from Epoch D was simply submitted along with page contents from Epoch B (earlier in 2021). I have the offending downloads_id numbers; from those we can get the URLs from the CSV file. BUT I don't think just deleting the url/story from the index is the right thing, because if the URL was a dup, the new story (with the wrong text) would have been rejected!!!
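For illustration, here is a hypothetical reconstruction of the kind of check that got relaxed; the real hist-fetcher logic differs, and all names here (QuarantineException, EPOCH_D_START, check_epoch) are invented:

```python
# Hypothetical sketch of the removed safety check; NOT the actual
# hist-fetcher code. All names here are invented for illustration.
from datetime import date

EPOCH_D_START = date(2021, 12, 26)  # dawn of Epoch D


class QuarantineException(Exception):
    """Raised to quarantine a Story instead of indexing it."""


def check_epoch(fetch_date: date, content_epoch: str) -> None:
    # Before the weekend fix, a mismatch between the epoch implied by
    # the fetch date and the epoch of the page contents raised an
    # exception, quarantining the Story. With the raise removed, Epoch D
    # URLs were indexed with Epoch B page contents.
    expected = "D" if fetch_date >= EPOCH_D_START else "B"
    if content_epoch != expected:
        raise QuarantineException(
            f"epoch mismatch: contents from epoch {content_epoch}, "
            f"expected epoch {expected}"
        )
```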

philbudne commented 3 months ago

This needs to be done while the current ILM indices are active (otherwise we could be modifying a "closed" index, but maybe that doesn't matter if we're doing incrementals of all ILM indices)?
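One way to check which backing indices are still writable before operating on them; a sketch only, assuming an elasticsearch-py client and an index pattern of "mc_search-*" (both guesses, not taken from this thread):

```python
# Sketch: report the ILM phase of each backing index before editing
# documents in place. Client setup and index pattern are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.ilm.explain_lifecycle(index="mc_search-*")
for name, info in resp["indices"].items():
    # indices past the hot phase may be read-only, depending on policy
    print(name, info.get("managed"), info.get("phase"))
```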

philbudne commented 3 months ago

download_ids.txt

pgulley commented 3 months ago

So, we have 50k frankenstein stories that were ingested with the URL from Dec 26, but bodies from a chunk of time earlier in 2021, right? Is the fix to directly edit the stories in elasticsearch, to put the heads back on the correct bodies? Or something else? I'm not sure I fully understand why we shouldn't just delete the offending stories and re-run; are you saying that we would risk introducing dups if we do that?

philbudne commented 3 months ago

So, we have 50k frankenstein stories that were ingested with the URL from Dec 26, but bodies from a chunk of time earlier in 2021, right?

Yes.

But, preliminary weasel words: "up to 50K". It's possible that for some of the download_ids there wasn't a file from earlier in 2021 either.

Is the fix to directly edit the stories in elasticsearch, to put the heads back on the correct bodies? Or something else?

Not sure how to go about any kind of "in place" surgery.

I'm not sure I fully understand why we shouldn't just delete the offending stories and re-run- are you saying that we would risk introducing dups if we do that?

Here's my chain of thought:

IF the url had previously been indexed, the new franken-story would have been rejected, leaving the existing entry OK as-is.

AND all the affected URLs are ones for which we don't have an S3 object saved in December 2021, so if a URL had been previously indexed, deleting its entry would mean losing data.

What I'm thinking is the right thing is:

  1. For each URL, doing a lookup by _id (hash of the canonical URL)
  2. Reporting if the entry was created on/around 2024-08-05

That would put us in a better position to decide on further action (deleting the entry). A minimal sketch of that audit follows.
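A minimal sketch, assuming an elasticsearch-py client, that the document _id is a sha256 hex digest of the canonical URL, that the URL is the first column of no-epoch.csv, and that the creation time is stored in an indexed_date field; the real hashing scheme, index name, and field names live in the story-indexer code:

```python
# Sketch of the lookup-and-report audit; not the actual tooling.
# The hashing scheme, index name, CSV layout, and date field are all
# assumptions, not confirmed by this thread.
import csv
import hashlib

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "mc_search"  # placeholder index/alias name


def story_id(url: str) -> str:
    # assumed _id scheme; the real one is defined in story-indexer
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


with open("no-epoch.csv") as f:
    for row in csv.reader(f):
        url = row[0]  # assumed column layout
        resp = es.options(ignore_status=404).get(index=INDEX, id=story_id(url))
        if not resp["found"]:
            continue  # URL never made it into the index
        created = str(resp["_source"].get("indexed_date", ""))
        # flag entries created around the bad import (2024-08-05/06)
        if created.startswith(("2024-08-05", "2024-08-06")):
            print("suspect entry:", url, created)
```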

philbudne commented 3 months ago

Preliminary numbers:

pbudne@ifill:~/no-epoch$ zcat no-epoch-msgs.gz | wc -l
50584

pbudne@ifill:~/no-epoch$ wc -l no-epoch.csv 
49682 no-epoch.csv

pbudne@ifill:~/no-epoch$ wc -l estest.out 
41820 estest.out

pbudne@ifill:~/no-epoch$ grep 2024-08-06T estest.out | wc -l 
17207

pbudne@ifill:~/no-epoch$ grep -v 2024-08-06T estest.out | wc -l
 24613

So: of the 50584 errors, 49682 were found in the 2021-12-26 CSV file, and 41820 of those URLs were found in ES.

Of those, 17207 were imported on the fateful day, and 24613 were not.

Of the 17207 bad entries, here are the top 50 hostnames:

pbudne@ifill:~/no-epoch$ grep 2024-08-06T estest.out | sed -e 's@^.*//@@' -e 's@/.*$@@' | sort | uniq -c | sort -rn | head -50
   2412 EzineArticles.com
   1276 www.bignewsnetwork.com
    989 www.yjc.news
    368 reports.pr-inside.com
    362 www.rts.rs
    300 www.lecho.be
    296 news.chosun.com
    279 www.tijd.be
    245 www.soychile.cl
    216 thebaynet.com
    199 www.netgazete.com
    144 www.acorianooriental.pt
    143 www.einnews.com
    138 www.mynet.com
    131 www.wam.ae
    127 119.82.71.88
    121 news.hebei.com.cn
    120 www.mk.co.kr
    116 www.belta.by
    115 rtrs.tv:443
    111 world.einnews.com
    108 vnexpress.net
    105 www.pennsylvania.statenews.net
    103 www.realclearpolitics.com
    103 heb.hebei.com.cn
     96 www.wisconsin.statenews.net
     94 www1.folha.uol.com.br
     92 www.diocs.org
     92 house.hebei.com.cn
     91 www.michigan.statenews.net
     86 www.ohio.statenews.net
     81 bhc.hebei.com.cn
     76 www.dainiksaveratimes.com
     75 www.washington.statenews.net
     75 www.tennessee.statenews.net
     74 www.delaware.statenews.net
     74 315.hebei.com.cn
     71 www.indiasnews.net
     71 laconfidentialmag.com
     70 www.southeastasiapost.com
     68 www.tajikistannews.net
     68 house.goo.ne.jp
     65 www.florida.statenews.net
     65 www.abudhabinews.net
     64 www.southeastasianews.net
     64 www.louisiana.statenews.net
     64 www.ktvn.com
     62 www.japanherald.com
     62 www.in.gr
     61 www.dln.com

philbudne commented 3 months ago

I believe I've deleted the 31920 stories that might have indexed Epoch B HTML with the wrong URLs (at least trying to delete them again fails)!

The single-document delete call has an option to refresh the index afterward to expose the change, but I did not see that for the bulk operation I used, AND lookup by URL still works for the URLs I've tested!
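For reference, a hedged sketch of making bulk deletions visible: ES supports refresh on single-document deletes, and after a bulk operation an explicit index refresh exposes the changes to search. The index name and ids below are placeholders:

```python
# Sketch: bulk delete followed by an explicit refresh so the deletions
# become visible to search. Index name and _id values are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "mc_search"  # placeholder index/alias name

doc_ids = ["<hash1>", "<hash2>"]  # placeholder _id values

# raise_on_error=False so re-running (deleting already-deleted docs,
# which 404s) does not abort the whole batch
actions = ({"_op_type": "delete", "_index": INDEX, "_id": i} for i in doc_ids)
helpers.bulk(es, actions, raise_on_error=False)

# single-document deletes accept refresh=...; after a bulk we can
# refresh the index explicitly instead
es.indices.refresh(index=INDEX)
```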