This needs to be done while the current ILM index segments are active (otherwise we could be modifying a "closed" index segment, but maybe that doesn't matter if we're doing incrementals of all ILM index segments)?
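For what it's worth, a quick way to see which ILM-managed indices are still in an active/writable phase before touching anything is the ILM explain API. A minimal sketch, assuming the Python elasticsearch client, a local host, and an `mc_search-*` index pattern (all assumptions, not taken from this thread):

```python
# Sketch only: list the ILM phase of each backing index so we know which
# ones are still active before editing/deleting documents in them.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

resp = es.ilm.explain_lifecycle(index="mc_search-*")  # assumed index pattern
for name, info in resp["indices"].items():
    print(name, info.get("managed"), info.get("phase"))
```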
So, we have 50k frankenstein stories that were ingested with the URL from Dec 26, but bodies from a chunk of time earlier in 2021, right? Is the fix to directly edit the stories in elasticsearch, to put the heads back on the correct bodies? Or something else? I'm not sure I fully understand why we shouldn't just delete the offending stories and re-run: are you saying that we would risk introducing dups if we do that?
So, we have 50k frankenstein stories that were ingested with the URL from Dec 26, but bodies from a chunk of time earlier in 2021, right?
Yes.
But, preliminary weasel words: "up to 50k". It's possible that for some of the download_ids there wasn't a file from earlier in 2021 either.
Is the fix to directly edit the stories in elasticsearch, to put the heads back on the correct bodies? Or something else?
Not sure how to go about any kind of "in place" surgery.
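If we did want to attempt it, the closest thing I can think of is a partial document update. A minimal sketch, where the index name, document _id, and field name are all hypothetical placeholders (the real schema may differ):

```python
# Sketch only: "in place" surgery via a partial update of one story document.
# The index name, document _id, and field name are all hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

es.update(
    index="mc_search-000001",                  # hypothetical backing index
    id="HYPOTHETICAL_DOCUMENT_ID",
    doc={"url": "https://example.com/corrected-url"},  # hypothetical field
)
```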
I'm not sure I fully understand why we shouldn't just delete the offending stories and re-run: are you saying that we would risk introducing dups if we do that?
Here's my chain of thought:
IF the URL had previously been indexed, the new franken-story would have been rejected, leaving the existing entry OK as is.
AND all the affected URLs are ones for which we don't have an S3 object saved in December 2021, so if it had been previously indexed, we would be losing data.
What I'm thinking the right thing to do is: look up each of the affected URLs in ES and check when (if ever) it was indexed. That would place us in a position to better make a choice on further action (deleting the entry).
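A minimal sketch of that lookup, assuming the URLs are a column in the 2021-12-26 CSV and that documents carry `url` and `indexed_date` fields (the column/field names and index pattern are assumptions):

```python
# Sketch only: for each affected URL, ask ES whether it is indexed and when,
# so we can separate "imported on the fateful day" from "imported earlier".
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

with open("no-epoch.csv") as f:
    for row in csv.DictReader(f):
        url = row["url"]  # assumed column name
        hits = es.search(
            index="mc_search-*",            # assumed index pattern
            query={"term": {"url": url}},   # assumed field name/type
            size=1,
        )["hits"]["hits"]
        print(url, hits[0]["_source"].get("indexed_date") if hits else "NOT FOUND")
```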
Preliminary numbers:
pbudne@ifill:~/no-epoch$ zcat no-epoch-msgs.gz | wc -l
50584
pbudne@ifill:~/no-epoch$ wc -l no-epoch.csv
49682 no-epoch.csv
pbudne@ifill:~/no-epoch$ wc -l estest.out
41820 estest.out
pbudne@ifill:~/no-epoch$ grep 2024-08-06T estest.out | wc -l
17207
pbudne@ifill:~/no-epoch$ grep -v 2024-08-06T estest.out | wc -l
24613
So: of 50584 errors, 49682 were found in the 2021-12-26 csv file, and 41820 of those URLs were found in ES.
Of those, 17207 were imported on the fateful day, 24613 were not.
Of the 17207 bad entries, here are the top 50 full URL domains:
pbudne@ifill:~/no-epoch$ grep 2024-08-06T estest.out | sed -e 's@^.*//@@' -e 's@/.*$@@' | sort | uniq -c | sort -rn | head -50
2412 EzineArticles.com
1276 www.bignewsnetwork.com
989 www.yjc.news
368 reports.pr-inside.com
362 www.rts.rs
300 www.lecho.be
296 news.chosun.com
279 www.tijd.be
245 www.soychile.cl
216 thebaynet.com
199 www.netgazete.com
144 www.acorianooriental.pt
143 www.einnews.com
138 www.mynet.com
131 www.wam.ae
127 119.82.71.88
121 news.hebei.com.cn
120 www.mk.co.kr
116 www.belta.by
115 rtrs.tv:443
111 world.einnews.com
108 vnexpress.net
105 www.pennsylvania.statenews.net
103 www.realclearpolitics.com
103 heb.hebei.com.cn
96 www.wisconsin.statenews.net
94 www1.folha.uol.com.br
92 www.diocs.org
92 house.hebei.com.cn
91 www.michigan.statenews.net
86 www.ohio.statenews.net
81 bhc.hebei.com.cn
76 www.dainiksaveratimes.com
75 www.washington.statenews.net
75 www.tennessee.statenews.net
74 www.delaware.statenews.net
74 315.hebei.com.cn
71 www.indiasnews.net
71 laconfidentialmag.com
70 www.southeastasiapost.com
68 www.tajikistannews.net
68 house.goo.ne.jp
65 www.florida.statenews.net
65 www.abudhabinews.net
64 www.southeastasianews.net
64 www.louisiana.statenews.net
64 www.ktvn.com
62 www.japanherald.com
62 www.in.gr
61 www.dln.com
I believe I've deleted the 31920 stories that might have indexed epoch B HTML with the wrong URLs (at least trying to delete them again fails)!
The single-document delete call has an option to refresh the index afterwards to expose the change, but I did not see that for the bulk operation I used, AND lookup by URL still works for the URLs I've tested!
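If it ever matters, an explicit refresh can be issued after the bulk deletes to make them visible to search right away. A minimal sketch (host and index pattern are assumptions):

```python
# Sketch only: force a refresh so recent deletes become visible to search.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host
es.indices.refresh(index="mc_search-*")      # assumed index pattern
```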
BLEH! My over-the-weekend fix to hist-fetcher's epoch handling, to allow Dec 26, 2021 (dawn of Epoch D) to be processed, removed the code that raised an exception to quarantine the Stories, and the result is that for about 50K stories, the URL from Epoch D was just submitted along with page contents from Epoch B (earlier in 2021). I have the offending downloads_id numbers; from there we can get the URLs from the CSV file. BUT I don't think just deleting the url/story from the Index is the right thing, because if the URL was a dup, the new story (with the wrong text) would have been rejected, and the entry already in the index would be a good one we'd be throwing away!!!
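For context, the kind of guard that got removed looks roughly like this. This is a purely hypothetical reconstruction (names and signature invented), not the actual hist-fetcher code:

```python
# Hypothetical sketch of the removed safety check: if the HTML object fetched
# for a download_id doesn't come from the epoch the URL belongs to, quarantine
# the Story instead of indexing a URL/body mismatch.
class EpochMismatch(Exception):
    """Raised so the pipeline can quarantine the Story."""

def check_epochs(url_epoch: str, html_epoch: str) -> None:
    if url_epoch != html_epoch:
        raise EpochMismatch(f"URL epoch {url_epoch!r} != HTML epoch {html_epoch!r}")
```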