Closed hroberts closed 8 years ago
Thanks for the writeup, your proposed approach seems worth trying to me.
change the story_sentences_dup index to use digest( sentence, 'md5' ) as the functional index to make the index more space efficient than the md5( sentence ) functional index we use now, which stores the md5 as text;
If disk space is a concern, you can SUBSTRING() the binary MD5 hash from 16 bytes down to 8, further reducing the index size while still keeping 64 bits of entropy in the hash.
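For reference, here's how the three key widths compare, sketched in Python with hashlib standing in for Postgres's md5() / pgcrypto's digest() (the sample sentence is arbitrary):

```python
import hashlib

s = "Sentence 1."

md5_text = hashlib.md5(s.encode("utf-8")).hexdigest()  # what md5(sentence) stores: hex text
md5_bin = hashlib.md5(s.encode("utf-8")).digest()      # digest(sentence, 'md5'): raw bytes
md5_half = md5_bin[:8]                                 # SUBSTRING()ed down to the first half

assert len(md5_text) == 32  # 32 bytes of hex text per key
assert len(md5_bin) == 16   # 16 bytes of bytea per key
assert len(md5_half) == 8   # 8 bytes per key, still 64 bits of entropy
assert md5_text == md5_bin.hex()
```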
move story_sentences.is_dup into a separate table that consists of just story_sentences_id and is_dup -- this will allow us to still query this information relatively quickly, since we can do the is_dup lookup as an index-only scan, without having to update the heavily contended story_sentences table; my quick calculations say that this table should require around 30G of space.
With a table-wide (UNIQUE, I presume) index in place, maybe leave is_dup where it is and instead replace COPY with an upsert (either the atomic 9.5 flavor of it or one of the well-documented older approaches)?
-- Try INSERTing all (duplicate and non-duplicate) sentences here
INSERT INTO story_sentences (stories_id, sentence)
VALUES (42, 'Sentence 1.'), (42, 'Sentence 2.'), ...
-- For sentences that would break the UNIQUE constraint...
ON CONFLICT ON CONSTRAINT story_sentences_dup
-- ...set "is_dup"
DO UPDATE SET is_dup = 't'
-- Return affected sentences (if the Perl code still cares about that;
-- with DO UPDATE this includes both inserted and updated rows)
RETURNING *
;
We would sacrifice some speed (COPY doesn't seem to support upserts) but there wouldn't be any need to do the SELECT and INSERT later on.
A couple of challenges for the unique index / upsert approach:
I think it would take on the order of a week rather than a couple of days (and maybe much longer) to add that unique index to the table.
Also, I had planned on using an index on the sentence hash alone and then filtering on the publish date and media_id, to save storage space for the index (this is what we do now). My top-of-the-head guess is that the multi-column index will be at least twice as big as the single-column index.
-hal
In my tests with ~500k sample sentences, the "binary MD5" index took exactly the same amount of disk space as the "binary MD5 split in half + publish week + media ID" index:
-- Remove duplicates
DELETE FROM story_sentences
WHERE story_sentences_id IN (
    SELECT story_sentences_id
    FROM (
        SELECT story_sentences_id,
               ROW_NUMBER() OVER (
                   PARTITION BY
                       media_id,
                       week_start_date( publish_date::date ),
                       sentence
                   ORDER BY story_sentences_id
               ) AS rnum
        FROM story_sentences
    ) AS t
    WHERE t.rnum > 1
);
CREATE INDEX story_sentences_dup_binary ON story_sentences (
digest( sentence, 'md5' )
);
CREATE UNIQUE INDEX story_sentences_dup_binary_half_week_media_id ON story_sentences (
SUBSTRING(digest( sentence, 'md5' ) FROM 0 FOR 9), -- first 8 bytes (half) of the sentence's MD5 hash
week_start_date( publish_date::date ),
media_id
);
VACUUM FULL ANALYZE;
SELECT nspname || '.' || relname AS relation,
pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class AS c
LEFT JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
WHERE relname = 'story_sentences' OR relname LIKE 'story_sentences_dup%'
ORDER BY pg_relation_size(c.oid) DESC
;
relation | size
------------------------------------------------------+--------
public.story_sentences | 103 MB
public.story_sentences_dup_binary_half_week_media_id | 20 MB
public.story_sentences_dup_binary | 20 MB
(3 rows)
If we assume that the index grows linearly (I'm not sure that it will), both indexes should take up about 220 GB of disk space with the current production data.
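For a back-of-the-envelope check of that extrapolation (the production row count below is a made-up figure, not from the thread; it's only there to show the arithmetic that would land near 220 GB):

```python
# Measured in the ~500k-row test above: each index took 20 MB.
sample_rows = 500_000
sample_index_bytes = 20 * 1024**2

bytes_per_row = sample_index_bytes / sample_rows  # ~42 bytes per index entry

# Hypothetical production row count (an assumption, not a measurement)
# that would put a linearly growing index at roughly 220 GB:
production_rows = 5_600_000_000
projected_gb = production_rows * bytes_per_row / 1024**3  # ~218.75 GB
```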
The "binary MD5 split in half + publish week + media ID" index allows us to do the 9.5 upsert trick as follows:
INSERT INTO story_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language)
-- Re-insert the last sentence in the table to trigger the constraint conflict
SELECT stories_id, sentence_number, sentence, media_id, publish_date, language
FROM story_sentences
ORDER BY story_sentences_id DESC
LIMIT 1
ON CONFLICT (
-- For whatever reason the database doesn't accept "story_sentences_dup_binary_half_week_media_id"
-- as a constraint name, so we have to spell out the whole index expression
SUBSTRING(digest(sentence, 'md5'::text), 0, 9),
week_start_date(publish_date::date),
media_id
)
DO UPDATE SET is_dup = 't'
;
I'm not sure what to do about the time it would take to create said unique index, though. It's also unclear how this would work if we partitioned the story_sentences table (unique constraints / indexes can only be added to individual partitions of the table, not to the whole partitioned table).
After a second thought (considering future partitioning ambitions and such), I think your originally proposed approach (a system-wide non-unique index on md5(sentence), maybe moving is_dup to another table) would work better.
I'd only further reduce the index size by indexing only half of the MD5 hash:
-- IMMUTABLE so that the function can be used in an index expression
CREATE FUNCTION half_md5 (string TEXT) RETURNS bytea AS $$
    SELECT SUBSTRING(digest(string, 'md5'::text), 0, 9);
$$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX story_sentences_sentence_half_md5 ON story_sentences (half_md5(sentence));
-- Still race condition-prone, but UPDATE and subsequent INSERT are tightly
-- coupled, so maybe the number of duplicate sentences that get into the
-- table won't be that big?
WITH new_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language) AS (VALUES
(4549, 1, 'Foo.', 5, NOW(), 'en'),
(4549, 2, 'Bar.', 5, NOW(), 'en'),
(4549, 3, 'Baz.', 5, NOW(), 'en')
-- , ...
),
duplicate_sentences AS (
UPDATE story_sentences
SET is_dup = 't'
-- , disable_triggers = 't'
FROM new_sentences
WHERE half_md5(story_sentences.sentence) = half_md5(new_sentences.sentence)
AND week_start_date( story_sentences.publish_date::date ) = week_start_date( new_sentences.publish_date::date )
AND story_sentences.media_id = new_sentences.media_id
RETURNING story_sentences.sentence
)
INSERT INTO story_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language)
SELECT stories_id, sentence_number, sentence, media_id, publish_date, language
FROM new_sentences
WHERE sentence NOT IN (
SELECT sentence
FROM duplicate_sentences
)
;
Great work on the half MD5. The smaller index should make the lookups faster as well.
I agree we should go with this approach.
-hal
So I'll go ahead and start creating the half_md5(sentence) index concurrently on mcdb1? It should take some time to finish; we could make the appropriate code changes meanwhile.
yes.
I would give the session at least 8G of memory to give it some hope of completing in less than a week.
-hal
Will the background CREATE INDEX CONCURRENTLY care about the work_mem setting?
I tried creating the index on Saturday morning, but it just locked the table for 10 minutes and didn't return, so I figured I wouldn't ruin my weekend and Ctrl+C'ed it. I'll try again today.
Started creating the half_md5(sentence) index. CREATE INDEX CONCURRENTLY doesn't seem to be blocking anything.
great. I've run very long concurrent index creations without ever having blocking problems. I've given up on some after weeks, though. I think sometimes it is slow enough that it just can't ever catch up. But the only reasonable alternative is to shut down extraction for a few days while we create the new index, so it's worth a try.
-hal
Started creating the half_md5(sentence) index. CREATE INDEX CONCURRENTLY doesn't seem to be blocking anything.
Woo!
mediacloud=# CREATE INDEX CONCURRENTLY story_sentences_sentence_half_md5 ON story_sentences (half_md5(sentence));
CREATE INDEX
Time: 124499632.685 ms
Strangely enough, CREATE INDEX only took one and a half days and used up 173 GB.
I'll go rewrite the story insertion code to use the new index then.
great! should be pretty trivial -- you'll just have to replace the perl logic that picks the query form based on the age of the story with a simple single query.
The small size of the index is terrific. Small size == better performance as well as less space use.
-hal
Well, the approach didn't work; the deadlocks are still there. They're even pretty easy to replicate on a laptop with a sample dump: just start RabbitMQ, add a lot of extraction jobs to the queue, and run multiple extractor workers.
Here's a sample (reasons why the deadlock happens vary slightly): https://gist.github.com/pypt/f2c3791606f0954ef35a707842f43f31
I think I'll go try getting rid of disable_triggers and moving is_dup to a separate table, and see what happens.
Increased deadlock_timeout from the default of 1s to 5s on my local machine and tried hitting the extractor hard -- not a single deadlock.
Maybe that's the answer? With the new table-wide index, checking for duplicates in old stories should be way faster but still slow due to the table size and the number of extractors, so we just have to make the server wait a little longer before pronouncing a deadlock?
huh!
I thought the setting that controlled this was a different one. Yes, we should set this much higher, on the order of 10 seconds on mcdb1.
I have done so:
mediacloud=# SHOW deadlock_timeout;
 deadlock_timeout
------------------
 10s
(1 row)
-hal
As discussed on Slack, increasing deadlock_timeout and the new index didn't help much. The strange part is that the deadlock only happens when extracting old stories set up for reextraction, not the new ones that just came in from the crawler.
Another bizarre thing is that I have tested the new query on my local machine (running 9.5) somewhat extensively and didn't hit a single deadlock. So, is it possible that 9.5 has some improvements in this whole deadlock department?
Here's a sample deadlocking query and EXPLAIN ANALYZE (the output of EXPLAIN ANALYZE is not from the sample query). As previously, the database spends the most time waiting at the UPDATE (and usually deadlocks).
I've considered reverting to the previous implementation but didn't see it as a good option because it had pretty much identical problems. Moving is_dup to a separate table is probably the way out in the long term, but it would involve lengthy structural changes to the story_sentences table.
So, as a temporary measure, I have added an advisory lock on media_id around the "UPDATE-then-INSERT" query and made the reextraction script add stories to the queue in random order to disperse the distribution of media_ids.
Advantages of manual advisory locks:
(list garbled in the original; a surviving fragment mentions media_id = 26 (Sacramento Bee?), probably because it has many identical sentences in each story)
Disadvantages:
(list garbled in the original; a surviving fragment ends with "media_id")
I think I'll leave this advisory lock in place for the time being until we figure out how to gracefully move is_dup to a new table.
The advisory locks seem like a good solution, even for the long term. I didn't know postgres could do that.
I'm okay with restricting threading to the media_id on an ongoing basis.
The main issue I think is figuring out why the update is still so slow for some of the old queries, even with the new index. Maybe stick in a debug statement that prints out the query text for any query that takes more than a minute to run?
Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.
We could make the advisory locking more robust by wrapping it in a function that assigns classes to the advisory locks. If we tried this in another context in the code right now, the media_ids used in this case might conflict with whatever id system is used for the other case.
-hal
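Hal's lock-class idea might be sketched like this in Python (the registry, helper name, and class numbers are all hypothetical; the only real Postgres piece is the two-argument pg_advisory_lock(int, int) form, whose first int4 can serve as the namespace):

```python
# Hypothetical registry of advisory lock "classes": each call site in the
# codebase gets its own class, so the media_ids locked here can never
# collide with whatever IDs some other feature locks.
LOCK_CLASSES = {
    "extract_story_sentences": 1,
    # "some_future_use": 2, ...
}

def advisory_lock_args(lock_class, object_id):
    """Map (class name, object ID) to the (key1, key2) int4 pair expected
    by Postgres's two-argument pg_advisory_lock(int, int)."""
    if lock_class not in LOCK_CLASSES:
        raise ValueError("unregistered advisory lock class: %s" % lock_class)
    if not 0 <= object_id <= 0x7FFFFFFF:
        raise ValueError("object_id must fit into a signed int4")
    return (LOCK_CLASSES[lock_class], object_id)

# Usage against a live connection would look something like:
#   db.query("SELECT pg_advisory_lock(%s, %s)",
#            *advisory_lock_args("extract_story_sentences", media_id))
```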
Stories are now getting (re)extracted at a rate of 30 stories/s, which I'd consider pretty fast.
As mentioned in #39, stories with media_id = 26 were holding advisory locks for a long time and making other extractors wait for those locks to be released. The EXPLAIN ANALYZE of the "UPDATE-then-INSERT" query for a story with media_id = 26 is a little different: the query (sample) spends most of the time looking for duplicate sentences and filtering them out by date range, not UPDATEing them afterwards. I'll try replacing the week_start_date() condition with a constant derived from the story's publish_date; maybe that will improve the speed. For the time being, I have moved stories with media_id = 26 to a separate table and will reextract them later.
Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.
UPDATEs are pretty fast for most of the stories now because somehow the advisory lock on media_id prevents them from locking each other out.
We could make the advisory locking more robust by wrapping it in a function that assigns classes to the advisory locks. If we tried this in another context in the code right now, the media_ids used in this case might conflict with whatever id system is used for the other case.
That's another disadvantage of advisory locks: 4. As the lock is set not on an ID or a string of some sort but on an arbitrary number, you can pretty much use them in only a single part of the code without reinventing your own locking logic, getting into creative arithmetic, coming up with unmaintainable code, and gradually going mad. Those locks seem to be working fine for the time being, but in the long term we should move away from them by moving is_dup (or anything else that needs to be UPDATEd) to a separate table.
On 7/8/16 2:15 PM, Linas Valiukas wrote:
Stories are now getting (re)extracted at 30 stories/s rate which I'd consider to be pretty fast.
This seems good to me, especially including the historical stuff.
Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.
UPDATEs are pretty fast for most of the stories now because somehow the advisory lock on media_id prevents them from locking each other out.
It's fine to leave this be, but on principle, if the select for the lock is there, the update should be in the list, no? As soon as one update finishes, its lock query should disappear and another update should start.
That's another disadvantage of advisory locks: 4. As the lock is set not on an ID or a string of some sort but on an arbitrary number, you can pretty much use them in only a single part of the code without reinventing your own locking logic, getting into creative arithmetic, coming up with unmaintainable code, and gradually going mad.
I'm not as worried about this. We can just make a practice of using some intermediate helper method in the db object, and if we forget the only downside is some more unnecessary waiting.
Those locks seem to be working fine for the time being, but in the long term we should be moving away from them by trying to move is_dup (or anything else that needs to be UPDATEd) to a separate table.
I agree that the ideal would be to have no locking and no big complex updates.
-hal
Hal Roberts
Fellow
Berkman Center for Internet & Society
Harvard University
I've remade the week_start_date() condition into a constant (e3c4cb9) and dropped the old story_sentences_dup index. As for the idea of moving is_dup to a separate table, I've added it to my "someday, unless it causes immediate problems" list in my profanity-ridden Trello.
Closing then.
It's fine to leave this be, but on principle, if the select for the lock is there, the update should be in the list, no? As soon as the update finishes, a lock query should disappear and start another update.
I suppose that due to the manual locking, the UPDATE doesn't have to wait for anything to do its job, so it's no longer in the non-granted locks list.
We have 2 problems with the way we handle story_sentences deduping:
After some pondering and poking, I think the best solution to this problem is to:
If we make the above changes, we are still open to the possibility of duplicate sentences sneaking through, because we're not locking the span of time between checking for dups and inserting the sentences into story_sentences. We can mitigate this risk by making sure we do the sentence inserts as quickly as possible after the dup lookup, possibly in a plpgsql function that accepts a media_id, publish_date, and array of sentences, looks up dups, inserts non-dup sentences, and inserts the dup story_sentences_ids into the story_sentence_first_dups table.
This setup would leave us open to a race condition allowing dups to fall through, but my intuition is that with the universal sentence MD5 index the sequence of operations will be fast enough that race-condition dups will be rare, which would be fine. The way to test this is to run it for some time and check how many duplicate sentences we're getting after a couple of days. If we're getting too many dups, we could just stick a lock around the whole plpgsql function to make it single-threaded. Again, I think that chunk of code would be fast enough that single-threading it would not hurt the overall processing time too much.
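The control flow of that proposed function, sketched in Python rather than PL/pgSQL (the half_md5 keying matches the index discussed above, but the in-memory set and lists are stand-ins for the real tables; the story_sentence_first_dups name follows the comment above):

```python
import hashlib

def half_md5(sentence):
    # Mirrors the SQL half_md5(): first 8 bytes of the MD5 digest.
    return hashlib.md5(sentence.encode("utf-8")).digest()[:8]

def insert_sentences(media_id, publish_week, sentences, existing_keys,
                     story_sentences, first_dups):
    """The three steps the proposed plpgsql function would do:
    look up dups, insert non-dup sentences, record the dups
    (here against in-memory stand-ins for the tables)."""
    for sentence in sentences:
        key = (half_md5(sentence), publish_week, media_id)
        if key in existing_keys:
            first_dups.append(sentence)       # -> story_sentence_first_dups
        else:
            existing_keys.add(key)
            story_sentences.append(sentence)  # -> story_sentences

existing, inserted, dups = set(), [], []
insert_sentences(26, "2016-07-04", ["Foo.", "Bar.", "Foo."],
                 existing, inserted, dups)
assert inserted == ["Foo.", "Bar."]
assert dups == ["Foo."]
```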