mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

fix dup story sentences performance and deadlocks #28

Closed hroberts closed 8 years ago

hroberts commented 8 years ago

We have 2 problems with the way we handle story_sentences deduping:

After some pondering and poking, I think the best solution to this problem is to:

Even with the above changes, we are still open to the possibility of duplicate sentences sneaking through, because we don't hold a lock between checking for dups and inserting the sentences into story_sentences. We can mitigate this risk by doing the sentence inserts as quickly as possible after the dup lookup, possibly in a PL/pgSQL function that accepts a media_id, a publish_date, and an array of sentences, looks up dups, inserts the non-dup sentences, and inserts the dup story_sentences_ids into the story_sentence_first_dups table.

This setup would leave us open to a race condition allowing dups to fall through, but my intuition is that with the universal sentence md5 index the sequence of operations will be fast enough that race-condition dups will be rare, which would be fine. The way to test this would be to run it for a while and check how many duplicate sentences we're getting after a couple of days. If we're getting too many dups, we could just stick a lock around the whole PL/pgSQL function to make it single threaded. Again, I think that chunk of code would be fast enough that single threading it would not hurt the overall processing time too much.
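
A minimal sketch of what such a PL/pgSQL function might look like (the function name, the extra stories_id parameter, the trimmed-down column list, and the single-column story_sentence_first_dups layout are illustrative assumptions, not a final design):

CREATE OR REPLACE FUNCTION insert_deduped_story_sentences(
    _stories_id   INT,
    _media_id     INT,
    _publish_date TIMESTAMP,
    _sentences    TEXT[]
) RETURNS VOID AS $$
DECLARE
    _sentence        TEXT;
    _sentence_number INT := 1;
    _dup_id          BIGINT;
BEGIN
    FOREACH _sentence IN ARRAY _sentences LOOP

        -- Dup lookup via the functional hash index on the sentence,
        -- narrowed down by media source and publish week
        SELECT story_sentences_id
        INTO _dup_id
        FROM story_sentences
        WHERE md5( sentence ) = md5( _sentence )
          AND media_id = _media_id
          AND week_start_date( publish_date::date ) = week_start_date( _publish_date::date )
        LIMIT 1;

        IF _dup_id IS NULL THEN
            -- Not a dup: insert the sentence right away
            INSERT INTO story_sentences ( stories_id, sentence_number, sentence, media_id, publish_date )
            VALUES ( _stories_id, _sentence_number, _sentence, _media_id, _publish_date );
        ELSE
            -- Dup: record the story_sentences_id of the first occurrence instead
            INSERT INTO story_sentence_first_dups ( story_sentences_id )
            VALUES ( _dup_id );
        END IF;

        _sentence_number := _sentence_number + 1;

    END LOOP;
END;
$$ LANGUAGE plpgsql;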

pypt commented 8 years ago

Thanks for the writeup, your proposed approach seems worth trying to me.

change the story_sentences_dup index to use digest( sentence, 'md5' ) as the functional index to make the index more space efficient than the md5( sentence ) functional index we use now, which stores the md5 as text;

If disk space is a concern, you can SUBSTRING() the binary MD5 hash down from 16 bytes to 8, further reducing the index size while still keeping 64 bits of entropy in the hash.
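
For example (assuming the pgcrypto extension, which provides digest()):

-- digest() returns the raw 16-byte MD5; keeping only the first 8 bytes halves
-- the indexed value while still leaving 64 bits of hash to tell sentences apart
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT digest( 'Some sentence.', 'md5' );                           -- 16 bytes
SELECT SUBSTRING( digest( 'Some sentence.', 'md5' ) FROM 1 FOR 8 ); -- first 8 bytes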

move story_sentences.is_dup into a separate table that consists of just story_sentences_id and is_dup -- this will allow us to still query this information relatively quickly, since we can do the is_dup lookup as an index only scan, but not require updating the heavily contested story_sentences table; my quick calculations say that this table should require around 30G of space.

With a table-wide (UNIQUE, I presume) index in place, maybe leave is_dup where it is and instead replace COPY with an upsert (either the atomic 9.5 flavor of it or one of the well-documented older approaches)?

-- Try INSERTing all (duplicate and non-duplicate) sentences here
INSERT INTO story_sentences (stories_id, sentence)
VALUES (42, 'Sentence 1.'), (42, 'Sentence 2.'), ...

-- For sentences that would break the UNIQUE constraint...
ON CONFLICT (story_sentences_dup)

-- ...set "is_dup"
DO UPDATE SET is_dup = 't'

-- Return sentences that actually got INSERTed (if Perl code still cares about that)
RETURNING *
;

We would sacrifice some speed (COPY doesn't seem to support upserts) but there wouldn't be any need to do SELECT and INSERT later on.

hroberts commented 8 years ago

A couple of challenges for the unique index / upsert approach:

I think it would take on the order of a week (and maybe much longer), rather than a couple of days, to add that unique index to the table.

Also, I had planned on using an index on the sentence hash alone and then just filtering by publish date and media_id, to save storage space on the index (this is what we do now). My top-of-the-head guess is that the multi-column index will be at least twice as big as the single-column index.

-hal

pypt commented 8 years ago

In my tests with ~500k sample sentences, the "binary MD5" index took exactly the same amount of disk space as the "binary MD5 split in half + publish week + media ID" index:

-- Remove duplicates
DELETE FROM story_sentences
WHERE story_sentences_id IN (
    SELECT story_sentences_id
    FROM (
        SELECT story_sentences_id,
               ROW_NUMBER() OVER (
                    PARTITION BY
                        media_id,
                        week_start_date( publish_date::date ),
                        sentence
                    ORDER BY story_sentences_id
                ) AS rnum
        FROM story_sentences) AS t
    WHERE t.rnum > 1
);

CREATE INDEX story_sentences_dup_binary ON story_sentences (
    digest( sentence, 'md5' )
);

CREATE UNIQUE INDEX story_sentences_dup_binary_half_week_media_id ON story_sentences (
    SUBSTRING(digest( sentence, 'md5' ) FROM 0 FOR 9),  -- 1/2 of sentence's MD5 hash
    week_start_date( publish_date::date ),
    media_id
);

VACUUM FULL ANALYZE;

SELECT nspname || '.' || relname AS relation,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class AS c
    LEFT JOIN pg_namespace AS n ON (n.oid = c.relnamespace)
WHERE relname = 'story_sentences' OR relname LIKE 'story_sentences_dup%'
ORDER BY pg_relation_size(c.oid) DESC
;

                       relation                       |  size  
------------------------------------------------------+--------
 public.story_sentences                               | 103 MB
 public.story_sentences_dup_binary_half_week_media_id | 20 MB
 public.story_sentences_dup_binary                    | 20 MB
(3 rows)

If we assume that the index grows linearly (I'm not sure it will), either index should take up around 220 GB of disk space with the current production data.

The "binary MD5 split in half + publish week + media ID" index allows us to do the 9.5 upsert trick as follows:

INSERT INTO story_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language)
    -- Last sentence in the table to trigger the constraint conflict
    SELECT stories_id, sentence_number, sentence, media_id, publish_date, language
    FROM story_sentences
    ORDER BY story_sentences_id DESC
    LIMIT 1
ON CONFLICT (
    -- For whatever reason the database doesn't accept "story_sentences_dup_binary_half_week_media_id"
    -- as a constraint name, so we have to spell out the whole index expression
    SUBSTRING(digest(sentence, 'md5'::text), 0, 9),
    week_start_date(publish_date::date),
    media_id
)
DO UPDATE SET is_dup = 't'
;

I'm not sure what to do about the time it would take to create that unique index, though. It's also unclear how this would work if we partitioned the story_sentences table (unique constraints / indexes can only be added to individual partitions, not to the whole partitioned table).
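
To illustrate the partitioning concern: with inheritance-style partitioning each child gets its own UNIQUE index, and nothing enforces uniqueness across children (the child table name below is made up):

CREATE TABLE story_sentences_partition_01 () INHERITS ( story_sentences );

-- Unique only within this one partition:
CREATE UNIQUE INDEX story_sentences_partition_01_dup
    ON story_sentences_partition_01 (
        SUBSTRING( digest( sentence, 'md5' ) FROM 0 FOR 9 ),
        week_start_date( publish_date::date ),
        media_id
    );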

pypt commented 8 years ago

After a second thought (considering future partitioning ambitions and such), I think your original proposed approach (system-wide non-unique index on md5(sentence), maybe moving is_dup to another table) would work better.

I'd only further reduce the index size by indexing only half of the MD5 hash:

CREATE FUNCTION half_md5 (string TEXT) RETURNS bytea AS $$
    SELECT SUBSTRING(digest(string, 'md5'::text), 0, 9);
$$ LANGUAGE SQL;

CREATE INDEX story_sentences_sentence_half_md5 ON story_sentences (half_md5(sentence));

-- Still race condition-prone, but UPDATE and subsequent INSERT are tightly
-- coupled, so maybe the number of duplicate sentences that get into the
-- table won't be that big?
WITH new_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language) AS (VALUES
    (4549, 1, 'Foo.', 5, NOW(), 'en'),
    (4549, 2, 'Bar.', 5, NOW(), 'en'),
    (4549, 3, 'Baz.', 5, NOW(), 'en')
    -- , ...
),
duplicate_sentences AS (
    UPDATE story_sentences
    SET is_dup = 't'
        -- , disable_triggers = 't'
    FROM new_sentences
    WHERE half_md5(story_sentences.sentence) = half_md5(new_sentences.sentence)
      AND week_start_date( story_sentences.publish_date::date ) = week_start_date( new_sentences.publish_date::date )
      AND story_sentences.media_id = new_sentences.media_id
    RETURNING story_sentences.sentence
)
INSERT INTO story_sentences (stories_id, sentence_number, sentence, media_id, publish_date, language)
SELECT stories_id, sentence_number, sentence, media_id, publish_date, language
FROM new_sentences
WHERE sentence NOT IN (
    SELECT sentence
    FROM duplicate_sentences
)
;

hroberts commented 8 years ago

Great work on the half md5. The smaller index should make the lookups faster as well.

I agree we should go with this approach.

-hal

pypt commented 8 years ago

So I'll go ahead and start creating the half_md5(sentence) index concurrently on mcdb1? It should take some time to finish; we could make the appropriate code changes in the meantime.

hroberts commented 8 years ago

yes.

I would give the session at least 8 GB of memory to give it some hope of completing in less than a week.

-hal

pypt commented 8 years ago

Will the background CREATE INDEX CONCURRENTLY care about work_mem setting?

I tried creating the index on Saturday morning, but it just locked up the table for 10 minutes and didn't return, so I figured I wouldn't ruin my weekend and Ctrl+C'ed it. I'll try again today.
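
(For what it's worth, CREATE INDEX is governed by maintenance_work_mem rather than work_mem; giving the suggested 8 GB to just the session that runs the build would look something like this:)

-- Raise the memory available to this session's index build only
SET maintenance_work_mem = '8GB';

CREATE INDEX CONCURRENTLY story_sentences_sentence_half_md5
    ON story_sentences ( half_md5( sentence ) );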

pypt commented 8 years ago

Started creating the half_md5(sentence) index. CREATE INDEX (CONCURRENTLY) doesn't seem to be blocking anything.

hroberts commented 8 years ago

great. I've run very long concurrent index creations without ever having blocking problems. I've given up on some after weeks, though. I think sometimes it is slow enough that it just can't ever catch up. But the only reasonable alternative is to shut down extraction for a few days while we create the new index, so it's worth a try.

-hal


pypt commented 8 years ago

Woo!

mediacloud=# CREATE INDEX CONCURRENTLY story_sentences_sentence_half_md5 ON story_sentences (half_md5(sentence));
CREATE INDEX
Time: 124499632.685 ms

Strangely enough, CREATE INDEX only took one and a half days and used up 173 GB.

I'll go rewrite the story insertion code to use the new index then.

hroberts commented 8 years ago

great! should be pretty trivial -- you'll just have to replace the perl logic that picks the query form based on the age of the story with a simple single query.

The small size of the index is terrific. Small size == better performance as well as less space use.

-hal

pypt commented 8 years ago

Well, the approach didn't work; the deadlocks are still there. They're even pretty easy to replicate on a laptop with a sample dump: just start RabbitMQ, add a lot of extraction jobs to the queue, and run multiple extractor workers.

Here's a sample (reasons why the deadlock happens vary slightly): https://gist.github.com/pypt/f2c3791606f0954ef35a707842f43f31

I think I'll go try getting rid of disable_triggers and moving is_dup to a separate table, see what happens.
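
A separate is_dup table could be as simple as the sketch below (table and column layout are illustrative, not a final schema):

CREATE TABLE story_sentence_dups (
    story_sentences_id BIGINT  NOT NULL PRIMARY KEY
        REFERENCES story_sentences ( story_sentences_id ),
    is_dup             BOOLEAN NOT NULL DEFAULT 'f'
);

-- Setting is_dup then touches only this small table instead of UPDATEing
-- rows in the huge, heavily contended story_sentences table.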

pypt commented 8 years ago

Increased deadlock_timeout from the default of 1s to 5s on my local machine, tried hitting the extractor hard - not a single deadlock.

Maybe that's the answer? With the new table-wide index, checking for duplicates in old stories should be way faster, but it's still slow because of the table size and the number of extractors, so perhaps we just have to make the server wait a little longer before pronouncing a deadlock?
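
(For reference, the setting can be changed cluster-wide without a restart; the 5s below is just the value I tested locally:)

-- In postgresql.conf:
--   deadlock_timeout = 5s
--
-- then ask the server to re-read its configuration:
SELECT pg_reload_conf();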

hroberts commented 8 years ago

huh!

I thought the setting that controlled this was a different one. Yes, we should set this much higher, on the order of 10 seconds on mcdb1.

I have done so:


mediacloud=# SHOW deadlock_timeout;
 deadlock_timeout
------------------
 10s
(1 row)

-hal

pypt commented 8 years ago

As discussed on Slack, increasing deadlock_timeout and adding the new index didn't help much. The strange part is that the deadlock only happens when extracting old stories set up for reextraction, not the new ones that just came in from the crawler.

Another bizarre thing is that I have tested the new query on my local machine (running 9.5) somewhat extensively and didn't hit a single deadlock. So, is it possible that 9.5 has some improvements around this whole deadlock business?

Here's a sample deadlocking query and EXPLAIN ANALYZE (the EXPLAIN ANALYZE output is not from the sample query). As previously, the database spends most of its time waiting at the UPDATE (and usually deadlocks there).

I've considered reverting to the previous implementation but didn't see it as a good option because it had pretty much identical problems. Moving is_dup to a separate table is probably the way out in the long term, but it would involve lengthy structural changes to the story_sentences table.

So, as a temporary measure, I have added an advisory lock on media_id around the "UPDATE-then-INSERT" query and made the reextraction script add stories to the queue in random order to disperse the distribution of media_ids.
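
Roughly, the idea looks like this (sketch only; the lock key 5 is just an example media_id):

-- Serialize the "UPDATE-then-INSERT" per media source: only one extractor at
-- a time may deduplicate sentences for a given media_id
SELECT pg_advisory_lock( 5 );     -- 5 = the story's media_id

-- ... UPDATE duplicate sentences' is_dup and INSERT the new sentences here ...

SELECT pg_advisory_unlock( 5 );   -- release as soon as the story is done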

Advantages of manual advisory locks:

  1. It works. No deadlocks for at least two hours; both new and old stories are being extracted normally (new ones without much of a delay, old ones usually waiting on the lock for media_id = 26 (Sacramento Bee?), probably because it has many identical sentences in each story)

Disadvantages:

  1. Still not a perfect solution; we should be able to INSERT / UPDATE anywhere in the table anytime.
  2. Slows down the processing of stories with the same media_id.
  3. General "unknown unknowns" with manual locks.

I think I'll leave this advisory lock for the time being until we figure out how to gracefully move is_dup to a new table.

hroberts commented 8 years ago

The advisory locks seem like a good solution, even for the long term. I didn't know postgres could do that.

I'm okay with restricting threading to the media_id on an ongoing basis.

The main issue, I think, is figuring out why the update is still so slow for some of the old queries, even with the new index. Maybe stick a debug statement in that prints out the query text for any query that takes more than a minute to run?

Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.

We could make the advisory locking more robust by wrapping it in a function that assigns classes to the advisory locks. If we tried this in another context in the code right now, the media_ids used in this case might conflict with whatever id system is used for the other case.
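
Something like the two-key form of advisory locks could serve as that wrapper (the class numbers below are made up for illustration):

-- A (class, object ID) pair keeps different parts of the code from colliding
-- on the same raw media_id
SELECT pg_advisory_lock( 1, 5 );     -- class 1 = story_sentences dedup, 5 = media_id
-- ... dedup work ...
SELECT pg_advisory_unlock( 1, 5 );

-- Another subsystem would claim its own lock class:
SELECT pg_advisory_lock( 2, 5 );
SELECT pg_advisory_unlock( 2, 5 );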

-hal

pypt commented 8 years ago

Stories are now getting (re)extracted at a rate of 30 stories/s, which I'd consider to be pretty fast.

As mentioned in #39, stories with media_id = 26 were holding advisory locks for a long time and making other extractors wait for those locks to be released. The EXPLAIN ANALYZE of the "UPDATE-then-INSERT" query for a story with media_id = 26 is a little bit different: the query (sample) spends most of its time looking for duplicate sentences and filtering them out by date range, not UPDATEing them afterwards. I'll try replacing the week_start_date() condition with a constant derived from the story's publish_date; maybe that will improve the speed. For the time being, I have moved stories with media_id = 26 to a separate table and will reextract them later.

Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.

UPDATEs are pretty fast for most of the stories now because somehow the advisory lock on media_id prevents them from locking each other out.

We could make the advisory locking more robust by wrapping it in a function that assigns classes to the advisory locks. If we tried this in another context in the code right now, the media_ids used in this case might conflict with whatever id system is used for the other case.

That's another disadvantage of advisory locks: 4. Because the lock is taken not on an ID or a string of some sort but on an arbitrary number, you can pretty much only use them in a single part of the code without reinventing your own locking logic, getting into creative arithmetic, writing unmaintainable code, and gradually going mad. The locks seem to be working fine for the time being, but in the long term we should move away from them by trying to move is_dup (or anything else that needs to be UPDATEd) to a separate table.

hroberts commented 8 years ago

On 7/8/16 2:15 PM, Linas Valiukas wrote:

Stories are now getting (re)extracted at a rate of 30 stories/s, which I'd consider to be pretty fast.

This seems good to me, especially including the historical stuff.

Also, looking at the current query list, I see a bunch of queries waiting for advisory locks, but I don't see any of the update queries. I don't understand how that could be the case.

UPDATEs are pretty fast for most of the stories now because somehow the advisory lock on media_id prevents them from locking each other out.

It's fine to leave this be, but on principle, if the select for the lock is there, the update should be in the list, no? As soon as an update finishes, its lock query should disappear and another update should start.

That's another disadvantage of advisory locks: 4. Because the lock is taken not on an ID or a string of some sort but on an arbitrary number, you can pretty much only use them in a single part of the code without reinventing your own locking logic, getting into creative arithmetic, writing unmaintainable code, and gradually going mad.

I'm not as worried about this. We can just make a practice of using some intermediate helper method in the db object, and if we forget the only downside is some more unnecessary waiting.

The locks seem to be working fine for the time being, but in the long term we should move away from them by trying to move is_dup (or anything else that needs to be UPDATEd) to a separate table.

I agree that the ideal would be to have no locking and no big complex updates.

-hal


pypt commented 8 years ago

I've remade the week_start_date() condition into a constant (e3c4cb9) and dropped the old story_sentences_dup index. As for the idea of moving is_dup to a separate table, I've added it to my "someday, unless it causes immediate problems" list in my profanity-ridden Trello.

Closing then.

It's fine to leave this be, but on principle, if the select for the lock is there, the update should be in the list, no? As soon as the update finishes, a lock query should disappear and start another update.

I suppose that, due to the manual locking, the UPDATE doesn't have to wait for anything to do its job, so it's no longer in the list of non-granted locks.