Make search for documents faster for large dataset

archiloque commented 1 year ago

We use Teedy to store lots of documents (currently 3.5 millions) and the query that search for documents has become very slow on PostgreSQL (between 5 and 10 seconds on my machine). It means a bad expercience when the API is called synchronously by the users, and it makes some batch tools we use very slow.

This is the query:

SELECT COUNT(*) AS RESULT_COUNT
FROM
    (SELECT DISTINCT D.DOC_ID_C C0,
            D.DOC_TITLE_C C1,
            D.DOC_DESCRIPTION_C C2,
            D.DOC_CREATEDATE_D C3,
            D.DOC_LANGUAGE_C C4,
            D.DOC_IDFILE_C,
            S.COUNT C5,
            F.COUNT C6,
            RS2.RTP_ID_C C7,
            RS2.RTP_NAME_C,
            D.DOC_UPDATEDATE_D C8
        FROM T_DOCUMENT D
        LEFT JOIN
            (SELECT COUNT(S.SHA_ID_C) COUNT,
                    AC.ACL_SOURCEID_C
                FROM T_SHARE S,
                    T_ACL AC
                WHERE AC.ACL_TARGETID_C = S.SHA_ID_C
                    AND AC.ACL_DELETEDATE_D IS NULL
                    AND S.SHA_DELETEDATE_D IS NULL
                GROUP BY AC.ACL_SOURCEID_C) S ON S.ACL_SOURCEID_C = D.DOC_ID_C
        LEFT JOIN
            (SELECT COUNT(F.FIL_ID_C) COUNT,
                    F.FIL_IDDOC_C
                FROM T_FILE F
                WHERE F.FIL_DELETEDATE_D IS NULL
                GROUP BY F.FIL_IDDOC_C) F ON F.FIL_IDDOC_C = D.DOC_ID_C
        LEFT JOIN
            (SELECT RS.*,
                    RS3.IDDOCUMENT
                FROM T_ROUTE_STEP RS
                JOIN
                    (SELECT R.RTE_IDDOCUMENT_C IDDOCUMENT,
                            RS.RTP_IDROUTE_C IDROUTE,
                            MIN(RS.RTP_ORDER_N) MINORDER
                        FROM T_ROUTE_STEP RS
                        JOIN T_ROUTE R ON R.RTE_ID_C = RS.RTP_IDROUTE_C
                        AND R.RTE_DELETEDATE_D IS NULL
                        WHERE RS.RTP_DELETEDATE_D IS NULL
                            AND RS.RTP_ENDDATE_D IS NULL
                        GROUP BY RS.RTP_IDROUTE_C,
                            R.RTE_IDDOCUMENT_C) RS3 ON RS.RTP_IDROUTE_C = RS3.IDROUTE
                AND RS.RTP_ORDER_N = RS3.MINORDER
                WHERE RS.RTP_IDTARGET_C IN ('administrators', 'f0c9b0b9-b624-4400-8cd5-84d5ec861bb6')) RS2 ON RS2.IDDOCUMENT = D.DOC_ID_C
        LEFT JOIN T_DOCUMENT_TAG DT0 ON DT0.DOT_IDDOCUMENT_C = D.DOC_ID_C
        AND DT0.DOT_IDTAG_C = '32b108c8-86a6-436f-9b58-9085b2385429'
        AND DT0.DOT_DELETEDATE_D IS NULL
        WHERE D.DOC_TITLE_C IN ('plop')
            AND (DT0.DOT_ID_C IS NOT NULL)
            AND D.DOC_DELETEDATE_D IS NULL
        ORDER BY C0 DESC) AS T1

This is the query plan.

Aggregate  (cost=1032120.09..1032120.10 rows=1 width=8)
  ->  Unique  (cost=1032120.05..1032120.08 rows=1 width=1160)
        ->  Sort  (cost=1032120.05..1032120.06 rows=1 width=1160)
              Sort Key: d.doc_id_c DESC, d.doc_description_c, d.doc_createdate_d, d.doc_language_c, d.doc_idfile_c, (count(s.sha_id_c)), (count(f.fil_id_c)), rs.rtp_id_c, rs.rtp_name_c, d.doc_updatedate_d
              ->  Nested Loop  (cost=575142.93..1032120.04 rows=1 width=1160)
                    ->  Nested Loop Left Join  (cost=575142.38..1032111.45 rows=1 width=1160)
                          Join Filter: ((r.rte_iddocument_c)::text = (d.doc_id_c)::text)
                          ->  Merge Left Join  (cost=575119.76..1032078.38 rows=1 width=652)
                                Merge Cond: ((d.doc_id_c)::text = (f.fil_iddoc_c)::text)
                                ->  Sort  (cost=179814.34..179814.35 rows=1 width=644)
                                      Sort Key: d.doc_id_c
                                      ->  Merge Left Join  (cost=174739.24..179814.33 rows=1 width=644)
                                            Merge Cond: ((d.doc_id_c)::text = (ac.acl_sourceid_c)::text)
                                            ->  Sort  (cost=8.58..8.59 rows=1 width=636)
                                                  Sort Key: d.doc_id_c
                                                  ->  Index Scan using idx_doc_title_c on t_document d  (cost=0.56..8.57 rows=1 width=636)
                                                        Index Cond: ((doc_title_c)::text = 'SubscriptionContract-328424'::text)
                                                        Filter: (doc_deletedate_d IS NULL)
                                            ->  Finalize GroupAggregate  (cost=174730.66..179318.28 rows=38996 width=45)
                                                  Group Key: ac.acl_sourceid_c
                                                  ->  Gather Merge  (cost=174730.66..178765.84 rows=32496 width=45)
                                                        Workers Planned: 2
                                                        ->  Partial GroupAggregate  (cost=173730.63..174014.97 rows=16248 width=45)
                                                              Group Key: ac.acl_sourceid_c
                                                              ->  Sort  (cost=173730.63..173771.25 rows=16248 width=127)
                                                                    Sort Key: ac.acl_sourceid_c
                                                                    ->  Hash Join  (cost=13.62..172594.25 rows=16248 width=127)
                                                                          Hash Cond: ((ac.acl_targetid_c)::text = (s.sha_id_c)::text)
                                                                          ->  Parallel Seq Scan on t_acl ac  (cost=0.00..164812.88 rows=2924710 width=74)
                                                                                Filter: (acl_deletedate_d IS NULL)
                                                                          ->  Hash  (cost=13.60..13.60 rows=2 width=90)
                                                                                ->  Seq Scan on t_share s  (cost=0.00..13.60 rows=2 width=90)
                                                                                      Filter: (sha_deletedate_d IS NULL)
                                ->  Finalize GroupAggregate  (cost=395305.41..809195.92 rows=3445448 width=45)
                                      Group Key: f.fil_iddoc_c
                                      ->  Gather Merge  (cost=395305.41..760054.52 rows=2937384 width=45)
                                            Workers Planned: 2
                                            ->  Partial GroupAggregate  (cost=394305.39..420007.50 rows=1468692 width=45)
                                                  Group Key: f.fil_iddoc_c
                                                  ->  Sort  (cost=394305.39..397977.12 rows=1468692 width=74)
                                                        Sort Key: f.fil_iddoc_c
                                                        ->  Parallel Seq Scan on t_file f  (cost=0.00..113344.53 rows=1468692 width=74)
                                                              Filter: (fil_deletedate_d IS NULL)
                          ->  Nested Loop  (cost=22.62..33.06 rows=1 width=598)
                                Join Filter: (((rs.rtp_idroute_c)::text = (rs_1.rtp_idroute_c)::text) AND (rs.rtp_order_n = (min(rs_1.rtp_order_n))))
                                ->  GroupAggregate  (cost=22.62..22.65 rows=1 width=184)
                                      Group Key: rs_1.rtp_idroute_c, r.rte_iddocument_c
                                      ->  Sort  (cost=22.62..22.63 rows=1 width=184)
                                            Sort Key: rs_1.rtp_idroute_c, r.rte_iddocument_c
                                            ->  Nested Loop  (cost=0.00..22.61 rows=1 width=184)
                                                  Join Filter: ((rs_1.rtp_idroute_c)::text = (r.rte_id_c)::text)
                                                  ->  Seq Scan on t_route_step rs_1  (cost=0.00..10.30 rows=1 width=94)
                                                        Filter: ((rtp_deletedate_d IS NULL) AND (rtp_enddate_d IS NULL))
                                                  ->  Seq Scan on t_route r  (cost=0.00..12.30 rows=1 width=180)
                                                        Filter: (rte_deletedate_d IS NULL)
                                ->  Seq Scan on t_route_step rs  (cost=0.00..10.38 rows=2 width=602)
                                      Filter: ((rtp_idtarget_c)::text = ANY ('{administrators,f0c9b0b9-b624-4400-8cd5-84d5ec861bb6}'::text[]))
                    ->  Index Scan using idx_dot_composite on t_document_tag dt0  (cost=0.56..8.58 rows=1 width=37)
                          Index Cond: (((dot_iddocument_c)::text = (d.doc_id_c)::text) AND ((dot_idtag_c)::text = '32b108c8-86a6-436f-9b58-9085b2385429'::text) AND (dot_deletedate_d IS NULL))
                          Filter: (dot_id_c IS NOT NULL)

I've tried to improve the query without changing the general behavior but failed.

One of the thing that makes the query very slow is the T_DOCUMENT -> T_FILE join that is needed to find the number of files contained in each document, it joins the files to the documents before the document are filtered, so it does lots of work that are not used in the final result.

As this join don't filter the lines but only add some columns, a possible approach is to split the query in two:

one that searches the documents and ignore the files
one that retrieves the files count info from the document ids

This is the updated document query:

SELECT COUNT(*) AS RESULT_COUNT
FROM
    (SELECT DISTINCT D.DOC_ID_C C0,
            D.DOC_TITLE_C C1,
            D.DOC_DESCRIPTION_C C2,
            D.DOC_CREATEDATE_D C3,
            D.DOC_LANGUAGE_C C4,
            D.DOC_IDFILE_C,
            S.COUNT C5,
            RS2.RTP_ID_C C7,
            RS2.RTP_NAME_C,
            D.DOC_UPDATEDATE_D C8
        FROM T_DOCUMENT D
        LEFT JOIN
            (SELECT COUNT(S.SHA_ID_C) COUNT,
                    AC.ACL_SOURCEID_C
                FROM T_SHARE S,
                    T_ACL AC
                WHERE AC.ACL_TARGETID_C = S.SHA_ID_C
                    AND AC.ACL_DELETEDATE_D IS NULL
                    AND S.SHA_DELETEDATE_D IS NULL
                GROUP BY AC.ACL_SOURCEID_C) S ON S.ACL_SOURCEID_C = D.DOC_ID_C
        LEFT JOIN
            (SELECT RS.*,
                    RS3.IDDOCUMENT
                FROM T_ROUTE_STEP RS
                JOIN
                    (SELECT R.RTE_IDDOCUMENT_C IDDOCUMENT,
                            RS.RTP_IDROUTE_C IDROUTE,
                            MIN(RS.RTP_ORDER_N) MINORDER
                        FROM T_ROUTE_STEP RS
                        JOIN T_ROUTE R ON R.RTE_ID_C = RS.RTP_IDROUTE_C
                        AND R.RTE_DELETEDATE_D IS NULL
                        WHERE RS.RTP_DELETEDATE_D IS NULL
                            AND RS.RTP_ENDDATE_D IS NULL
                        GROUP BY RS.RTP_IDROUTE_C,
                            R.RTE_IDDOCUMENT_C) RS3 ON RS.RTP_IDROUTE_C = RS3.IDROUTE
                AND RS.RTP_ORDER_N = RS3.MINORDER
                WHERE RS.RTP_IDTARGET_C IN ('administrators', 'f0c9b0b9-b624-4400-8cd5-84d5ec861bb6')) RS2 ON RS2.IDDOCUMENT = D.DOC_ID_C
        LEFT JOIN T_DOCUMENT_TAG DT0 ON DT0.DOT_IDDOCUMENT_C = D.DOC_ID_C
        AND DT0.DOT_IDTAG_C = '32b108c8-86a6-436f-9b58-9085b2385429'
        AND DT0.DOT_DELETEDATE_D IS NULL
        WHERE D.DOC_TITLE_C IN ('plop')
            AND (DT0.DOT_ID_C IS NOT NULL)
            AND D.DOC_DELETEDATE_D IS NULL
        ORDER BY C0 DESC) AS T1code

Aggregate  (cost=179856.04..179856.05 rows=1 width=8)
  ->  Unique  (cost=179856.00..179856.03 rows=1 width=1152)
        ->  Sort  (cost=179856.00..179856.01 rows=1 width=1152)
              Sort Key: d.doc_id_c DESC, d.doc_description_c, d.doc_createdate_d, d.doc_language_c, d.doc_idfile_c, (count(s.sha_id_c)), rs.rtp_id_c, rs.rtp_name_c, d.doc_updatedate_d
              ->  Nested Loop  (cost=174762.42..179855.99 rows=1 width=1152)
                    ->  Nested Loop Left Join  (cost=174761.86..179847.41 rows=1 width=1152)
                          Join Filter: ((r.rte_iddocument_c)::text = (d.doc_id_c)::text)
                          ->  Merge Left Join  (cost=174739.24..179814.33 rows=1 width=644)
                                Merge Cond: ((d.doc_id_c)::text = (ac.acl_sourceid_c)::text)
                                ->  Sort  (cost=8.58..8.59 rows=1 width=636)
                                      Sort Key: d.doc_id_c
                                      ->  Index Scan using idx_doc_title_c on t_document d  (cost=0.56..8.57 rows=1 width=636)
                                            Index Cond: ((doc_title_c)::text = 'SubscriptionContract-328424'::text)
                                            Filter: (doc_deletedate_d IS NULL)
                                ->  Finalize GroupAggregate  (cost=174730.66..179318.28 rows=38996 width=45)
                                      Group Key: ac.acl_sourceid_c
                                      ->  Gather Merge  (cost=174730.66..178765.84 rows=32496 width=45)
                                            Workers Planned: 2
                                            ->  Partial GroupAggregate  (cost=173730.63..174014.97 rows=16248 width=45)
                                                  Group Key: ac.acl_sourceid_c
                                                  ->  Sort  (cost=173730.63..173771.25 rows=16248 width=127)
                                                        Sort Key: ac.acl_sourceid_c
                                                        ->  Hash Join  (cost=13.62..172594.25 rows=16248 width=127)
                                                              Hash Cond: ((ac.acl_targetid_c)::text = (s.sha_id_c)::text)
                                                              ->  Parallel Seq Scan on t_acl ac  (cost=0.00..164812.88 rows=2924710 width=74)
                                                                    Filter: (acl_deletedate_d IS NULL)
                                                              ->  Hash  (cost=13.60..13.60 rows=2 width=90)
                                                                    ->  Seq Scan on t_share s  (cost=0.00..13.60 rows=2 width=90)
                                                                          Filter: (sha_deletedate_d IS NULL)
                          ->  Nested Loop  (cost=22.62..33.06 rows=1 width=598)
                                Join Filter: (((rs.rtp_idroute_c)::text = (rs_1.rtp_idroute_c)::text) AND (rs.rtp_order_n = (min(rs_1.rtp_order_n))))
                                ->  GroupAggregate  (cost=22.62..22.65 rows=1 width=184)
                                      Group Key: rs_1.rtp_idroute_c, r.rte_iddocument_c
                                      ->  Sort  (cost=22.62..22.63 rows=1 width=184)
                                            Sort Key: rs_1.rtp_idroute_c, r.rte_iddocument_c
                                            ->  Nested Loop  (cost=0.00..22.61 rows=1 width=184)
                                                  Join Filter: ((rs_1.rtp_idroute_c)::text = (r.rte_id_c)::text)
                                                  ->  Seq Scan on t_route_step rs_1  (cost=0.00..10.30 rows=1 width=94)
                                                        Filter: ((rtp_deletedate_d IS NULL) AND (rtp_enddate_d IS NULL))
                                                  ->  Seq Scan on t_route r  (cost=0.00..12.30 rows=1 width=180)
                                                        Filter: (rte_deletedate_d IS NULL)
                                ->  Seq Scan on t_route_step rs  (cost=0.00..10.38 rows=2 width=602)
                                      Filter: ((rtp_idtarget_c)::text = ANY ('{administrators,f0c9b0b9-b624-4400-8cd5-84d5ec861bb6}'::text[]))
                    ->  Index Scan using idx_dot_composite on t_document_tag dt0  (cost=0.56..8.58 rows=1 width=37)
                          Index Cond: (((dot_iddocument_c)::text = (d.doc_id_c)::text) AND ((dot_idtag_c)::text = '32b108c8-86a6-436f-9b58-9085b2385429'::text) AND (dot_deletedate_d IS NULL))
                          Filter: (dot_id_c IS NOT NULL)

The cost changed from 1032120 to 179856 and it now takes around 100 ms.

I added an index to the T_FILE to make finding files by documents id faster and its cost is not negligible (adding this index would also gives a small improvement to the original query but it's was not enough).

For the implementation I used a variation of what I did for https://github.com/sismics/docs/pull/582 .

When working on the query I noticed some guarding clauses like if (criteria.getTargetIdList() != null) were useless and made the code more confusing because I looked for situation where they are needed to be sure I didn't missed any case, so I removed them.

Feedbacks on the approach are welcome, I'm a bit sad that I had to use such a blunt solution.

jendib commented 1 year ago

We could also maintain a "fileCount" in t_document. I'm not sure which solution is the best.

archiloque commented 1 year ago

@jendib I have a bias against information duplication because I really dislike hunting down desynchronisation issues. On the other hand after checking the code it seems it could be done in com.sismics.docs.core.event.DocumentUpdatedAsyncEvent so maybe the desynchronisation risk is low in this case. On the other hand if I understand the architecture, it would introduce a delay before the file count change is visible in the API, I don't know if it's acceptable or if there is a better place to add the count update code.

In all cases I would be happy to implement it the way you prefer.

archiloque commented 1 year ago

@jendib I'd like to find a way to fix the original issue if possible, if you have a hard time to decide how to deal with it maybe we could have a chat / visio to discuss it?

jendib commented 1 year ago

I'm surprised the query cannot be optimized without doing this kind of trick (indexes or query optimization).

archiloque commented 1 year ago

I'm surprised the query cannot be optimized without doing this kind of trick (indexes or query optimization).

I've tried several approaches but couldn't figure a way to do it, if you have suggestions I would be happy to try them.

jendib commented 1 year ago

@archiloque I'm going to merge this, do you have anything you want to add before that?

archiloque commented 1 year ago

If you want we can discuss the approach (maybe using a visio if you have alternate ideas). I just merged the latest change in the PR and will do more tests tomorrow to check if everything is OK => I can tell you when I'm done with the validations and then you can do the merge afterward ?

archiloque commented 1 year ago

Finished all the tests and everything seems OK on my side

jendib commented 1 year ago

Thanks

sismics / docs

Make search for documents faster for large dataset #698