Closed archiloque closed 1 year ago
We could also maintain a "fileCount" in t_document. I'm not sure which solution is the best.
@jendib I have a bias against information duplication because I really dislike hunting down desynchronisation issues. On the other hand, after checking the code, it seems it could be done in com.sismics.docs.core.event.DocumentUpdatedAsyncEvent, so maybe the desynchronisation risk is low in this case.
However, if I understand the architecture correctly, it would introduce a delay before the file count change is visible in the API. I don't know if that is acceptable, or if there is a better place to add the count update code.
In any case, I would be happy to implement it the way you prefer.
@jendib I'd like to find a way to fix the original issue if possible. If you are having a hard time deciding how to deal with it, maybe we could have a chat or a video call to discuss it?
I'm surprised the query cannot be optimized without doing this kind of trick (indexes or query optimization).
I've tried several approaches but couldn't figure out a way to do it. If you have suggestions, I would be happy to try them.
@archiloque I'm going to merge this, do you have anything you want to add before that?
If you want, we can discuss the approach (maybe in a video call if you have alternative ideas). I just merged the latest changes into the PR and will do more tests tomorrow to check that everything is OK. I can tell you when I'm done with the validation, and then you can do the merge afterward?
Finished all the tests and everything seems OK on my side
Thanks
We use Teedy to store lots of documents (currently 3.5 million), and the query that searches for documents has become very slow on PostgreSQL (between 5 and 10 seconds on my machine). This means a bad experience when the API is called synchronously by users, and it makes some batch tools we use very slow.
This is the query:
This is the query plan.
I've tried to improve the query without changing the general behavior but failed.
One of the things that makes the query very slow is the T_DOCUMENT -> T_FILE join that is needed to find the number of files contained in each document. It joins the files to the documents before the documents are filtered, so it does a lot of work that is not used in the final result.
As this join doesn't filter the rows but only adds some columns, a possible approach is to split the query in two:
This is the updated document query:
The cost changed from 1032120 to 179856 and it now takes around 100 ms.
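To make the idea of the split easier to follow, here is a minimal in-memory sketch in Java. It is not the code from this PR and does not use Teedy's DAO or SQL. The class `TwoStepFileCount`, the types `Document` and `FileRow`, and the title filter are hypothetical stand-ins chosen for illustration. It assumes the second step only needs the file count for the documents returned by the first step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoStepFileCount {
    /** Hypothetical stand-in for a T_DOCUMENT row. */
    public static final class Document {
        public final String id;
        public final String title;
        public Document(String id, String title) { this.id = id; this.title = title; }
    }

    /** Hypothetical stand-in for a T_FILE row. */
    public static final class FileRow {
        public final String id;
        public final String documentId;
        public FileRow(String id, String documentId) { this.id = id; this.documentId = documentId; }
    }

    /** Step 1: filter the documents first, without looking at the files at all. */
    public static List<Document> searchDocuments(List<Document> documents, String titlePart) {
        List<Document> found = new ArrayList<>();
        for (Document d : documents) {
            if (d.title.contains(titlePart)) {
                found.add(d);
            }
        }
        return found;
    }

    /** Step 2: count files only for the document ids returned by step 1. */
    public static Map<String, Long> countFiles(List<FileRow> files, Collection<String> documentIds) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String id : documentIds) {
            counts.put(id, 0L); // documents without files keep a count of 0
        }
        for (FileRow f : files) {
            if (counts.containsKey(f.documentId)) {
                counts.merge(f.documentId, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Document> documents = Arrays.asList(
                new Document("d1", "invoice 2023"),
                new Document("d2", "contract"),
                new Document("d3", "invoice 2024"));
        List<FileRow> files = Arrays.asList(
                new FileRow("f1", "d1"),
                new FileRow("f2", "d1"),
                new FileRow("f3", "d2"));

        List<Document> found = searchDocuments(documents, "invoice");
        List<String> ids = new ArrayList<>();
        for (Document d : found) {
            ids.add(d.id);
        }
        Map<String, Long> counts = countFiles(files, ids);
        for (Document d : found) {
            System.out.println(d.id + " has " + counts.get(d.id) + " file(s)");
        }
    }
}
```

In the database version, step 2 would be a lookup of files by document id restricted to the ids from step 1, which is the kind of access an index on the document id column of T_FILE is meant to serve.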
I added an index to T_FILE to make finding files by document id faster, and its cost is not negligible (adding this index would also give a small improvement to the original query, but it was not enough).
For the implementation, I used a variation of what I did for https://github.com/sismics/docs/pull/582.
When working on the query, I noticed that some guard clauses like if (criteria.getTargetIdList() != null) were useless and made the code more confusing, because I had to look for situations where they were needed to be sure I didn't miss any case, so I removed them.
Feedback on the approach is welcome. I'm a bit sad that I had to use such a blunt solution.