pepkit / pepdbagent

Database for storing sample metadata
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Selecting projects metadata using limit and huge (100K) offset is very slow #87

Closed khoroshevskyi closed 11 months ago

khoroshevskyi commented 1 year ago

In geo namespace(more then 160K projects) I clicked End (to see the last projects in the geo namespace). It broke pephub. The reason of this slow execution time of the db query.

Reason:

Slow execution

To query description from the project value project_value ->> 'description' it takes more then a minute to execute this query (Query complete 00:01:10.094):

select name, namespace, tag, project_value ->> 'description' AS description from projects order by last_update_date limit 5 offset 100005; 

Fast execution

To query all project, without extracting description from project_value takes around second (Query complete 00:00:01.195)

select name, namespace, tag, project_value from projects order by last_update_date limit 5 offset 100005;

Proposed solutions

1) #71 2) Move description into separate column (easier solution:) )

nsheff commented 1 year ago

two questions:

  1. does removing the order by clause change the time?
  2. is there any way to add an index that would speed it up? like, indexing the project_value? not sure it's worth it...
  3. what about just not querying description here? it makes sense to me that for a global view, you'd only use the "meta" properties, not something inside the project... (I guess that means in the long run, moving description into its own column, as you suggest)
khoroshevskyi commented 1 year ago

1) Order by is not changing the time. 2) It's impossible, because project_value sometimes includes >1000 Samples, And project_value is the biggest column 3) If we won't query this information, than we won't have description in namespace window: image

khoroshevskyi commented 11 months ago

This issue was solved by adding new column description to the projects table in the database