serratus-bio / serratus.io

Front-end code for Serratus project website
https://serratus.io

RdRp Geo map - Switch from static CSV to API for Map Data #225

Closed · ababaian closed this issue 1 year ago

ababaian commented 1 year ago

The beta page for the RdRp Geo Map (https://serratus.io/geo), with code components in https://github.com/serratus-bio/serratus.io/tree/main/src/components/Geo, is a bit spartan.

The underlying data that the MapPlot.tsx component plots is stored in a static TSV file served from main/src/components/Geo/rdrp_pos.tsv; you can see the available columns here:

run_id   biosample_id    release_date    tax_id  scientific_name coordinate_x    coordinate_y    from_text
DRR021448   SAMD00018415    2015-07-14 10:38:15 "318829"    Pyricularia oryzae  -53.073466889   -10.769946429   brazil
DRR022241   SAMD00018962    2015-02-17 09:27:39 "565456"    Thujopsis dolabrata var. hondae 140.74741987    40.82222627 japan:aomori
DRR022329   SAMD00019235    2018-01-09 20:09:36 "941566"    Enkianthus perulatus    134.22217792    35.49441833 japan: tottori
DRR022330   SAMD00019235    2018-01-09 20:09:37 "941566"    Enkianthus perulatus    134.22217792    35.49441833 japan: tottori
DRR023219   SAMD00019539    2014-12-04 10:09:40 "79674" Macrobrachium rosenbergii   85.85766    20.18689    india: odisha, bhubaneswar, central institute of freshwater aquaculture
DRR023776   SAMD00020837    2015-02-14 09:39:34 "458240"    Arum concinnatum    25.14238    35.34194    greece: crete island
DRR024737   SAMD00022880    2015-11-24 10:21:11 "6161"  Dugesia japonica    137.04074133    35.752191505    japan:gifu, gujo, iruma river
DRR024738   SAMD00022878    2015-11-24 10:21:11 "6161"  Dugesia japonica    137.04074133    35.752191505    japan:gifu, gujo, iruma river
DRR029067   SAMD00024755    2015-01-26 10:28:37 "64793" Wasmannia auropunctata  -52.97901   5.067   
DRR029069   SAMD00024757    2015-01-26 10:28:36 "64793" Wasmannia auropunctata  -52.97901   5.067   
...

The rdrp_pos.tsv file is downloaded from the serratuSQL server, generated from the materialized view by the query below (for login details, see: https://github.com/ababaian/serratus/wiki/SQL-Schema):

-- One row per sequencing run: join geo coordinates to RdRp-positive
-- runs and SRA run metadata, deduplicating on run_id.
select distinct on (run_id)
    run_id,
    biosample_id,
    srarun.release_date,
    tax_id,
    scientific_name,
    coordinate_x,
    coordinate_y,
    from_text
from srarun_geo_coordinates sgc
inner join rdrp_pos using (run_id)
inner join srarun on (run_id = srarun.run)
order by run_id;

Task:

1) Add a REST API endpoint to api.serratus.io (https://github.com/serratus-bio/serratus-summary-api) that returns the rdrp_pos table data from the SQL server.

2) Add an API call to the Geo page to download the rdrp_pos table and pass the data to MapPlot.tsx for plotting.

This will let the map reflect real-time updates to the SQL server, keeping the data displayed on the Geo map current.
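One possible shape for the endpoint's database call, sketched here only as an illustration (the real serratus-summary-api implementation may differ): PostgreSQL can serialize the result set into a single JSON value with json_agg, so the API relays one row to the client.

-- A hedged sketch, not necessarily how serratus-summary-api will do it:
-- have PostgreSQL build one JSON array of row objects with json_agg.
SELECT json_agg(t) AS rdrp_pos_rows
FROM (
    -- Column subset shown for brevity; the full query is given above.
    SELECT run_id, coordinate_x, coordinate_y, from_text
    FROM srarun_geo_coordinates
    ORDER BY run_id
) AS t;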

lukepereira commented 1 year ago

Hey @victorlin, I noticed in this wiki page:

> When referenced tables are replaced, these views should be dropped and recreated. REFRESH MATERIALIZED VIEW doesn't seem to work in this case.

Since the query for this task depends on the materialized view srarun_geo_coordinates, there would be some downtime whenever we drop and recreate it. I was wondering, do you remember why the refresh command doesn't work?

I'm not sure how long the materialized view create query takes, but I don't think it's a blocker if we're okay with some downtime.

victorlin commented 1 year ago

Hey @lukepereira,

Unfortunately, I don't remember why REFRESH MATERIALIZED VIEW didn't work. A quick internet search suggests this might have been an RDS instance version problem.

I don't think downtime is a big problem, but to minimize downtime you can use renames instead. Something like:

CREATE MATERIALIZED VIEW new_view AS …;
ALTER MATERIALIZED VIEW live_view RENAME TO old_view;
ALTER MATERIALIZED VIEW new_view RENAME TO live_view;
DROP MATERIALIZED VIEW old_view;
ababaian commented 1 year ago

> Hey @lukepereira,
>
> Unfortunately, I don't remember why REFRESH MATERIALIZED VIEW didn't work. A quick internet search suggests this might have been an RDS instance version problem.
>
> I don't think downtime is a big problem, but to minimize downtime you can use renames instead. Something like:
>
> CREATE MATERIALIZED VIEW new_view AS …;
> ALTER MATERIALIZED VIEW live_view RENAME TO old_view;
> ALTER MATERIALIZED VIEW new_view RENAME TO live_view;
> DROP MATERIALIZED VIEW old_view;

This might not work. Last time I renamed a table, every reference to that table was automatically renamed across the different materialized views. So if there is a view that references another view, it might not update the way you expect. Exercise caution before dropping the table.
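One refinement on the rename approach, sketched under the assumption that PostgreSQL's transactional DDL applies here (view names are placeholders, and the dependent-view caveat above still needs checking before the DROP):

-- Build the replacement first; this is the slow step and can happen
-- while live_view keeps serving reads. Names are placeholders.
CREATE MATERIALIZED VIEW new_view AS …;  -- same definition as live_view

-- Swap atomically. PostgreSQL DDL is transactional, so readers see
-- either the old contents or the new ones, never a missing live_view.
BEGIN;
ALTER MATERIALIZED VIEW live_view RENAME TO old_view;
ALTER MATERIALIZED VIEW new_view RENAME TO live_view;
DROP MATERIALIZED VIEW old_view;
COMMIT;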

lukepereira commented 1 year ago

Thank you both for the info. I'll dig into the docs more to understand what the best options are.

lukepereira commented 1 year ago

Unfortunately, it looks like the cache isn't working as expected and always seems to be empty. This is a blocker for launch, since fetching the large number of rows (~93k) is too slow and expensive to run on every map page load. I opened an issue to investigate it here: https://github.com/serratus-bio/serratus-summary-api/issues/38

Some options to consider:

  1. Fix or implement the cache and launch with the geo query being cached
  2. Check whether a new materialized view is substantially faster than the current query with its inner joins
  3. Update the UI and database queries so we don't need to fetch and render all ~90k data points
    • can likely use the PostgreSQL extension PostGIS to run queries that spatially cluster points (see the sketch after this list)
  4. Continue using the static TSV approach
    • consider adding a build step that checks for and downloads an updated .tsv file from the API when bundling the code
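For option 3, a rough sketch of what PostGIS clustering could look like. This is illustrative only: it assumes the PostGIS extension is installed, builds point geometries on the fly from coordinate_x/coordinate_y (no geometry column exists in the current schema), and the cluster count of 500 is arbitrary.

-- Collapse the ~90k points into a fixed number of clusters server-side,
-- so the map only fetches cluster centroids.
SELECT
    cluster_id,
    count(*) AS n_points,
    ST_X(ST_Centroid(ST_Collect(geom))) AS centroid_x,
    ST_Y(ST_Centroid(ST_Collect(geom))) AS centroid_y
FROM (
    -- ST_ClusterKMeans is a window function: it labels every row with
    -- a cluster id without reducing the row count.
    SELECT ST_ClusterKMeans(geom, 500) OVER () AS cluster_id, geom
    FROM (
        SELECT ST_SetSRID(ST_MakePoint(coordinate_x, coordinate_y), 4326) AS geom
        FROM srarun_geo_coordinates
        WHERE coordinate_x IS NOT NULL AND coordinate_y IS NOT NULL
    ) AS pts
) AS clustered
GROUP BY cluster_id;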

I'll spend more time investigating the options and provide an update soon. We can also discuss this more in person, since I'll be in the lab later this week.

lukepereira commented 1 year ago

This change is live now, after fixing the SimpleCache for the geo endpoint (option 1 from the list above).

I will likely continue working on geo-related improvements, i.e. options (2) and (3), for performance and UX improvements that will also support better filtering capabilities later on. I'll also take a look at the existing geo tasks in the backlog.
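As a closing note on option (2), a minimal sketch of what that materialized view could look like, reusing the query from the issue description (rdrp_pos_geo is a placeholder name):

-- Precompute the geo join once so the API endpoint reads a single
-- relation per request instead of re-running the joins each time.
-- Refreshing it still has the drop/recreate caveat discussed earlier.
CREATE MATERIALIZED VIEW rdrp_pos_geo AS
SELECT DISTINCT ON (run_id)
    run_id,
    biosample_id,
    srarun.release_date,
    tax_id,
    scientific_name,
    coordinate_x,
    coordinate_y,
    from_text
FROM srarun_geo_coordinates
INNER JOIN rdrp_pos USING (run_id)
INNER JOIN srarun ON (run_id = srarun.run)
ORDER BY run_id;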