stemangiola / CuratedAtlasQueryR

Tidy R query API for the harmonised and curated CELLxGENE single-cell atlas.
https://stemangiola.github.io/CuratedAtlasQueryR/
GNU General Public License v3.0

Database update process #74

Open multimeric opened 1 year ago

multimeric commented 1 year ago

Count Data Update

  1. The existing data in Nectar is deleted using swift delete harmonised-human-atlas
  2. The new data is uploaded using:
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_0.2 --object-name original --segment-size 5000000000
    swift upload harmonised-human-atlas /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 --object-name cpm --segment-size 5000000000
  3. The REMOTE_URL is updated in the R package
  4. Local cache needs to be given appropriate permissions:
    chmod --recursive a+rX /vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2 /vast/projects/cellxgene_curated/splitted_DB2_data_0.2
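
The count data steps above could be wrapped in a small script. This is only a minimal sketch: the `run` helper, the `update_count_data` function and the `DRY_RUN` variable are hypothetical names that are not part of the repository, while the `swift` and `chmod` commands are copied from the steps above. By default it only prints the commands, so it can be checked before anything is deleted.

```bash
#!/usr/bin/env bash
# Sketch only: wraps the count data update steps listed above.
set -euo pipefail

# Hypothetical helper: prints the command when DRY_RUN=1 (the default here)
# and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_count_data() {
  local container="harmonised-human-atlas"
  local original="/vast/projects/cellxgene_curated/splitted_DB2_data_0.2"
  local scaled="/vast/projects/cellxgene_curated/splitted_DB2_data_scaled_0.2"
  local segment_size=5000000000

  # Step 1: delete the old data.
  run swift delete "$container"

  # Step 2: upload the new original and scaled (cpm) counts.
  run swift upload "$container" "$original" --object-name original --segment-size "$segment_size"
  run swift upload "$container" "$scaled" --object-name cpm --segment-size "$segment_size"

  # Step 3 (updating REMOTE_URL in the R package) is a manual code change.

  # Step 4: make the local cache readable by everyone.
  run chmod --recursive a+rX "$scaled" "$original"
}
```

Calling `update_count_data` prints the four commands; `DRY_RUN=0 update_count_data` would actually run them.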

Metadata File Update

  1. The old metadata is deleted using swift delete metadata
  2. The new metadata is uploaded using swift upload metadata /vast/projects/RCP/human_cell_atlas/metadata.0.2.2.parquet --object-name metadata.0.2.2.parquet
  3. The default remote_url for the metadata is updated to this new path
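
The metadata steps could be sketched the same way. Again, `run`, `update_metadata` and `DRY_RUN` are hypothetical names used for illustration; the object name is assumed to be the file name of the parquet file, as in the upload command above.

```bash
#!/usr/bin/env bash
# Sketch only: wraps the metadata update steps listed above.
set -euo pipefail

# Hypothetical helper: prints the command when DRY_RUN=1 (the default here)
# and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_metadata() {
  local file="$1"                 # path to the new metadata parquet file
  local object_name
  object_name="$(basename "$file")"

  # Step 1: delete the old metadata.
  run swift delete metadata

  # Step 2: upload the new metadata under its own file name.
  run swift upload metadata "$file" --object-name "$object_name"

  # Step 3 (updating the default remote_url) is a manual code change.
}
```

For example, `update_metadata /vast/projects/RCP/human_cell_atlas/metadata.0.2.2.parquet` prints the two commands from the list above; setting `DRY_RUN=0` would run them.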

stemangiola commented 1 year ago

Fabulous, this is good for now.

We should think about future updates, especially regarding the data. What disruption happens in the update process?

multimeric commented 1 year ago

There will be downtime no matter what, because we can't keep old data. What I might do is show the user a message when a download fails, asking them to check that they have the latest R package version, because a failure might mean that we have updated the data and they haven't updated the package.

stemangiola commented 1 year ago

Yes, but in the future we might also use AnnData for everything, doubling our capacity, and if the API is successful, ask on top of that for 5x the resources, which could hold 2 or 3 DB versions.