theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License

More information about scvelo.datasets #1239

Closed mortunco closed 5 months ago

mortunco commented 5 months ago

Hello.

First of all, thank you for including existing datasets in the package. They are extremely useful for benchmarking pipelines.

I am trying to understand the difference between the spliced/unspliced counts generated by velocyto and those generated by STARsolo (Velocyto mode). STARsolo counts a higher number of UMIs than velocyto. (This is certainly not your problem.) But to work out why the numbers differ, it would be good to know how these objects were generated, and I think it would be nice to see this kind of information on the API page, for instance on https://scvelo.readthedocs.io/en/stable/scvelo.datasets.pancreas.html#scvelo.datasets.pancreas. A Bash script would be ideal, but even a note such as "CellRanger XXX, Velocyto XXXX", or whichever tools were used, would greatly help.

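import numpy as np
import scvelo as scv

# Reference object shipped with scVelo
gold = scv.datasets.pancreas()
print("printing Gold")
print(gold)

# starsolo_velocity_anndata is a user-defined helper (not shown) that loads the STARsolo Velocyto output
star = starsolo_velocity_anndata("bandidasponce_mouse/E15_5/STAR/STAR/output/Velocyto/filtered")
star_common = star[star.obs.index[star.obs.index.isin(gold.obs.index)],]  # take barcodes that are common with gold
star_common = star[:, gold.var.index]  # keep gold's genes (note: this subsets `star`, not `star_common`)
print("printing STAR_common")
print(star_common)

# Single quotes inside the f-strings so this also parses on Python < 3.12
print(f"GOLD spl: {np.sum(gold.layers['spliced'])}")
print(f"GOLD uns: {np.sum(gold.layers['unspliced'])}")
print(f"STAR spl: {np.sum(star.layers['spliced'])}")
print(f"STAR uns: {np.sum(star.layers['unspliced'])}")
print(f"STAR amb: {np.sum(star.layers['ambigious'])}")  # layer name as spelled in the object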

printing Gold
AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'
printing STAR_common
View of AnnData object with n_obs × n_vars = 12874 × 27998
    var: 'gene_ids', 'feature_types'
    layers: 'spliced', 'ambigious', 'unspliced'
GOLD spl: 24670872.0
GOLD uns: 4801196.0
STAR spl: 163951060
STAR uns: 12189955
STAR amb: 5389307

For my case, is it possible to learn more about the pancreas data? Should I assume the h5ad object was generated as described in the publication's supplementary methods (cellranger 2.1 / velocyto 0.17.17)?

Thank you for your help and for actively maintaining the tool.

Best,

T.

WeilerP commented 5 months ago

Please have a look at the papers introducing the datasets and contact the authors directly if needed.