maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

Use gene_symbol including logic for scanpy import with or without pandas routes. #217

Closed pcm32 closed 3 years ago

pcm32 commented 3 years ago

Currently, when pandas is used, var['gene_symbol'] is ignored, even though it is used in the non-pandas route. This change makes gene_symbol identification work in both routes.
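To make the idea concrete, here is a minimal sketch of resolving the symbol column once and using it in either route. This is not the actual cellBrowser import code: the function names and the list of candidate column names are assumptions for illustration only. The `var` argument can be anything indexable by column name, such as `adata.var` or a plain dict.

```python
def find_symbol_column(var_columns,
                       candidates=("gene_symbol", "gene_symbols", "gene_name")):
    """Return the first candidate column present in var_columns, or None.

    The candidate names are an assumption, not cellBrowser's real list.
    """
    present = set(var_columns)
    for name in candidates:
        if name in present:
            return name
    return None


def gene_id_symbol_pairs(gene_ids, var, symbol_column=None):
    """Pair every gene ID with its symbol.

    Falls back to the ID itself when no symbol column is available, so the
    caller behaves the same whether or not the pandas route is taken.
    """
    gene_ids = list(gene_ids)
    if symbol_column is None:
        return [(gene_id, gene_id) for gene_id in gene_ids]
    symbols = list(var[symbol_column])
    return list(zip(gene_ids, symbols))
```

With an AnnData object this could be called as `gene_id_symbol_pairs(adata.var.index, adata.var, find_symbol_column(adata.var.columns))`.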

pcm32 commented 3 years ago

I haven't tested this though, do you have any working tests @matthewspeir @maximilianh that I could add to some GitHub actions here? Thanks!

maximilianh commented 3 years ago

This should not fix your problem. usePandas is always false, isn't it?

I wonder if your problem has to do with the raw values. Have you already run with the -d option set?

pcm32 commented 3 years ago

You are right, it is not using the pandas section (I thought it was using it if pandas was installed).

What do you suggest to fix this? ahhh, -d, let me check that.

pcm32 commented 3 years ago

Could it be due to:

INFO:root:Auto-detecting number type of /private/tmp/outdir/exprMatrix.tsv.gz
DEBUG:root:spooling back 0 saved rows
DEBUG:root:Yielding gene ENSDARG00000000001, sym ENSDARG00000000001, 96 fields
DEBUG:root:Matrix type is: float
INFO:root:Auto-detect: Numbers in matrix are of type 'float'
DEBUG:root:spooling back 1 saved rows
DEBUG:root:Yielding gene ENSDARG00000000001, sym ENSDARG00000000001, 96 fields
INFO:root:Auto-detected gene IDs type: symbols

?

Also, see the attached full debug log (I removed some repetitive lines): small_UCSC_debug_atlas_gene_symbols.txt

pcm32 commented 3 years ago

Two comments back, I was trying a different dataset...

maximilianh commented 3 years ago

Sorry, I have the impression that this means that this file simply does not contain gene symbols, is this correct?

maximilianh commented 3 years ago

If this is the case, there is a way to make it work, but I first want to confirm that this is true.

pcm32 commented 3 years ago

The file should have gene symbols. The structure of the annData is:

AnnData object with n_obs × n_vars = 96 × 17500
    obs: 'age', 'developmental_stage', 'genotype', 'organism_part', 'organism', 'phenotype', 'post_analysis_well_quality', 'single_cell_quality', 'single_cell_well_quality', 'block', 'phenotype.1', 'single_cell_identifier', 'age_ontology', 'developmental_stage_ontology', 'genotype_ontology', 'organism_part_ontology', 'organism_ontology', 'phenotype_ontology', 'post_analysis_well_quality_ontology', 'single_cell_quality_ontology', 'single_cell_well_quality_ontology', 'block_ontology', 'phenotype_ontology.1', 'single_cell_identifier_ontology', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_genes', 'louvain_resolution_0.7', 'louvain_resolution_1.0'
    var: 'gene_symbols', 'chromosome', 'start', 'end', 'width', 'source', 'type', 'score', 'phase', 'gene_version', 'gene_name', 'gene_source', 'gene_biotype', 'mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'markers_louvain_resolution_0.7', 'markers_louvain_resolution_0.7_filtered', 'markers_louvain_resolution_1.0', 'markers_louvain_resolution_1.0_filtered', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_tsne_perplexity_1', 'X_tsne_perplexity_10', 'X_tsne_perplexity_15', 'X_tsne_perplexity_20', 'X_tsne_perplexity_25', 'X_tsne_perplexity_30', 'X_tsne_perplexity_35', 'X_tsne_perplexity_40', 'X_tsne_perplexity_45', 'X_tsne_perplexity_5', 'X_tsne_perplexity_50', 'X_umap_neighbors_n_neighbors_10', 'X_umap_neighbors_n_neighbors_100', 'X_umap_neighbors_n_neighbors_15', 'X_umap_neighbors_n_neighbors_20', 'X_umap_neighbors_n_neighbors_25', 'X_umap_neighbors_n_neighbors_3', 'X_umap_neighbors_n_neighbors_30', 'X_umap_neighbors_n_neighbors_5', 'X_umap_neighbors_n_neighbors_50'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

so you can see 'gene_symbols' under 'var' and then, the var contains:

....but,... aha, you are right, there is an issue with the gene symbols:

| index | gene_symbols | chromosome | start | end | width | source | type | score | phase | gene_version | gene_name | gene_source | gene_biotype | mito | n_cells_by_counts | mean_counts | log1p_mean_counts | pct_dropout_by_counts | total_counts | log1p_total_counts | n_counts | n_cells | highly_variable | means | dispersions | dispersions_norm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSDARG00000000001 | ENSDARG00000000001 | 9 | 34112067 | 34121839 | 9773 | ensembl_havana | gene | | | 6 | slc35a5 | ensembl_havana | protein_coding | False | 15 | 10.391168 | 2.4328382 | 84.375 | 997.55206 | 6.9063063 | 997.55206 | 15 | False | 0.048443687103584876 | -0.08688698682354548 | 0.20369561 |
| ENSDARG00000000002 | ENSDARG00000000002 | 9 | 34089156 | 34113209 | 24054 | ensembl_havana | gene | | | 8 | ccdc80 | ensembl_havana | protein_coding | False | 4 | 6.692166 | 2.0402024 | 95.83333333333334 | 642.44794 | 6.466841 | 642.44794 | 4 | True | 0.038212520539707966 | 1.1615255578255084 | 1.0531479 |
| ENSDARG00000000018 | ENSDARG00000000018 | 4 | 15081385 | 15103696 | 22312 | ensembl_havana | gene | | | 9 | nrf1 | ensembl_havana | protein_coding | False | 93 | 483.9896 | 6.1841273 | 3.125 | 46463.0 | 10.746433 | 46463.0 | 93 | False | 1.2718640571022153 | 1.6039805832956242 | 0.038439106 |
| ENSDARG00000000019 | ENSDARG00000000019 | 4 | 15011341 | 15059876 | 48536 | ensembl_havana | gene | | | 9 | ube2h | ensembl_havana | protein_coding | False | 27 | 55.15625 | 4.028138 | 71.875 | 5295.0 | 8.574707 | 5295.0 | 27 | True | 0.24869120688275356 | 1.5673604972588937 | 1.3292886 |
| ENSDARG00000000068 | ENSDARG00000000068 | 12 | 33484458 | 33537126 | 52669 | ensembl_havana | gene | | | 9 | slc9a3r1a | ensembl_havana | protein_coding | False | 40 | 60.479168 | 4.1186986 | 58.33333333333333 | 5806.0 | 8.66682 | 5806.0 | 40 | True | 0.24480807109196906 | 1.2071062863219768 | 1.0841622 |

pcm32 commented 3 years ago

So this is an issue with our AnnData generation... sorry about this.
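The problem seen in the var rows pasted above (the gene_symbols column repeats the Ensembl IDs, while gene_name holds the real symbols) can be caught before import with a quick check. The helper below is a hypothetical sketch, not part of cellBrowser or scanpy; it only compares two sequences of values.

```python
def fraction_symbols_equal_ids(gene_ids, symbols):
    """Return the fraction of genes whose 'symbol' is just a copy of the ID.

    A value close to 1.0 suggests the symbol column was filled with
    identifiers rather than real gene symbols.
    """
    pairs = list(zip(gene_ids, symbols))
    if not pairs:
        return 0.0
    same = sum(1 for gene_id, symbol in pairs if str(gene_id) == str(symbol))
    return same / len(pairs)


# Values taken from the first two var rows shown in this thread.
ids = ["ENSDARG00000000001", "ENSDARG00000000002"]
gene_symbols = ["ENSDARG00000000001", "ENSDARG00000000002"]
gene_name = ["slc35a5", "ccdc80"]

symbols_copy_ratio = fraction_symbols_equal_ids(ids, gene_symbols)
names_copy_ratio = fraction_symbols_equal_ids(ids, gene_name)
```

On an AnnData object the same check could be run as `fraction_symbols_equal_ids(adata.var.index, adata.var["gene_symbols"])`; if the ratio is high, a column such as gene_name may be the better source of symbols.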

maximilianh commented 3 years ago

Great! :-)

How strange... what is "gene_name"? I haven't seen this field yet...

BTW: awesome that you have h5ad files now!! I wrote hundreds of lines of code to convert your text files to cell browser files, but then at some point gave up, forgot why. It would be cool to try again with the h5ad files...

maximilianh commented 3 years ago

Hi Pablo, can I close this pull request?

We're just releasing 1.0.1 which includes the fix for the "import scanpy" + "exit code 0" problem that you found recently.