Use gene_symbol including logic for scanpy import with or without pandas routes.

pcm32 commented 3 years ago

Currently, the pandas route ignores var['gene_symbol'], even though the non-pandas route uses it. This change makes gene symbol identification work on both routes.

pcm32 commented 3 years ago

I haven't tested this, though. Do you have any working tests, @matthewspeir @maximilianh, that I could add to some GitHub Actions here? Thanks!

maximilianh commented 3 years ago

This should not fix your problem. usePandas is always false, isn't it?

I wonder if your problem has to do with the raw values. Have you already run with the -d option set?

pcm32 commented 3 years ago

You are right, it is not using the pandas section (I thought it was using it if pandas was installed).

What do you suggest to fix this? Ahhh, -d, let me check that.

pcm32 commented 3 years ago

Could it be due to:

INFO:root:Auto-detecting number type of /private/tmp/outdir/exprMatrix.tsv.gz
DEBUG:root:spooling back 0 saved rows
DEBUG:root:Yielding gene ENSDARG00000000001, sym ENSDARG00000000001, 96 fields
DEBUG:root:Matrix type is: float
INFO:root:Auto-detect: Numbers in matrix are of type 'float'
DEBUG:root:spooling back 1 saved rows
DEBUG:root:Yielding gene ENSDARG00000000001, sym ENSDARG00000000001, 96 fields
INFO:root:Auto-detected gene IDs type: symbols

?

Also, see attached the entire log with debugging, I removed some repetitive lines. small_UCSC_debug_atlas_gene_symbols.txt
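The log above reports the gene ID type as "symbols" even though the yielded gene is an Ensembl accession. As a rough sketch only (this is not cellBrowser's detection code; the helper names and the regular expression are assumptions made for illustration), a check along these lines would flag Ensembl-style IDs:

```python
import re

# Assumption: Ensembl gene IDs look like "ENS" + optional species letters
# + "G" + 11 digits, optionally followed by a ".version" suffix.
# Hypothetical helpers, not part of cellBrowser.
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def looks_like_ensembl_id(name):
    """Return True if name matches the Ensembl gene ID pattern above."""
    return bool(ENSEMBL_GENE_RE.match(name))


def guess_id_type(names):
    """Guess 'ensembl' if every name looks like an Ensembl ID, else 'symbols'."""
    names = list(names)
    if names and all(looks_like_ensembl_id(n) for n in names):
        return "ensembl"
    return "symbols"
```

With a check like this, a matrix whose "sym" field is ENSDARG00000000001 would be reported as carrying Ensembl IDs rather than symbols.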

pcm32 commented 3 years ago

Two comments back it was a different dataset I was trying...

maximilianh commented 3 years ago

Sorry, I have the impression that this means that this file simply does not contain gene symbols, is this correct?

On Fri, Apr 16, 2021 at 2:38 PM Pablo Moreno @.***> wrote:

Two comments back it was a different dataset I was trying...

maximilianh commented 3 years ago

If this is the case, there is a way to make it work, but I first want to confirm that this is true.

pcm32 commented 3 years ago

The file should have gene symbols. The structure of the AnnData object is:

AnnData object with n_obs × n_vars = 96 × 17500
    obs: 'age', 'developmental_stage', 'genotype', 'organism_part', 'organism', 'phenotype', 'post_analysis_well_quality', 'single_cell_quality', 'single_cell_well_quality', 'block', 'phenotype.1', 'single_cell_identifier', 'age_ontology', 'developmental_stage_ontology', 'genotype_ontology', 'organism_part_ontology', 'organism_ontology', 'phenotype_ontology', 'post_analysis_well_quality_ontology', 'single_cell_quality_ontology', 'single_cell_well_quality_ontology', 'block_ontology', 'phenotype_ontology.1', 'single_cell_identifier_ontology', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_genes', 'louvain_resolution_0.7', 'louvain_resolution_1.0'
    var: 'gene_symbols', 'chromosome', 'start', 'end', 'width', 'source', 'type', 'score', 'phase', 'gene_version', 'gene_name', 'gene_source', 'gene_biotype', 'mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg', 'markers_louvain_resolution_0.7', 'markers_louvain_resolution_0.7_filtered', 'markers_louvain_resolution_1.0', 'markers_louvain_resolution_1.0_filtered', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_tsne_perplexity_1', 'X_tsne_perplexity_10', 'X_tsne_perplexity_15', 'X_tsne_perplexity_20', 'X_tsne_perplexity_25', 'X_tsne_perplexity_30', 'X_tsne_perplexity_35', 'X_tsne_perplexity_40', 'X_tsne_perplexity_45', 'X_tsne_perplexity_5', 'X_tsne_perplexity_50', 'X_umap_neighbors_n_neighbors_10', 'X_umap_neighbors_n_neighbors_100', 'X_umap_neighbors_n_neighbors_15', 'X_umap_neighbors_n_neighbors_20', 'X_umap_neighbors_n_neighbors_25', 'X_umap_neighbors_n_neighbors_3', 'X_umap_neighbors_n_neighbors_30', 'X_umap_neighbors_n_neighbors_5', 'X_umap_neighbors_n_neighbors_50'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

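So you can see 'gene_symbols' under 'var'. But... aha, you are right, there is an issue with the gene symbols: the 'gene_symbols' column just repeats the Ensembl IDs from the index, while the actual symbols sit in 'gene_name'. The var contains: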

index	gene_symbols	chromosome	start	end	width	source	type	gene_version	gene_name	gene_source	gene_biotype	mito	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts	n_counts	n_cells	highly_variable	means	dispersions	dispersions_norm
ENSDARG00000000001	ENSDARG00000000001	9	34112067	34121839	9773	ensembl_havana	gene	6	slc35a5	ensembl_havana	protein_coding	False	15	10.391168	2.4328382	84.375	997.55206	6.9063063	997.55206	15	False	0.048443687103584876	-0.08688698682354548	0.20369561
ENSDARG00000000002	ENSDARG00000000002	9	34089156	34113209	24054	ensembl_havana	gene	8	ccdc80	ensembl_havana	protein_coding	False	4	6.692166	2.0402024	95.83333333333334	642.44794	6.466841	642.44794	4	True	0.038212520539707966	1.1615255578255084	1.0531479
ENSDARG00000000018	ENSDARG00000000018	4	15081385	15103696	22312	ensembl_havana	gene	9	nrf1	ensembl_havana	protein_coding	False	93	483.9896	6.1841273	3.125	46463.0	10.746433	46463.0	93	False	1.2718640571022153	1.6039805832956242	0.038439106
ENSDARG00000000019	ENSDARG00000000019	4	15011341	15059876	48536	ensembl_havana	gene	9	ube2h	ensembl_havana	protein_coding	False	27	55.15625	4.028138	71.875	5295.0	8.574707	5295.0	27	True	0.24869120688275356	1.5673604972588937	1.3292886
ENSDARG00000000068	ENSDARG00000000068	12	33484458	33537126	52669	ensembl_havana	gene	9	slc9a3r1a	ensembl_havana	protein_coding	False	40	60.479168	4.1186986	58.33333333333333	5806.0	8.66682	5806.0	40	True	0.24480807109196906	1.2071062863219768	1.0841622
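In the table above, 'gene_symbols' is identical to the index, while 'gene_name' holds values such as slc35a5. A minimal sketch of how one might spot this before exporting, assuming the var table is available as tab-separated text (the function name `pick_symbol_column` and the candidate list are hypothetical, not part of scanpy or cellBrowser):

```python
import csv
import io


def pick_symbol_column(tsv_text, id_col="index",
                       candidates=("gene_symbols", "gene_name")):
    """Return the first candidate column whose values are not just copies of the IDs.

    Returns None if no candidate column exists or all of them only repeat the IDs.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    for col in candidates:
        if not rows or col not in rows[0]:
            continue  # no data, or column missing from the table
        if any(row[col] != row[id_col] for row in rows):
            return col  # at least one value differs from the ID
    return None


# Reduced version of the table above (only the relevant columns).
sample = (
    "index\tgene_symbols\tgene_name\n"
    "ENSDARG00000000001\tENSDARG00000000001\tslc35a5\n"
    "ENSDARG00000000002\tENSDARG00000000002\tccdc80\n"
)
print(pick_symbol_column(sample))  # prints "gene_name"
```

The same comparison could be done directly on the var DataFrame; the plain-text version is used here only to keep the sketch free of extra dependencies.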

pcm32 commented 3 years ago

So this is an issue with our AnnData generation... sorry about this.

maximilianh commented 3 years ago

Great! :-)

How strange... what is "gene_name"? I haven't seen this field yet...

BTW: awesome that you have h5ad files now!! I wrote hundreds of lines of code to convert your text files to cell browser files, but then at some point gave up, forgot why. It would be cool to try again with the h5ad files...

maximilianh commented 3 years ago

Hi Pablo, can I close this pull request?

We're just releasing 1.0.1 which includes the fix for the "import scanpy" + "exit code 0" problem that you found recently.

maximilianh / cellBrowser

Use gene_symbol including logic for scanpy import with or without pandas routes. #217