maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

Capture metadata in Loom files #78

Closed matthewspeir closed 5 years ago

matthewspeir commented 5 years ago

I can see when I run cbScanpy on a loom file that it picks out that there is metadata (note the '... storing' lines):

Computing percentage of mitochondrial genes
... storing 'construction_approach_label' as categorical
... storing 'construction_approach_ontology' as categorical
... storing 'disease_label' as categorical
... storing 'disease_ontology' as categorical
... storing 'donorkey' as categorical
... storing 'end_bias' as categorical
... storing 'ethnicity_label' as categorical
... storing 'ethnicity_ontology' as categorical
... storing 'genus_species_label' as categorical
... storing 'genus_species_ontology' as categorical
... storing 'input_nucleic_acid_label' as categorical
... storing 'input_nucleic_acid_ontology' as categorical
... storing 'librarykey' as categorical
... storing 'organ_label' as categorical
... storing 'organ_ontology' as categorical
... storing 'organ_part_label' as categorical
... storing 'organ_part_ontology' as categorical
... storing 'protocol' as categorical
... storing 'short_name' as categorical
... storing 'strand' as categorical
... storing 'chromosome' as categorical
... storing 'featuretype' as categorical
Remove cells with less than 10 and more than 15000 genes
Filtering cells
After filtering: Data has 581 samples/observations and 35112 genes/variables
Expression normalization, counts per cell = 10000

But when you look at the meta.tsv afterward, there's nothing in there but the standard 'louvain cluster', etc. columns. It's like it just disappears. Is there a way to capture this metadata?

maximilianh commented 5 years ago

Oh. I had no idea that cbScanpy could read loom files, and I didn't know that loom files store meta data or how to get it out of them. cbScanpy currently doesn't officially support loom: the help message doesn't list it as a supported file format.

That being said, maybe it should support it? Where is this loom file? How did you run cbScanpy on it?

A loom file should already store all the information we need, so maybe running on loom files with cbScanpy doesn't make a lot of sense? Does anyone use Loom files? I've never seen one in the wild. Could you work around it by using an alternative file format?

matthewspeir commented 5 years ago

Yeah, I think it's just that the latest scanpy version supports loom files? I just tried it to see if it would work, and it did, haha.

The loom file is on dev here: /hive/users/mspeir/cellbrowserTest/pancreas/new_metadata_test/matrix_files/Single_cell_transcriptome_analysis_of_human_pancreas.loom

Command: cbScanpy -e Single_cell_transcriptome_analysis_of_human_pancreas.loom -o cbScanpyOut_pancreas_aging_loom -n HCA_Pancreas_Aging_Loom -s

But does it really contain all of the information needed? It contains some metadata, but it doesn't have info like Louvain Cluster, UMI Count, etc. that your program outputs. It also doesn't include any coordinates, so you would have to do some clustering, right?

The only place I've seen loom files is from the HCA DCP Data Browser. Last I checked it's the default selected option.

maximilianh commented 5 years ago

OK, so it sounds like loom files ONLY contain some meta data, no coordinates or other algorithm results. So we would need some separate tool to get the meta data out of them? Is there some other way to get this meta data in another format?

matthewspeir commented 5 years ago

Would another tool be needed? It looks like scanpy is able to extract and store the metadata based on lines like: ... storing 'construction_approach_label' as categorical

Maybe this is stored in the resulting 'anndata.h5ad' file in the cbScanpy output directory? Is there a way I could check that?
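One way to check might be to open the file with the anndata package and list its per-cell columns. This is only a rough sketch: it assumes anndata is installed and that the path passed on the command line points at the 'anndata.h5ad' file in the cbScanpy output directory. The helper names (`missing_fields`, `obs_columns`) are made up for illustration and are not part of cbScanpy.

```python
import sys


def missing_fields(expected, columns):
    """Return the expected field names that are absent from columns, in order."""
    present = set(columns)
    return [name for name in expected if name not in present]


def obs_columns(h5ad_path):
    """List the per-cell (obs) column names stored in an .h5ad file."""
    # Imported lazily so that missing_fields() can be used without anndata.
    import anndata
    adata = anndata.read_h5ad(h5ad_path)
    return list(adata.obs.columns)


if __name__ == "__main__" and len(sys.argv) == 2:
    # A few of the fields from the "... storing" log lines in the first comment.
    expected = ["donorkey", "disease_label", "organ_label", "protocol"]
    columns = obs_columns(sys.argv[1])
    print("obs columns:", columns)
    print("missing from obs:", missing_fields(expected, columns))
```

If the loom metadata made it into the file, the field names should show up among the obs columns. Gene-level fields such as 'chromosome' and 'featuretype' would be expected under `adata.var` rather than `adata.obs`.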

Through the DCP there's not really an easy way to get the metadata yet, though I think that's going to change in the near future. For this specific loom file (Single_cell_transcriptome_analysis_of_human_pancreas.loom), I have the same information in csv format in the cells.csv, genes.csv, and expression.csv files in the directory /hive/users/mspeir/cellbrowserTest/pancreas/new_metadata_test/matrix_files/Single_cell_transcriptome_analysis_of_human_pancreas.csv .

maximilianh commented 5 years ago

I'll give it a quick go, but I don't think we should spend much more time on this. If the DCP exports meta data only in loom format, then we shouldn't worry about it: hardly anyone will be able to read that. This sounds more like a DCP problem than a cell browser problem.

The csv files don't seem to contain these meta data strings.

matthewspeir commented 5 years ago

The 'cells.csv' and 'genes.csv' files contain almost all of the metadata fields listed in my first note.

$ head -n1 cells.csv | tr "," "\n"
cellkey
genes_detected
donorkey
genus_species_ontology
genus_species_label
ethnicity_ontology
ethnicity_label
disease_ontology
disease_label
development_stage_ontology
development_stage_label
organ_ontology
organ_label
organ_part_ontology
organ_part_label
librarykey
input_nucleic_acid_ontology
input_nucleic_acid_label
construction_approach_ontology
construction_approach_label
end_bias
strand
short_name
protocol
bundle_uuid

And

$ head -n1 genes.csv | tr "," "\n"
featurekey
featurename
featuretype
chromosome
featurestart
featureend
isgene

maximilianh commented 5 years ago

Ohh! Sorry, I don't know what I was thinking. Nothing I guess.

Yes, in this case, what you'd do, and I see it's not obvious at all, is treat cells.csv as the meta data (no need for the gene metadata).

Your cbScanpy run gives you scanpy-related meta data.

You then combine both meta files using "cbTool metaCat".

This is a typical example of meta data combining, explained here: https://cellbrowser.readthedocs.io/combine.html
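To make the idea concrete, here is a small standard-library sketch of what combining two meta tables on the cell identifier amounts to. It is not how "cbTool metaCat" is implemented and makes no claim about its options or output. The function names, the sample rows and the 'cluster' column are invented for illustration; 'cellkey', 'donorkey' and 'organ_label' are taken from the cells.csv header above. Keeping only the cells present in the first table and leaving blanks for unmatched ones is an assumption of this sketch.

```python
import csv
import io


def read_table(handle, delimiter):
    """Read a delimited table; the first column is taken as the cell ID.

    Returns (field_names_without_id, {cell_id: {field: value}}).
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    id_field = reader.fieldnames[0]
    fields = list(reader.fieldnames[1:])
    rows = {row[id_field]: {f: row[f] for f in fields} for row in reader}
    return fields, rows


def combine_meta(base, extra):
    """Append the columns of extra to base, matching rows on cell ID."""
    base_fields, base_rows = base
    extra_fields, extra_rows = extra
    # Skip columns that the base table already has.
    new_fields = [f for f in extra_fields if f not in base_fields]
    header = ["cellId"] + base_fields + new_fields
    merged = []
    for cell_id, row in base_rows.items():
        other = extra_rows.get(cell_id, {})  # empty if the cell is unmatched
        merged.append(
            [cell_id]
            + [row[f] for f in base_fields]
            + [other.get(f, "") for f in new_fields]
        )
    return header, merged


# Hypothetical stand-ins for the cbScanpy meta.tsv and the DCP cells.csv.
meta_tsv = "cellId\tcluster\nc1\t0\nc2\t1\n"
cells_csv = "cellkey,donorkey,organ_label\nc1,d1,pancreas\nc3,d2,pancreas\n"

header, rows = combine_meta(
    read_table(io.StringIO(meta_tsv), "\t"),
    read_table(io.StringIO(cells_csv), ","),
)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

For real datasets, cbTool metaCat is the thing to use. The sketch is only meant to show why the cell identifiers in both files need to match.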

Should I document this better somehow? I don't know how or where.

Also, we need to document cbMarkerAnnotate somewhere... but that's unrelated.

matthewspeir commented 5 years ago

After your latest update, running cbScanpy on the same loom file fails with the following error:

... storing 'featuretype' as categorical
Traceback (most recent call last):
  File "/cluster/home/mspeir/miniconda3/bin/cbScanpy", line 10, in <module>
    sys.exit(cbScanpyCli())
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3748, in cbScanpyCli
    adata = cbScanpy(matrixFname, confFname, figDir, logFname, matrixOutFname)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 3532, in cbScanpy
    sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.4, multi_panel=True)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/scanpy/plotting/_anndata.py", line 622, in violin
    'Did not find {} in adata.obs_keys().'.format(key))
ValueError: Either use observation keys or variable names, but do not mix. Did not find n_counts in adata.obs_keys().
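The message says that 'n_counts' is not among the observation keys. A quick way to narrow down this kind of failure is to check which of the requested keys are actually present before plotting. The helper below is a hypothetical sketch, not the fix that went into cellbrowser. It assumes `obs_keys` is whatever `adata.obs_keys()` returns for the loaded data.

```python
def split_keys(wanted, obs_keys):
    """Split wanted keys into (found, missing) relative to obs_keys."""
    available = set(obs_keys)
    found = [key for key in wanted if key in available]
    missing = [key for key in wanted if key not in available]
    return found, missing


# The three keys passed to sc.pl.violin in the traceback above.
wanted = ["n_genes", "n_counts", "percent_mito"]
# Hypothetical obs keys for a dataset where n_counts was never computed.
found, missing = split_keys(wanted, ["n_genes", "percent_mito", "donorkey"])
print("found:", found)
print("missing:", missing)
```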

Command:

cbScanpy -o cbScanpyOut_pancreas_aging_loom_v2 -s -n HCA_Pancreas_Aging_Loom -e Single_cell_transcriptome_analysis_of_human_pancreas.loom 

Input file:

/hive/users/mspeir/cellbrowserTest/pancreas/new_metadata_test/matrix_files/Single_cell_transcriptome_analysis_of_human_pancreas.loom

maximilianh commented 5 years ago

Ah, darn, I thought this change wouldn't affect the import... looking...

matthewspeir commented 5 years ago

Thanks for looking into it, Max!

maximilianh commented 5 years ago

No need for thanking, Brian Lee is not listening and I messed it up... :)

maximilianh commented 5 years ago

OK, should be fixed, release 0.4.53, thanks!

matthewspeir commented 5 years ago

I think we can close this now. I can confirm that the meta.tsv contains the metadata from the input loom file.