phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License
80 stars 18 forks source link

assert exists(clones_file) #20

Closed afcmalone closed 2 years ago

afcmalone commented 3 years ago

Hi,

your recent paper was great!

I am failing to get output using run_conga.py.

I get an error: assert exists(clones_file) AssertionError

seems to run ok intially, then: Session information updated at 2021-08-24 14:43

reading: /home/malone/Andrew/conga_data/Bx68/Bx68B1_filtered_feature_bc_matrix.h5 of type 10x_h5 Variable names are not unique. To make them unique, call .var_names_make_unique. total barcodes: 27073 (27073, 33538) Traceback (most recent call last): File "conga/scripts/run_conga.py", line 367, in adata = conga.preprocess.read_dataset( File "/home/malone/python_code/conga/conga/conga/preprocess.py", line 296, in read_dataset assert exists(clones_file) AssertionError

command line is: python conga/scripts/run_conga.py --graph_vs_graph --gex_data /home/malone/Andrew/conga_data/Bx68/Bx68B1_filtered_feature_bc_matrix.h5 --gex_data_type 10x_h5 --clones_file /home/malone/Andrew/mgi_vdj/Bx68_B1_vdj_t_outs/filtered_contig_annotations_tcrdist_clones.tsv --organism human --outfile_prefix tmp_hs_conga

Any ideas why this is occurring??

Thanks

phbradley commented 3 years ago

Hi Andrew, thanks for trying CoNGA and thanks in advance for your patience as we get this figured out. It looks like it can't find the clones file. Can you double-check that it exists at the location you are passing in (/home/malone/Andrew/mgi_vdj/Bx68_B1_vdj_t_outs/filtered_contig_annotations_tcrdist_clones.tsv)? Note that there are two steps: running setup_10x_for_conga.py which makes the clones file and then running run_conga.py. Partly for historical reasons and partly so that you can check out the TCR clonotype parsing and confirm that they seem reasonable. The README has examples of this near the top.

afcmalone commented 3 years ago

Thank you for your quickly reply. I realized this mistake as soon as i messaged you! The pipeline works now, however, i have run into some other issues. My data is very sparse particularly on the TCR clone side of things. Is there a minimum number of clones or sizes of clones needed to get a complete set of results? e.g, I have many different clones but most clones are n=1 or 2. Thanks

sschattgen commented 3 years ago

Hi Andrew. The size of the clone is irrelevant since each clone is being represented by a single cell from the lineage. Based on what you're describing, I suspect the issue is the conga cluster sizes in your results are below the threshold for generating the full graphical output for the graph-vs-graph analysis. The default is set to 5 but can be specified using --min_cluster_size argument. Since you said it's a small dataset you could try dropping it to the 2 (the minimum size that can be used) and see if that works. If not, we may need a little more information about the error.

afcmalone commented 3 years ago

This worked, thank you.

afcmalone commented 3 years ago

Conga is a wealth of information! - One more question. Is the conga_final.h5ad object in a format that can be used in scanpy like a regular single cell RNA-seq anndata object. e.g. to make GEX violin plots etc?

phbradley commented 3 years ago

I would say yes, with the caveat that each clonotype has been reduced to a single representative cell. So for expanded clones there is some loss of information. If you add the option --average_clone_gex to the run_conga.py script then the GEX information for that representative cell will be the average of the GEX information for all the cells in the clone. The _final.h5ad object has both the raw counts data (normalized to sum to 10000 for each cell and then log-transformed) in the adata.raw.X matrix and also the scaled and regressed GEX data for just the highly variable genes in the adata.X matrix. Anyhow, it should be basically OK for scanpy GEX type analyses.

Thanks for your kind words re conga. Don't forget to try the --all option to run_conga.py if you haven't already, to get even more output!

sschattgen commented 3 years ago

Just echoing Phil here, but the conga_final.h5ad object is compatible with scanpy's plotting and statistical tools, though we haven't exhaustively tried all of them. It's possible to generate and save an object with all the cells prior to reducing it down to single-cell per clonotype for conga analysis so in the end, you would have two objects that were processed through the same pipeline which might be preferable. This is currently not possible using the run_conga script but it can be easily done interactively. Here is an example of how to you could make the two objects through the conga workflow that can be used with scanpy.

conga_full_and_reduced_example.zip

afcmalone commented 3 years ago

This is working great for me now. really powerful tool! i am now trying to include ADT/CITEseq data into the pipeline by adding --include_protein_features

my clone data files are from 10X VDJ T cell V2.

I ran: ../anaconda3/envs/conga_new_env/bin/python conga/scripts/run_conga.py --graph_vs_graph --gex_data ../Andrew/conga_data/hs1_mtx --gex_data_type 10x_mtx --clones_file ../Andrew/conga_data/citeseq_t_contigs_filtered_contig_annotations_tcrdist_clones.tsv --organism human --outfile_prefix ../Andrew/conga_data/tmp_hs_citeonly_conga_clone5 --min_cluster_size 5 --include_protein_features

however i am getting an error: reading: ../conga_data/hs1_mtx of type 10x_mtx ../python_code/conga_conda/conga/conga/preprocess.py:221: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future. if adata.isview: # ran into trouble with AnnData views vs copies total barcodes: 6978 (6978, 23215) Traceback (most recent call last): File "conga/scripts/run_conga.py", line 374, in suffix_for_non_gene_features = args.suffix_for_non_gene_features, File "../python_code/conga_conda/conga/conga/preprocess.py", line 296, in read_dataset assert exists(clones_file) AssertionError

Also, I would like to run my 10X VDJ_B_cell (Ig) data using this pipeline. You mention this functionality - 2020-09-04: (EXPERIMENTAL) Not sure how much hacking in need to do to the code or whether i would be successful.

Thank you in advance,

phbradley commented 3 years ago

Hi Andrew, I think we can get both of these to work; but, you are right-- they are experimental and haven't been tested very much.

First, though, it looks like it can't find the clones file (same error as before), maybe you can double-check the location for the clones file?

For the B cell stuff, you might look first at the example in the README and see if that can be tweaked to work for your case: https://github.com/phbradley/conga#human-melanoma-b-cell-dataset

Note that in that example, we are defining "clonotypes" for B cells using a bcrdist threshold of 50.

afcmalone commented 3 years ago

really sorry but i pasted the previous error (now resolved)

error with --include_protein_features : Traceback (most recent call last): File "/home/malone/anaconda3/envs/conga_new_env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/malone/python_code/conga_conda/conga/scripts/run_conga.py", line 518, in adata, compare_distance_distributions=True) File "/home/malone/python_code/conga_conda/conga/conga/preprocess.py", line 593, in calc_X_pca_gex_including_protein_features prot_mask = np.array(adata.raw.var[ft_varname] == 'Antibody Capture') File "/home/malone/anaconda3/envs/conga_new_env/lib/python3.6/site-packages/pandas/core/frame.py", line 2906, in getitem indexer = self.columns.get_loc(key) File "/home/malone/anaconda3/envs/conga_new_env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc raise KeyError(key) from err KeyError: None

Maybe the problem is because I am reading in hs1_mtx files to the conga_run pipeline?

Thanks,

sschattgen commented 3 years ago

10X matrix files are supported input for the pipeline so this is not the issue. You are getting this error because nothing is assigned as 'Antibody Capture' in the features.tsv file. The typical cellranger output should include this information but I suspect here that you exported the raw count matrix from a Seurat object using the example in the README file. If this was the case, the current example was intended for GEX only and would cause this error. You would need to write out a matrix containing both the gene and antibody labeling counts, and also make sure these are properly labelled in the features.tsv file. Here's an example of how you can do this with the write10xCounts function in DropletUtils:

## exporting gene expression and antibody labeling from Seurat to CoNGA
require(Seurat)
require(DropletUtils)
hs1 <- readRDS('./my_seurat_obj.rds')

## concatenate the GEX and antibody labeling count matrices into one matrix

# here, ADT is the antibody labeling assay slot
count_matrix <- rbind(hs1@assays$RNA@counts, hs1@assays$ADT@counts)

# create vector of feature type labels 
features <- c(
  rep("Gene Expression", nrow(hs1@assays$RNA@counts)), 
  rep("Antibody Capture", nrow(hs1@assays$ADT@counts))
  )

# write out              
write10xCounts( count_matrix, 
                path = './raw/seurat_export/',
                gene.id = rownames(count_matrix),
                gene.symbol = rownames(count_matrix),
                barcodes = colnames(count_matrix),
                gene.type = features,
                version = "3")
afcmalone commented 3 years ago

Thanks Stefan, this works nicely. I have been using an integrated seurat object (5 samples) to generate the hs1_mtx as input into the conga_run pipeline. In your updates you added a /merge_samples.py method to run conga on multiple samples. What I did was to manually combine the clones file correcting the barcodes to match the integrated seurat object/hs1_mtx files. Do you think I should use 'hs1@assays$RNA@data' instead of 'hs1@assays$ADT@counts' ? These this matrix is normalized, or would this make no difference once the conga pipeline is run?

Thanks, Andrew

sschattgen commented 3 years ago

I would stick with the count matrix. It will be normalized and scaled in the pipeline.

afcmalone commented 3 years ago

I have been doing this. Thanks again for all your answers.

phbradley commented 3 years ago

Thanks for being a beta tester! And please do follow up if you run into trouble or have suggestions for improvements/new features.