phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License

Empty clones_file #47

Closed by kkearns13 2 years ago

kkearns13 commented 2 years ago

Hi!

I am trying to generate a clones_file using setup_10x_for_conga.py but the .tsv file comes out empty even though there are around 1500 barcodes that appear more than once with TRA/TRB information. The contig files I have include TRA/TRB as well as TRG/TRD, but I'm not sure if that would result in errors with generating the clones_file.

The contig file I am using was generated from TRUST4 - here is a snippet of it: [image: Screen Shot 2022-07-25 at 4 17 47 PM]

As well as some barcodes that appear more than once in the file above: [image: Screen Shot 2022-07-25 at 4 29 13 PM]

And the code I used: python setup_10x_for_conga.py --filtered_contig_annotations_csvfile 10X_report_t.csv --organism human --no_kpca

Please let me know if you need other information. Any help is much appreciated! Thank you!

phbradley commented 2 years ago

Hi there, thanks for trying conga! It looks like the "raw_clonotype_id" column contains "None" in the snippets you've shown. Are there any non-None values? I think in a typical 10x contigs file that column would have unique identifiers for real clonotypes, and right now the code filters out lines that have None.

phbradley commented 2 years ago

If indeed the "raw_clonotype_id" column contains only None, or at any rate if None values should not be ignored, you could try putting the cell barcode into the "raw_clonotype_id" column. I think that should be pretty good, but it may drop a few more cells than in a normal 10x contigs file, because of the way "clonotypes" are treated in the code. If you do try that, I'd love to hear how it goes, for example if you could share the log file that the setup_10x_for_conga.py script produces...
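A minimal sketch of that workaround, not part of conga itself. It assumes the contigs file has a "barcode" column next to "raw_clonotype_id" (as in standard 10x contig annotations) and only overwrites values that are empty or the literal string "None". The function name and the example rows are made up for illustration.

```python
import csv
import io


def fill_clonotype_ids(rows):
    """Return copies of rows where a missing raw_clonotype_id is set to the barcode.

    Hypothetical helper: 'barcode' is an assumed column name.
    """
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        current = (row.get("raw_clonotype_id") or "").strip()
        if current in ("", "None"):
            row["raw_clonotype_id"] = row["barcode"]
        fixed.append(row)
    return fixed


# Made-up example data, only to show the shape of the input.
example_csv = """barcode,chain,v_gene,j_gene,raw_clonotype_id
AAAC-1,TRA,TRAV1-2,TRAJ33,None
AAAC-1,TRB,TRBV6-4,TRBJ2-1,None
GGGT-1,TRB,TRBV19,TRBJ2-7,clonotype5
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))
fixed_rows = fill_clonotype_ids(rows)

# Write the result back out in the same column order.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(fixed_rows)
```

Rows that already carry a real clonotype id are kept as they are, so the same sketch can be run on a file that mixes None and non-None values.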

kkearns13 commented 2 years ago

Thank you! I added the cell barcodes to the raw_clonotype_id columns and tried running the setup_10x_for_conga.py script again. I'm running into another error though:

Traceback (most recent call last):
  File "/mnt/BioHome/kkearns/conga/scripts/setup_10x_for_conga.py", line 81, in <module>
    make_10x_clones_file(
  File "/mnt/BioHome/kkearns/conga/conga/tcrdist/make_10x_clones_file.py", line 699, in make_10x_clones_file
    clonotype2tcrs, clonotype2barcodes = read_tcr_data(
  File "/mnt/BioHome/kkearns/conga/conga/tcrdist/make_10x_clones_file.py", line 154, in read_tcr_data
    vg = fixup_gene_name(l.v_gene, gene_suffix, expected_gene_names)
  File "/mnt/BioHome/kkearns/conga/conga/tcrdist/make_10x_clones_file.py", line 33, in fixup_gene_name
    assert vj in 'VJ'
AssertionError

I thought it might have been due to the vdj gene columns including a "*01" or something similar at the end of the gene names, so I tried removing that to make it look like the 10x file example but that didn't work either. Do you have an idea what might be causing this issue?

phbradley commented 2 years ago

Hi there, thanks for your patience! Assuming you are running with --organism 'mouse' or 'human' (i.e., a/b TCRs), then I'm guessing that there is a gene name in the 'v_gene' or 'j_gene' column that's a little wonky. In this case, it's complaining that the part of the gene name (character 4) that is usually V (e.g. TRAV1-2) or J (e.g. TRAJ33) is actually something else. Perhaps you could look at the TRUST4 output to see if there are any funny V or J gene names? This would be in a line that has 'TRA' or 'TRB' in the 'chain' column. I don't think the '*01' should matter, but it was a good idea to check that!
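One way to hunt for such names is sketched below. This is a standalone check, not the code inside fixup_gene_name: it simply flags TRA/TRB rows whose 'v_gene' does not have V as its fourth character, or whose 'j_gene' does not have J there. The function name and the example rows are hypothetical.

```python
def suspicious_gene_rows(rows, chains=("TRA", "TRB")):
    """Return (row_index, column, value) for gene names whose 4th character is unexpected."""
    flagged = []
    for i, row in enumerate(rows):
        if row.get("chain") not in chains:
            continue  # only a/b chains are checked here
        for col, expected in (("v_gene", "V"), ("j_gene", "J")):
            name = row.get(col) or ""
            # Index 3 is "character 4" when counting from 1, e.g. the V in TRAV1-2.
            if len(name) < 4 or name[3] != expected:
                flagged.append((i, col, name))
    return flagged


# Made-up rows for illustration.
example_rows = [
    {"chain": "TRA", "v_gene": "TRAV1-2", "j_gene": "TRAJ33"},
    {"chain": "TRB", "v_gene": "None", "j_gene": "TRBJ2-1*01"},
    {"chain": "TRD", "v_gene": "TRDV2", "j_gene": "None"},
]
flagged = suspicious_gene_rows(example_rows)
for index, column, value in flagged:
    print(f"row {index}: {column} = {value!r}")
```

An allele suffix such as '*01' does not affect this check, since only the fourth character is inspected.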

kkearns13 commented 2 years ago

Thank you for the quick response! The file has some gd TCR genes as well - would these be causing the issue? There are also some lines that say "None" which might also be weird?

phbradley commented 2 years ago

Huh, I don't think the gd TCR genes should be a problem, but it might be safer to exclude them. And yes, I think if there are 'None' in the v_gene or j_gene column that could cause trouble. At least, a None for the v_gene (or j_gene) in a line that is otherwise OK (productive, non-None CDR3, etc.). I guess that must not happen in standard 10x contigs files (or we haven't run into it yet). Anyhow, I would try removing the None-in-v_gene or None-in-j_gene lines and re-try it.
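A rough sketch of that filtering step, assuming the rows have already been read with csv.DictReader and that missing genes appear as the literal string "None" or as an empty field. The helper name is hypothetical, and dropping the gd chains is optional here, following the "might be safer" suggestion.

```python
def drop_bad_gene_rows(rows, keep_chains=None):
    """Drop rows with a missing v_gene or j_gene; optionally keep only some chains.

    keep_chains=None keeps every chain; pass ("TRA", "TRB") to exclude gd contigs.
    """
    kept = []
    for row in rows:
        if keep_chains is not None and row.get("chain") not in keep_chains:
            continue
        # Treat an absent or empty field the same as the literal string "None".
        if any((row.get(col) or "None") == "None" for col in ("v_gene", "j_gene")):
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration.
example_rows = [
    {"chain": "TRA", "v_gene": "TRAV1-2", "j_gene": "TRAJ33"},
    {"chain": "TRB", "v_gene": "None", "j_gene": "TRBJ2-1"},
    {"chain": "TRD", "v_gene": "TRDV2", "j_gene": "TRDJ1"},
    {"chain": "TRB", "v_gene": "TRBV19", "j_gene": ""},
]
all_chains = drop_bad_gene_rows(example_rows)
ab_only = drop_bad_gene_rows(example_rows, keep_chains=("TRA", "TRB"))
```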

phbradley commented 2 years ago

I just pushed a change to the code that should improve the handling of this case. If you want, you could pull the latest version and see how it goes...

kkearns13 commented 2 years ago

Thank you so much for the help! I removed the lines with "None" in the v_gene and j_gene columns and it looks like it worked!! The .tsv file has paired abTCRs (despite gdTCRs also being in the file), so I'll try the rest of the code to see if I can make the plots etc. I haven't pulled the latest version yet.

I'll also try to use the other flag to look at gdTCRs as well since that is what I'm more interested in ;) Is there any way to look at both abTCRs and gdTCRs at the same time (e.g., generate TCR clusters etc.)? I took a look at the fancy_conga_pipeline_with_batches_and_gammadelta_tcrs.ipynb, and I'm not sure if I understand it completely but it seems to me that there are two separate h5ad objects for abTCRs and gdTCRs?

Thank you again!

kkearns13 commented 2 years ago

By the way, this was one of the log files I got out:

unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
ab_counts: [((1, 1), 2401)]
old_unpaired_barcodes: 9681 old_paired_barcodes: 2644 new_stringent_paired_barcodes: 2401
Skipping TCRdist calculations and kernel PCA
If this all worked you should be able to pass /mnt/BioHome/kkearns/mtb20_dn/conga/conga_10x_formats/MTB20L_conga_format_tcrdist_clones.tsv as the --clones_file argument to run_conga.py
DONE
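For a long log, the repeated "unrecognized ..." lines can be tallied rather than read one by one. The sketch below is an illustration only: it assumes nothing about the log beyond the line format shown above, and the function name is hypothetical.

```python
from collections import Counter


def tally_unrecognized(log_text):
    """Count identical log lines that start with 'unrecognized '."""
    return Counter(
        line.strip()
        for line in log_text.splitlines()
        if line.strip().startswith("unrecognized ")
    )


# Excerpt copied from the log above.
log_text = """unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
unrecognized J gene: human TRDJ1*01
ab_counts: [((1, 1), 2401)]
"""
counts = tally_unrecognized(log_text)
for message, n in counts.most_common():
    print(f"{n:5d}  {message}")
```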