phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License

Issue with logo plots #62

Open guillemsanchezsanchez1996 opened 1 year ago

guillemsanchezsanchez1996 commented 1 year ago

Hello everybody,

First of all, thanks a lot to the Conga team for creating this nice package, and for keeping the gd TCR "aficionados" in mind. I am struggling to run the make_tcr_logos.py script with a gd TSV file. I have edited the script to accept "human_gd" as an organism. The error is the following:

python scripts/make_tcr_logos.py --tcrs_tsvfile data/CD4_naive_1.tsv --outfile_prefix CD4_naive_1_2 --organism human_gd
Read 321 paired TCRs from data/CD4_naive_1.tsv
made: CD4_naive_1_2_tcr_logo_A.png
Traceback (most recent call last):
  File "/home/willy_s/conga/scripts/make_tcr_logos.py", line 68, in <module>
    make_tcr_logo_for_tcrs(
  File "/home/willy_s/conga/conga/tcrdist/make_tcr_logo.py", line 515, in make_tcr_logo_for_tcrs
    cmds = make_default_logo_svg_cmds(
  File "/home/willy_s/conga/conga/tcrdist/make_tcr_logo.py", line 376, in make_default_logo_svg_cmds
    b_junction_results = tcr_sampler.analyze_junction( organism, vb_gene, jb_gene,
  File "/home/willy_s/conga/conga/tcrdist/tcr_sampler.py", line 401, in analyze_junction
    assert 3*len(cdr3_protseq) == len(ncount)
AssertionError

Do you have any idea what is going on? I think the problem is with the delta sequence logo.

Guillem

PS: Here is a snapshot of my TCR file. I have not seen any strange sequences (e.g. a CDR3 with very few amino acids). [image]

sschattgen commented 1 year ago

Hi Guillem,

Thanks for your interest.

The assertion fails because the length of a CDR3 nucleotide sequence is not exactly 3 times the length of its amino acid sequence. It seems you have extra or missing nucleotides or amino acids somewhere in the table. You can use pandas and this bit of code to find which rows are causing the error.

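import pandas as pd

# Load the TCR table (tab-separated)
df = pd.read_table('your_table.tsv')

# Rows where a CDR3 nucleotide length is not 3x its amino acid length
bad_rows = df[
    (3*df.cdr3a.str.len() != df.cdr3a_nucseq.str.len()) |
    (3*df.cdr3b.str.len() != df.cdr3b_nucseq.str.len())
]
print(bad_rows)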
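
If it helps to see which chain is off and by how much, the same check can be wrapped in a small helper. This is only a sketch: the function name `find_length_mismatches` and the `*_nt_excess` columns are made up for illustration and are not part of conga, and the toy sequences are invented. It assumes the table uses the `cdr3a`, `cdr3a_nucseq`, `cdr3b` and `cdr3b_nucseq` columns from the snippet above.

```python
import pandas as pd


def find_length_mismatches(df, chains=("a", "b")):
    """Return rows where a CDR3 nucleotide length is not 3x the amino acid length.

    Hypothetical helper, not part of conga. Adds one column per chain,
    cdr3<chain>_nt_excess, holding (nucleotide length - 3 * amino acid length).
    """
    report = df.copy()
    bad = pd.Series(False, index=df.index)
    for chain in chains:
        aa_len = df[f"cdr3{chain}"].str.len()
        nt_len = df[f"cdr3{chain}_nucseq"].str.len()
        excess = nt_len - 3 * aa_len
        report[f"cdr3{chain}_nt_excess"] = excess
        # A missing sequence gives NaN here; treat it as a mismatch too.
        bad |= excess.fillna(1) != 0
    return report[bad]


# Invented toy table: the second cdr3b_nucseq has 8 nucleotides instead of 9.
toy = pd.DataFrame({
    "cdr3a": ["CAS", "CAV"],
    "cdr3a_nucseq": ["tgtgccagc", "tgtgccgtg"],
    "cdr3b": ["CAT", "CAW"],
    "cdr3b_nucseq": ["tgtgccacc", "tgtgcctg"],
})

mismatches = find_length_mismatches(toy)
print(mismatches)
```

On a real table, replace `toy` with the result of `pd.read_table(...)`. A nonzero value in an `*_nt_excess` column shows which chain has the surplus or deficit of nucleotides.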