phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License

Just make Logo plots starting from TCR data #54

Open erosix opened 1 year ago

erosix commented 1 year ago

Hello, I have some groups of single cell TCRs for which I would like to make Logo plots including both alpha and beta chains and the usage of V and J genes as it is done by Conga.

By any chance, is there a quick code snippet for making the Logo plots, so that I don't have to cherry-pick the parts I need from the conga code? :-D

phbradley commented 1 year ago

Hi there, I added a little script to the scripts/ directory: make_tcr_logos.py which reads a TSV-formatted file with TCR information and makes SVG and PNG-formatted TCR logos. You will need to pull the latest version to get it. You would run something like:

python ~/gitrepos/conga/scripts/make_tcr_logos.py --tcrs_tsvfile tcrs.tsv --outfile_prefix tcrs_test --organism human

where tcrs.tsv started something like this:

va  ja  cdr3a   cdr3a_nucseq    vb  jb  cdr3b   cdr3b_nucseq
TRAV1-2*01  TRAJ33*01   CAVSDSNYQLIW    tgtgctgtgagtgatagcaactatcagttaatctgg    TRBV6-4*01  TRBJ2-1*01  CASSDGQPNNEQFF  tgtgccagcagtgatggacagcctaacaatgagcagttcttc

Or you can just look at the source code to see what conga pieces are being used. Let me know if you have any questions. Hope that helps!
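For anyone building the input file by hand, here is a minimal sketch of producing a TSV with the eight columns shown above, using only the Python standard library. The column names and the single row are copied from the example in this comment; the helper name write_tcrs_tsv is made up for illustration and is not part of conga.

```python
import csv
import io

# Column names copied from the example header in this thread.
TCR_COLUMNS = ["va", "ja", "cdr3a", "cdr3a_nucseq",
               "vb", "jb", "cdr3b", "cdr3b_nucseq"]


def write_tcrs_tsv(rows, handle):
    """Write dicts (one per paired TCR) as tab-separated text to an open handle.

    Hypothetical helper for illustration only; not a conga function.
    """
    writer = csv.DictWriter(handle, fieldnames=TCR_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# The example row from this thread.
example_row = {
    "va": "TRAV1-2*01",
    "ja": "TRAJ33*01",
    "cdr3a": "CAVSDSNYQLIW",
    "cdr3a_nucseq": "tgtgctgtgagtgatagcaactatcagttaatctgg",
    "vb": "TRBV6-4*01",
    "jb": "TRBJ2-1*01",
    "cdr3b": "CASSDGQPNNEQFF",
    "cdr3b_nucseq": "tgtgccagcagtgatggacagcctaacaatgagcagttcttc",
}

# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
write_tcrs_tsv([example_row], buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write a real file instead, pass a handle from open("tcrs.tsv", "w", newline="") in place of the in-memory buffer.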

sschattgen commented 1 year ago

Hi! Phil beat me to the punch while I was putting it together, but I figured I'd share mine too. It's a nearly identical solution, but intended for use in an interactive session (e.g. a Jupyter notebook) with a tab-delimited file containing your TCR sequences. The required columns are defined by tcr_keys.

# %%

from conga.tcrdist.make_tcr_logo import make_tcr_logo_for_tcrs
from conga.tcrdist.tcr_distances import TcrDistCalculator
import pandas as pd

# gene table from the conga repo, restricted to the human entries
gene_file = '~/conga/conga/tcrdist/db/combo_xcr.tsv'
gene_df = pd.read_csv(gene_file, sep='\t')
gene_df = gene_df[gene_df.organism == 'human']

tcr_keys = ('va','ja','cdr3a','cdr3a_nucseq', 'vb','jb','cdr3b','cdr3b_nucseq')

def retrieve_tcrs(df):
    # build one ((alpha chain), (beta chain)) tuple per row
    tcrs = []
    arrays = [df[x] for x in tcr_keys]
    for va,ja,cdr3a,cdr3a_nucseq,vb,jb,cdr3b,cdr3b_nucseq in zip(*arrays):
        tcrs.append(((va, ja, cdr3a, cdr3a_nucseq.lower()),
                     (vb, jb, cdr3b, cdr3b_nucseq.lower())))
    return tcrs

tcrdist_calculator = TcrDistCalculator('human')

# %% read table of tcrs for the logo and clean up

tcr_df = pd.read_csv('logo_test.tsv', sep='\t')

# check the columns before selecting them, so a missing one gives a clear message
for col in tcr_keys:
    assert col in tcr_df, f'Need column {col}'

tcr_df = tcr_df.loc[:, list(tcr_keys)].drop_duplicates()

# allele information is required. add if missing
for gene in ['va','ja','vb','jb']:
    has_allele = tcr_df[gene].str.contains('*', regex=False)
    tcr_df[gene] = tcr_df[gene].where(has_allele, tcr_df[gene] + "*01")

# keep only TCRs whose V and J genes all appear in gene_df
tcr_df = tcr_df[(tcr_df.vb.isin(gene_df.id)) &
                (tcr_df.jb.isin(gene_df.id)) &
                (tcr_df.va.isin(gene_df.id)) &
                (tcr_df.ja.isin(gene_df.id))]

# %% pull tcrs from your df and make logos for alpha and beta chains

tcrs = retrieve_tcrs(tcr_df)

for chain in "AB":
    pngfile = f"testlogo{chain}_chain.png"
    make_tcr_logo_for_tcrs(tcrs, chain, 'human', pngfile,
                           tcrdist_calculator=tcrdist_calculator)
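To make the data structure that retrieve_tcrs hands to make_tcr_logo_for_tcrs a bit more concrete, here is a small pandas-free sketch of the same row-to-tuple conversion. It works on plain dicts rather than a DataFrame, and the function name retrieve_tcrs_from_rows is made up for illustration. The row is the example TCR from earlier in the thread, with the nucleotide sequences upper-cased to show the lower-casing step.

```python
TCR_KEYS = ("va", "ja", "cdr3a", "cdr3a_nucseq",
            "vb", "jb", "cdr3b", "cdr3b_nucseq")


def retrieve_tcrs_from_rows(rows):
    """Turn row dicts into ((alpha chain), (beta chain)) tuples.

    Illustrative stand-in for retrieve_tcrs above; not a conga function.
    """
    tcrs = []
    for row in rows:
        va, ja, cdr3a, cdr3a_nucseq, vb, jb, cdr3b, cdr3b_nucseq = (
            row[key] for key in TCR_KEYS
        )
        # Nucleotide sequences are lower-cased, as in the snippet above.
        tcrs.append(((va, ja, cdr3a, cdr3a_nucseq.lower()),
                     (vb, jb, cdr3b, cdr3b_nucseq.lower())))
    return tcrs


rows = [{
    "va": "TRAV1-2*01",
    "ja": "TRAJ33*01",
    "cdr3a": "CAVSDSNYQLIW",
    "cdr3a_nucseq": "TGTGCTGTGAGTGATAGCAACTATCAGTTAATCTGG",
    "vb": "TRBV6-4*01",
    "jb": "TRBJ2-1*01",
    "cdr3b": "CASSDGQPNNEQFF",
    "cdr3b_nucseq": "TGTGCCAGCAGTGATGGACAGCCTAACAATGAGCAGTTCTTC",
}]

tcrs = retrieve_tcrs_from_rows(rows)
alpha, beta = tcrs[0]
print(alpha)
print(beta)
```

Each element of tcrs is a pair of 4-tuples, alpha chain first and beta chain second, each holding the V gene, J gene, CDR3 amino acid sequence, and CDR3 nucleotide sequence.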

phbradley commented 1 year ago

Nice! Thanks Stefan!!!

erosix commented 1 year ago

Not just one but two solutions, great! Thank you both, looking forward to trying them out!

guillemsanchezsanchez1996 commented 1 year ago

Hello everybody,

First of all, thanks a lot, Conga team, for creating this nice and cool package, and for keeping the gd TCR "aficionados" in mind in your work. I am struggling to run the make_tcr_logos.py script with a gd TSV file. I have edited the script to accept "human_gd" as an organism. The error is the following:

python scripts/make_tcr_logos.py --tcrs_tsvfile data/CD4_naive_1.tsv --outfile_prefix CD4_naive_1_2 --organism human_gd
Read 321 paired TCRs from data/CD4_naive_1.tsv
made: CD4_naive_1_2_tcr_logo_A.png
Traceback (most recent call last):
  File "/home/willy_s/conga/scripts/make_tcr_logos.py", line 68, in <module>
    make_tcr_logo_for_tcrs(
  File "/home/willy_s/conga/conga/tcrdist/make_tcr_logo.py", line 515, in make_tcr_logo_for_tcrs
    cmds = make_default_logo_svg_cmds(
  File "/home/willy_s/conga/conga/tcrdist/make_tcr_logo.py", line 376, in make_default_logo_svg_cmds
    b_junction_results = tcr_sampler.analyze_junction( organism, vb_gene, jb_gene,
  File "/home/willy_s/conga/conga/tcrdist/tcr_sampler.py", line 401, in analyze_junction
    assert 3*len(cdr3_protseq) == len(ncount)
AssertionError

Do you have an idea about what is going on? I think the problem is with the delta sequence logo.

Guillem

PS: Here is a snapshot of my TCR file. I have not seen any strange sequences (e.g. CDR3s with very few amino acids).

[image]
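The failing assertion compares 3*len(cdr3_protseq) with len(ncount). Assuming ncount has one entry per CDR3 nucleotide (an assumption, not confirmed from the conga source), one cheap thing to rule out is a row whose CDR3 nucleotide sequence is not exactly three times the length of its amino acid sequence. The sketch below scans rows for such mismatches. It assumes the gd file uses the same column names as the example earlier in the thread, and find_length_mismatches is a made-up helper, not a conga function. Passing this check would not prove the input is fine; it only eliminates one possible cause.

```python
def find_length_mismatches(rows):
    """Return (row_index, chain, protein_len, nucleotide_len) for every CDR3
    whose nucleotide length is not 3x its amino acid length.

    Hypothetical helper; column names assumed to match the thread's example.
    """
    problems = []
    for index, row in enumerate(rows):
        for chain in ("a", "b"):
            protein = row["cdr3" + chain]
            nucleotides = row["cdr3" + chain + "_nucseq"]
            if 3 * len(protein) != len(nucleotides):
                problems.append((index, chain, len(protein), len(nucleotides)))
    return problems


# The example TCR from this thread: both chains are consistent.
good_row = {
    "cdr3a": "CAVSDSNYQLIW",
    "cdr3a_nucseq": "tgtgctgtgagtgatagcaactatcagttaatctgg",
    "cdr3b": "CASSDGQPNNEQFF",
    "cdr3b_nucseq": "tgtgccagcagtgatggacagcctaacaatgagcagttcttc",
}

# Same TCR with the last base of the second chain deliberately dropped,
# purely to show what a reported mismatch looks like.
bad_row = dict(good_row, cdr3b_nucseq=good_row["cdr3b_nucseq"][:-1])

good_result = find_length_mismatches([good_row])
bad_result = find_length_mismatches([bad_row])
print(good_result)
print(bad_result)
```

The rows can come from csv.DictReader(handle, delimiter="\t") on the input TSV, which yields one dict per line keyed by the header names.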