How to use tcr-dist with only V gene-level information?

gchoonoo commented 5 years ago

Hello,

I currently only have V gene-level information and the first few rows of my "parsed_seqs_file" looks like this table below. My error comes from the "all_genes.py" script not able to find the gene in this file "alphabeta_db.tsv" because it doesn't have the allele information (example: TRAV26-1 not found, when expecting something like TRAV26-1*01). Is there a way to run tcr-dist without this information? Thank you so much for any advice.

id	epitope	subject	va_gene	ja_gene	vb_gene	jb_gene	cdr3a	cdr3a_nucseq	cdr3b	cdr3b_nucseq	va_reps	ja_reps	vb_reps	jb_reps	va_countreps	ja_countreps	vb_countreps	jb_countreps	cdr3a_quals	cdr3b_quals	va_genes	vb_genes	ja_genes	jb_genes	va_rep	ja_rep	vb_rep	jb_rep
345	Nef	5	TRAV26-1	TRAJ43	TRBV13	TRBJ1-5	CIVRAPGRADMRF	TGCATTGTGCGCGCGCCGGGCCGCGCGGATATGCGCTTT	CASSYLPGQGDHYSNQPQHF	TGCGCGAGCAGCTATCTGCCGGGCCAGGGCGATCATTATAGCAACCAGCCGCAGCATTTT	TRAV26-1	TRAJ43	TRBV13	TRBJ1-5	TRAV26-1	TRAJ43	TRBV13	TRBJ1-5	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	TRAV26-1	TRBV13	TRAJ43	TRBJ1-5	TRAV26-1	TRAJ43	TRBV13	TRBJ1-5
3871	p65	1	TRAV8-6	TRAJ30	TRBV28	TRBJ2-7	CAVSDKNRDDKIIF	TGCGCGGTGAGCGATAAAAACCGCGATGATAAAATTATTTTT	CASRPGTASYEQYF	TGCGCGAGCCGCCCGGGCACCGCGAGCTATGAACAGTATTTT	TRAV8-6	TRAJ30	TRBV28	TRBJ2-7	TRAV8-6	TRAJ30	TRBV28	TRBJ2-7	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	TRAV8-6	TRBV28	TRAJ30	TRBJ2-7	TRAV8-6	TRAJ30	TRBV28	TRBJ2-7
3740	p65	104	TRAV3	TRAJ12	TRBV7-9	TRBJ2-7	CATVSRMDSSYKLIF	TGCGCGACCGTGAGCCGCATGGATAGCAGCTATAAACTGATTTTT	CASSLIGEGTGWHQYF	TGCGCGAGCAGCCTGATTGGCGAAGGCACCGGCTGGCATCAGTATTTT	TRAV3	TRAJ12	TRBV7-9	TRBJ2-7	TRAV3	TRAJ12	TRBV7-9	TRBJ2-7	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	TRAV3	TRBV7-9	TRAJ12	TRBJ2-7	TRAV3	TRAJ12	TRBV7-9	TRBJ2-7
3742	p65	106	TRAV16	TRAJ26	TRBV3-1	TRBJ1-1	CADYYGQNFVF	TGCGCGGATTATTATGGCCAGAACTTTGTGTTT	CASSFQGYTEAFF	TGCGCGAGCAGCTTTCAGGGCTATACCGAAGCGTTTTTT	TRAV16	TRAJ26	TRBV3-1	TRBJ1-1	TRAV16	TRAJ26	TRBV3-1	TRBJ1-1	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99.99	TRAV16	TRBV3-1	TRAJ26	TRBJ1-1	TRAV16	TRAJ26	TRBV3-1	TRBJ1-1

My script run is this:

python2 run_basic_analysis.py --organism human --parsed_seqs_file parsed_seqs_file --make_fake_quals --no_probabilities > parsed_seqs_file.out

-Gabrielle

jeremycfd commented 5 years ago

TCRdist is using allele-level information to calculate probabilities of generation, among other things. So at the stage of processing a parsed_seqs_file, it's looking for chains that can be explicitly found in the current version of the annotation database, which is found in /tcr-dist/db/alphabeta_db.tsv This is one of the reasons it's generally best to parse/annotate your sequences using the TCRdist pipeline so that you aren't introducing different segment names, etc. Or you can annotate your data using whatever other pipeline you have, but use the TCRdist VDJB database for those annotations. If you can't do that, you can try editing the database file and then using the option --no_probabilities (otherwise I would expect your probabilities to be incorrect.)

Having said all of that... I don't quite understand what you mean by "I currently only have V gene-level information". The example data you posted includes J gene information as well.

Feel free to email me directly if you want to discuss further how to get an unsupported data format into TCRdist. It is possible to do, but it's important to take into consideration that some of the standard analyzes presented in the output are going to be biased as a result.

phbradley commented 5 years ago

Hi Gabrielle, What has worked for me in the past is to just add "*01" to all the gene names. There may still be a few that cause trouble, but I would give that a shot for starters. We are in the process of doing some refactoring which should help with this kind of I/O trouble, so I bet we can get this figured out. Let me know whether that fix works for you. Take care, Phil

ALGW71 commented 5 years ago

Hi,

Following on from this question, can you give us an example of how to use parsed seq file with the minimum necessary information in order to run? For example with just the CDR3 sequence, V and J gene - will this work?

Otherwise some more information on the minimum input columns needed to successfully run tcr-dist with limited info.

Many thanks, Alex

phbradley commented 5 years ago

Hi Alex, Sorry that things aren't clearer! This (ie, not starting from the beginning with complete read nucleotide sequences) is definitely one area that needs work. Just to clarify: do you have the nucleotide sequence of the CDR3 regions? Some of the scripts use that information for the graphics. But it's certainly not necessary for all of them... Take care, Phil

ALGW71 commented 5 years ago

Hi Phil,

No worries, it's an ambitious job.

I had moderate success with my raw output "target" sequences from MiXCR, however at short sequence lengths tcr-dist failed to find/classify the CDR3 (where MiXCR seems to be building the CDR3 sequence). I'll keep playing with the CDR3 nucleotide sequences as you have suggested, then see what I can get from my raw MiXCR data. Ultimately as I have the V, J and CDR3 I should be able to "build" a raw sequence which I can input directly into the paired_seq_file (using fake_probabilities and fake_beta or fake_alpha).

If I have success I will put the details here.

Many thanks, Alex

Owuorgpo commented 4 years ago

Hi Gabrielle, What has worked for me in the past is to just add "*01" to all the gene names. There may still be a few that cause trouble, but I would give that a shot for starters. We are in the process of doing some refactoring which should help with this kind of I/O trouble, so I bet we can get this figured out. Let me know whether that fix works for you. Take care, Phil

Was this refactoring introduced? I am faced with the same challenge with 10X data processed using the V(D)J T Cell Analysis with cellranger vdj pipeline in addition to mixcr software. Adding *01 to all genes presents a challenge due to the large dataset and presence of genes like TRAJ43;TRAJ34 or TRAV29/DV5 in the same clonotype. Not sure about the overall effect of editing /tcr-dist/db/alphabeta_db.tsv.

phbradley / tcr-dist

How to use tcr-dist with only V gene-level information? #31