Amino Acid sequence inputs

sidhomj commented 5 years ago

Can tcrdist take CDR3 amino acid sequences as inputs, or does it only take the complete nucleotide sequence covering CDR1, CDR2, and CDR3?

jeremycfd commented 5 years ago

Currently we only support the use of nucleotide sequence, as the nucleotide information is used for a number of things downstream (e.g., probability of generation calculations). There are ways to get around this in the current pipeline with some creative thinking, and it is something I've done for people on occasion, but it's not something currently supported by the version of the pipeline on GitHub.

sidhomj commented 5 years ago

And does it need the CDR1 and CDR2 regions as well?

phbradley commented 5 years ago

No, it can look those up from the gene IDs. And we can work on improving the I/O: as Jeremy said, the nucleotide sequence has been the traditional starting point, partly to guarantee consistency on gene names, and also because parts of the analysis (probability calculations, V(D)J junction analysis) rely on the nucleotides. But of course the TCRdist calculations don't need them, and we could make the trees without the junction bars (which show likely rearrangement scenarios) if nucleotide info isn't present. Do you have suggestions on input formats that only include amino acids? Is the VDJtools format pretty standard? We started work on parsing tools that would accept other formats, but it's hard to get motivated without a known user base!
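
To illustrate the gene-ID lookup described above, here is a minimal sketch. It is not tcr-dist's actual code: the table `V_GENE_LOOPS`, the functions `lookup_loops` and `complete_record`, the gene IDs, and all loop sequences are hypothetical placeholders chosen only to show the idea that a V gene ID plus a CDR3 amino acid sequence can be enough to fill in CDR1 and CDR2.

```python
# Hypothetical placeholder table: V gene id -> (CDR1, CDR2) amino acid strings.
# The ids and sequences are made-up stand-ins, not real germline data.
V_GENE_LOOPS = {
    "TRBV_EXAMPLE*01": ("AAAAA", "CCCCCC"),
    "TRAV_EXAMPLE*01": ("DDDDDD", "EEEEE"),
}


def lookup_loops(v_gene):
    """Return the (CDR1, CDR2) pair for a V gene id, or raise if it is unknown."""
    try:
        return V_GENE_LOOPS[v_gene]
    except KeyError:
        raise ValueError(f"unknown V gene id: {v_gene!r}") from None


def complete_record(v_gene, cdr3_aa):
    """Build a full record from only a V gene id and a CDR3 amino acid string."""
    cdr1, cdr2 = lookup_loops(v_gene)
    return {"v_gene": v_gene, "cdr1": cdr1, "cdr2": cdr2, "cdr3": cdr3_aa}


# The CDR3 string below is an illustrative example, not taken from any dataset.
record = complete_record("TRBV_EXAMPLE*01", "CASSLGQYEQYF")
```

Raising on an unknown ID, rather than silently skipping it, reflects the point about gene-name consistency: a lookup like this only works if the input names match the reference exactly.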

sidhomj commented 5 years ago

I think many people using these TCR analytic tools get their data from Adaptive Biotechnologies. Their data portal has examples of what the data input looks like. I would also generally have the input be the CDR3 amino acid sequence, as this is generally considered the most informative part of the sequence for specificity. For example, the GLIPH dataset only reports CDR3 amino acid sequences, so running TCRdist on that dataset would currently be difficult.

phbradley commented 5 years ago

Those are good suggestions. Unfortunately we definitely can't do without the V gene information-- that's a core component of the TCRdist measure. I have done a lot of work with the Adaptive files, and I agree that would be a good thing to try. The only issue I see is that (in my experience) the Adaptive file format is for unpaired data, which is not really the main focus of TCRdist (although we have been using it more and more for single-chain analyses as well). Is there a standard paired Adaptive format, do you know?
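
As a rough illustration of what an amino-acid-only input could require, here is a minimal sketch of a reader that insists on both a V gene and a CDR3 for every row. The column names `v_gene` and `cdr3_aa`, the function `read_aa_only`, and the example rows are assumptions for illustration; they are not the Adaptive file format and not an existing tcr-dist parser.

```python
import csv
import io

# Hypothetical column names for an amino-acid-only, tab-separated input.
REQUIRED = ("v_gene", "cdr3_aa")


def read_aa_only(handle):
    """Parse tab-separated rows, keeping only those with a V gene and a CDR3.

    Returns (kept_rows, skipped_count), where kept_rows is a list of
    (v_gene, cdr3_aa) tuples.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    kept, skipped = [], 0
    for row in reader:
        # Short rows yield None for absent fields, so normalise to "".
        v_gene = (row["v_gene"] or "").strip()
        cdr3 = (row["cdr3_aa"] or "").strip().upper()
        if v_gene and cdr3:
            kept.append((v_gene, cdr3))
        else:
            skipped += 1
    return kept, skipped


# Made-up example: the second data row has no V gene, so it is skipped.
example = "v_gene\tcdr3_aa\nTRBV_EXAMPLE*01\tCASSLGQYEQYF\n\tCASSPGTEAFF\n"
rows, skipped = read_aa_only(io.StringIO(example))
```

Counting skipped rows instead of dropping them silently makes it easier to see how much of a CDR3-only dataset would be unusable without V gene calls.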

jeremycfd commented 5 years ago

The best thing would be for Adaptive to actually provide the relevant sequence information rather than force people to use data that has been processed bioinformatically (and potentially sub-optimally). What happens when you try to compare old and newer datasets, and Adaptive has updated their reference between generating them? (IMGT has updates often, so we know it -should- be changing.) We also use the sequence information for things like inferring true clones, although that's generally more relevant to data produced by Sanger.

sidhomj commented 5 years ago

Got it. I definitely understand the hurdles. One could argue that, for broad use, the nuances of upstream processing may not change the conclusions of the analyses at this point.

phbradley / tcr-dist

Amino Acid sequence inputs #30