mikessh / vdjtools-doc

THIS REPOSITORY IS DEPRECATED, FOR ISSUE TRACKER GO TO
https://github.com/mikessh/vdjtools
1 stars 2 forks source link

Disregard the wobble base #1

Open MoentyPython opened 8 years ago

MoentyPython commented 8 years ago

Hey,

thank you for this amazing tool, I have included it in out pipeline and it works perfectly. I just have one issue I'd like to bring up.

Sometimes, it is not the nucleotide (NT) sequence that we are interested in but the amino acid (AA) sequence. When I parse my IMGT files with VDJtools my output file lists all my sequences in order of highest expansion. However, due to the wobble base sometimes several NT sequences code the same AA sequence. When we now look at the top 20 AA-sequences we 1. dont have the actual count, as there might be several other NT sequences coding for it aswell, 2. might be missing AA sequences that would have made it into the top 20 if all there AA-sequences would have been summed up.

I have written my own script to deal with this, however it takes a long time to process 500k lines. I wonder if there is a option to disregard the NT-sequence or if it is possible to include something like that in future versions?

Best wishes,

Solaiman

mikessh commented 8 years ago

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

  1. If these NT sequences coding for the same AA are erroneous sub-variants, you can remove them using Correct utility, more details here. Note that correct has a certain error threshold (-r option) and will not scan variants that differ by more than 2 mismatches (however in case there is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3 -other-mismatch-> 4, those will be clustered together).
  2. You can try to pool by amino acid using PoolSamples -i aa ..., more details here. There is a small problem here - pool needs 2+ samples, but you can actually use the [metadata]() file, providing a single metadata file with a single entry per each sample (-m metadata.txt option). I'll fix this limitation in the next release.

Remember that prior to any of those operations you should convert your file to VDJtools format using Convert.

P.S. Issues should be reported to https://github.com/mikessh/vdjtools/issues, this is just a doc repository

Best, Mike

MoentyPython commented 8 years ago

Dear Mikhail,

thank you for your quick response! The metadata file trick did it for me, easy to write a script to automate it! Thank you.

I have another quick question regarding the OverlapPair http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#overlappair function. We would like to calculate the pearsonscore for pairs of samples, however, a PostDoc in our lab told me it is important for our analysis, that we include the non-overlapping sequences for our experimental setup. We read under the description of your clonotype scatterplot, that it only uses overlapping clonotypes for calculation.

Firstly, is there a specific reason you exclude the non-overlapping pairs, especially in the HSCT example given, it would be much more interesting to see how similar/dissimilar the entire repertoire of one patient is to the next.

Secondly, is there an option im missing, to include non-overlapping pairs?

Thank you!

2016-06-29 18:41 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Thanks for your feedback! There are two possible workarounds:

1) If these NT sequences coding for the same AA are erroneous sub-variants, you can remove them using Correct utility, more details here http://vdjtools-doc.readthedocs.io/en/latest/preprocess.html#correct. Note that correct has a certain error threshold (-r option) and will not scan variants that differ by more than 2 mismatches (however in case there is an intermediate variant, i.e. 1 -mismatch-> 2 -other-mismatch -> 3 -other-mismatch-> 4, those will be clustered together). 2) You can try to pool by amino acid using PoolSamples -i aa ..., more details here http://vdjtools-doc.readthedocs.io/en/latest/operate.html#poolsamples. There is a small problem here - pool needs 2+ samples, but you can actually use the metadata file, providing a single metadata file with a single entry per each sample (-m metadata.txt option). I'll fix this limitation in the next release.

Remember that prior to both steps you should convert your file to VDJtools format using Convert.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mikessh/vdjtools-doc/issues/1#issuecomment-229415081, or mute the thread https://github.com/notifications/unsubscribe/AQIyXErVKBZz7ZL6D9zdc3UBgJ5Me4VYks5qQqAlgaJpZM4JBTke .

mikessh commented 8 years ago

Dear Solaiman,

Its not very clear for me, what do you mean under "including non-overlapping clonotypes". If it is for R score computation, then including a set of, say 1000, points that have 0 frequency in one sample and non-zero frequency in other can bias the estimate. I mean people frequently use correlation in bioinformatics for random variables that can lead to non-robust behavior (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness), but in this case it can be wise to judge missing clonotypes separately based on frequency and number of non-overlapping clonotypes (see `freq12, etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4). Note that non-overlapping clonotype density is still shown in side bars (marginal distributions) of first plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output.

When you track clonotypes from the same individual (e.g. in HSCT), it is of course better to use plots like the second plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output. From statistical point of view, to bind together detected and non-detected clonotypes, as well as clonotype frequency one should have a good idea of whether a change in clonotype count is due to sampling (e.g. you sampled 1000 cells first time and 1010 cells second time) or clonal expansion (you observe 1000 cells first time and 2000 second time). However there is currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence threshold of 1 (http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).

MoentyPython commented 8 years ago

Hey,

thank you for the explanation! To my second question: The JoinSamples function is great! However, I meant if there is anyway to run CalcPairwiseDistances including the non-overlapping clonotypes for calculating the Pearson Score (R value)?

Solai

2016-06-30 18:56 GMT+02:00 Mikhail Shugay notifications@github.com:

Dear Solaiman,

Its not very clear for me, what do you mean under "including non-overlapping clonotypes". If it is for R score computation, then including a set of, say 1000, points that have 0 frequency in one sample and non-zero frequency in other can bias the estimate. I mean people frequently use correlation in bioinformatics for random variables that can lead to non-robust behavior ( https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Robustness), but in this case it can be wise to judge missing clonotypes separately based on frequency and number of non-overlapping clonotypes (see `freq12, etc in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#id4). Note that non-overlapping clonotype density is still shown in side bars (marginal distributions) of first plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output .

When you track clonotypes from the same individual (e.g. in HSCT), it is of course better to use plots like the second plot in http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#graphical-output. From statistical point of view, to bind together detected and non-detected clonotypes, as well as clonotype frequency one should have a good idea of whether a change in clonotype count is due to sampling (e.g. you sampled 1000 cells first time and 1010 cells second time) or clonal expansion (you observe 1000 cells first time and 2000 second time). However there is currently no good model to tell clonal expansions from sampling noise.

As for your last question - you can use JoinClonotypes with occurence threshold of 1 ( http://vdjtools-doc.readthedocs.io/en/latest/operate.html#joinsamples).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mikessh/vdjtools-doc/issues/1#issuecomment-229720639, or mute the thread https://github.com/notifications/unsubscribe/AQIyXIj-ecW6iiMcQ4RhhBZLkEb9P24Wks5qQ_U5gaJpZM4JBTke .