Confusion regarding ntCDR3 / set_CDR3_anchors

qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:

https://qmarcou.github.io/IGoR/

GNU General Public License v3.0


Confusion regarding ntCDR3 / set_CDR3_anchors #7

Open jeremycfd opened 6 years ago

jeremycfd commented 6 years ago

Hi @qmarcou,

I'm a bit confused about use of the ntCDR3 option in standard analysis and have had trouble inferring how exactly to use it in typical analysis by looking through the code. If for instance I have a set of CDR3 sequences that are additionally annotated with V and J information, does --ntCDR3 allow for alignment and downstream analysis (in particular, Pgen calculation) while maintaining the known V/J annotations? Any chance for a brief tutorial on this option included in the demo?

Thanks for your help!

qmarcou commented 6 years ago

Hi @jeremycfd! For now this is not quite possible: the --ntCDR3 option only accepts CDR3 nt sequences, without knowledge of the associated V/J. Of course this is something I plan to add support for in the very near future, however I did not want to postpone the v1.2.0 release. I'll probably add a small patch for this problem in the next few days, I just need to figure out how to include this in the pipeline in a clean way.

jeremycfd commented 6 years ago

Ah, hrm... I'm still a bit confused on the current implementation of --ntCDR3. I did test out putting in just CDR3 sequences, but I had to decrease "thresh" for it to work. Should I be flagging --ntCDR3 somewhere if I am only using cdr3 sequences?

Thanks!

qmarcou commented 6 years ago

You mean the alignment threshold? It makes perfect sense to lower the alignment score threshold, since the number of observable genomic nucleotides is much lower in CDR3 sequences than in full read sequences. This is something I did not think about before posting the release, thanks for pointing it out! In theory you should only have to flag --ntCDR3 at the alignment stage, as this option simply uses the genomic anchor indices as alignment offsets. The inference/evaluation should be blind to the type of sequences you use, as long as alignments have been provided.
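To illustrate why the threshold needs to drop, here is a minimal Python sketch, not IGoR's actual code. It assumes a toy V gene, a hypothetical anchor index, an ungapped match/mismatch score with made-up values, and a sign convention in which the offset is the position of the gene's first nucleotide in read coordinates. All names and numbers are invented for illustration.

```python
# Toy illustration only: sequences, scores, and names are hypothetical,
# and the scoring is a plain ungapped match/mismatch sum, not IGoR's aligner.

def score_with_known_offset(gene, read, offset, match=5, mismatch=-14):
    """Score a read against a gene at a fixed offset.

    `offset` is the position of the gene's first nucleotide in read
    coordinates (an assumed convention), so read[i] faces gene[i - offset].
    Only positions where read and gene overlap contribute to the score.
    """
    score = 0
    for i, nt in enumerate(read):
        g = i - offset
        if 0 <= g < len(gene):
            score += match if gene[g] == nt else mismatch
    return score

# Toy V gene whose CDR3 anchor (the conserved "TGT") starts at index 10.
v_gene = "ACGTACGTACTGTGCCAGC"
v_anchor_index = 10

# A full read covers the whole V gene, then continues into the junction.
full_read = v_gene + "TCCGGGACA"
# A CDR3-only read starts exactly at the anchor.
cdr3_read = "TGTGCCAGC" + "TCCGGGACA"

# For the CDR3 read the offset is known from the anchor: no search is needed.
full_score = score_with_known_offset(v_gene, full_read, offset=0)
cdr3_score = score_with_known_offset(v_gene, cdr3_read, offset=-v_anchor_index)

threshold = 50  # arbitrary cut-off chosen for this toy example
print(full_score, full_score >= threshold)
print(cdr3_score, cdr3_score >= threshold)
```

In this toy case both reads match the gene perfectly, yet the CDR3 read only overlaps 9 genomic nucleotides instead of 19, so it falls below a threshold that the full read passes.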

qmarcou commented 6 years ago

Following up on this: I've automated the decrease in alignment threshold for V and J when the --ntCDR3 option is used (they are now set to 0, since the alignment offsets are known and the coverage of V and J is short in CDR3 sequences). As for restricting the V/J usage, I have started looking at a clean solution, but it turned out to be complicated to implement. I'll probably start by implementing a dirtier solution (at the expense of memory usage though...).