qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/
GNU General Public License v3.0

Insertions prevent consideration of alignments #5

Open jeremycfd opened 6 years ago

jeremycfd commented 6 years ago

Top-scoring alignments with scores well over the relevant threshold seem to be discarded when they also contain insertions. For instance, the following V alignment will not be included in the n_V_aligns for seq_index 4 in inference_logs.txt:

4;TRAV29/DV502;1128;-4;{52,114};{};{0,111,218,219};275;0;272

But the alignment will be included if the insertions are no longer present:

4;TRAV29/DV502;1128;-4;{};{};{0,111,218,219};275;0;272

Thus far I've only tested this with V alignments, but it seems that even a single insertion, regardless of where it is in the sequence, has this effect. I'm wondering if this is intentional or if I've missed some optional toggle to allow alignment insertions. I'm unsure of the broad implications that this would have, but for my purposes it prevents estimation of Pgen for sequences that do have high-quality alignments but also have putative insertions.
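For readers following along, here is a minimal Python sketch of how the two records quoted above differ. The field names are assumptions inferred from this thread (the fifth field holds the insertions, and the later comments mention insertion, deletion, and mismatch information); they should be checked against the header of an actual alignment file. The helpers `parse_alignment` and `is_gapped` are hypothetical and are not part of IGoR.

```python
import re

# Assumed field order, inferred from the example records in this thread.
# Verify against the header line of your own alignment file.
FIELDS = (
    "seq_index", "gene_name", "score", "offset", "insertions",
    "deletions", "mismatches", "length", "five_p_offset", "three_p_offset",
)


def parse_positions(text):
    """Turn '{52,114}' into [52, 114] and '{}' into []."""
    return [int(x) for x in re.findall(r"-?\d+", text)]


def parse_alignment(line):
    """Split one semicolon-separated alignment record into a dict."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for key in ("insertions", "deletions", "mismatches"):
        record[key] = parse_positions(record[key])
    return record


def is_gapped(record):
    """True if the record lists at least one insertion or deletion."""
    return bool(record["insertions"] or record["deletions"])


with_ins = parse_alignment(
    "4;TRAV29/DV502;1128;-4;{52,114};{};{0,111,218,219};275;0;272")
without_ins = parse_alignment(
    "4;TRAV29/DV502;1128;-4;{};{};{0,111,218,219};275;0;272")
print(is_gapped(with_ins), is_gapped(without_ins))
```

This only reads the records to flag which ones carry in/dels; it does not modify them.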

qmarcou commented 6 years ago

Hi Jeremy, thank you for pointing this out. This is intentional for now, as we do not have a probabilistic in/del model. Alignments with in/dels are currently discarded by the inference module as a safeguard, because it is not clear how they should be weighted. However, I realize how limiting this might be, and I will make a quick fix so that these sequences are taken into account, albeit without a proper probabilistic treatment of in/dels. Would that suit your needs?

jeremycfd commented 6 years ago

Hi @qmarcou! Wanted to follow up on this after thinking about it... I'm interested in using IGoR for estimating Pgen, but many of the sequences that exist for TCR are fairly low-quality Sanger sequences. We have fairly robust pipelines for annotating and finding the CDR3 regions despite the low-quality bases interspersed throughout the sequences, but I haven't yet figured out how to feed our parsed annotation/CDR3 info into your model. I've played around with simply setting alignment thresholds to 15 for all segments and deleting all the insertion, deletion, and mismatch information in the IGoR alignment outputs, simply because that allows Pgen to be calculated, but I'm concerned that some of that information is actually necessary to get an accurate Pgen estimate. Do you have any thoughts on this? Would it be better to just put in the CDR3 nucleotide sequences and set the ---thresh to something low so that it will capture V alignments? That's obviously easy to do, but I worry that by decreasing the amount of V and J sequence available for mapping, we would be inferring incorrect V and J segments...

Thanks!

qmarcou commented 6 years ago

Hi, your intuition is correct and you should probably refrain from editing the alignment results (they are used by the inference machinery, and simply removing potential insertions/deletions makes them nonsensical). If you want to avoid considering gapped alignments for now, the safest option would be to set the gap penalty to a very high value (e.g. 9999). The other solution is connected to the second issue you have opened, #7, and would be to directly provide the V/J templates as inferred from your upstream computational pipeline. Of course this would not be as precise, but it should still give a reasonable Pgen estimate for your sequences. The cleanest solution remains for me to correctly handle gapped alignments. I have made a bit of progress there, but the full probabilistic treatment of these in/dels is still quite far off. I'll keep working on this and keep you updated!
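To illustrate why a very high gap penalty rules out gapped alignments, here is a toy Smith-Waterman local aligner with a linear gap penalty. It is a sketch only and is not IGoR's aligner: the function `local_align`, the scoring values, and the sequences are all made up for illustration. With a small penalty the best alignment spans an inserted base by opening one gap; with a penalty of 9999 any gap would push the score below zero, so only ungapped alignments survive.

```python
def local_align(query, ref, match=5, mismatch=-14, gap=10):
    """Toy Smith-Waterman with a linear gap penalty.

    Returns (best_score, number_of_gap_positions) for one optimal
    local alignment. Scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align two bases
                score[i - 1][j] - gap,      # extra base in the query
                score[i][j - 1] - gap,      # extra base in the reference
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell and count gap positions.
    i, j = best_pos
    gaps = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
            gaps += 1
        else:
            j -= 1
            gaps += 1
    return best, gaps


# Made-up reference, and a read carrying one inserted base after position 12.
ref = "ACGTTGCAAGCTTAGGCTAACGT"
read = ref[:12] + "G" + ref[12:]

lenient = local_align(read, ref, gap=10)
strict = local_align(read, ref, gap=9999)
print("gap=10:  ", lenient)
print("gap=9999:", strict)
```

The trade-off is visible in the scores: the strict setting keeps only the ungapped part of the read, so the resulting alignment is shorter and lower-scoring than the gapped one.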