obrzts commented 6 years ago

Hi Quentin

I haven't managed to find any information about how IGoR handles the allelic variants present in its models. In the default models some IGHV, TRAV and TRBV genes have several alleles (up to 7), and all of them have non-zero probability. Obviously, having more than 2 alleles of one gene in one repertoire is not realistic (unless chimeras are considered).

It seems that I have to edit the models manually so that each gene has no more than 2 alleles. What is the proper way to do it? Should I just set the probabilities of all alleles except the two most frequent to zero and renormalize the remaining ones?

Best, Anna

qmarcou commented 6 years ago

Hi Anna, below are the answers to your questions, hopefully well structured. Please tell me if anything is unclear:

IGoR's handling of allelic variants.

Origin of IGoR's genomic templates.

The provided genomic templates originally come from the IMGT database, to which some variants found while constructing the generative model on the training dataset were appended. Because the people maintaining IMGT wanted to create an exhaustive database, the resulting list of alleles comprises many allelic variants that have been found here and there in the population. Because IGoR does not yet ship with on-the-fly inference of the allelic variants present in a dataset, it has to rely on these IMGT variants.

Number of allelic variants.

The biology.

In fact, some studies suggest that the TCR and BCR loci are quite dynamic and that gene duplication might be common. From Kidd et al., « The inference of phased haplotypes for the immunoglobulin H chain V region gene loci by analysis of VDJ gene rearrangements. », The Journal of Immunology (2012):

It is also now clear that the apparent heterozygosity that can be seen in genotypes is often a consequence of the carriage of multiple “alleles” on a single chromosome. Such duplication of Ig genes has been reported previously. By employing RFLP analysis with sequence-specific oligonucleotide probes, Sasso and colleagues (30) identified two separate loci for sequences that are now identified as IGHV3-30 and IGHV4-28, as well as for IGHV1-69 (31). They also claimed there can be multiple copies of the IGHV3-23 sequence on a single chromosome (32), and others have reported duplication of IGHV4-31 (33).

This would naturally lead to observing more than 2 alleles of the same gene.

IGoR

On top of the fact that several allelic variants could be present on the same chromosome, there are other limitations due to the sequencing process. Alleles of the same gene may differ by a single nucleotide, the sequencing process is error prone (i.e. it can introduce such single-nucleotide variations), and reads have finite length (meaning not all nucleotides of the gene/allele are observed). It is therefore not always possible to distinguish between two different alleles, and one can only assign posterior probabilities to the gene/allele identity. This leads to assigning non-zero probability to most alleles.
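To make the argument concrete, here is a minimal sketch of how a single read yields only a posterior over alleles. It is not IGoR's actual algorithm (IGoR sums over full recombination scenarios); it assumes a uniform per-base substitution error and ignores insertions, deletions and hypermutation. The allele names, sequences, priors and the `allele_posteriors` helper are all hypothetical.

```python
def allele_posteriors(read, start, alleles, priors, error_rate):
    """Posterior probability of each allele given one ungapped read.

    Assumes the read aligns to allele[start:start + len(read)] and that each
    base is miscalled independently with probability error_rate, uniformly
    towards the three other nucleotides (a simplifying assumption).
    """
    weights = {}
    for name, seq in alleles.items():
        segment = seq[start:start + len(read)]
        mismatches = sum(r != s for r, s in zip(read, segment))
        matches = len(read) - mismatches
        likelihood = (1 - error_rate) ** matches * (error_rate / 3) ** mismatches
        weights[name] = priors[name] * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Two hypothetical alleles that differ only at the last nucleotide.
alleles = {
    "geneX*01": "ACGTACGTAC",
    "geneX*02": "ACGTACGTAT",
}
priors = {"geneX*01": 0.5, "geneX*02": 0.5}

# The read covers the distinguishing position: geneX*01 is strongly favoured,
# but geneX*02 keeps a small non-zero probability because of sequencing errors.
full = allele_posteriors("ACGTACGTAC", 0, alleles, priors, error_rate=0.01)

# The read stops before the distinguishing position: the alleles cannot be
# told apart and the posterior falls back to the prior.
partial = allele_posteriors("ACGTACG", 0, alleles, priors, error_rate=0.01)

print(full)
print(partial)
```

In both cases every allele ends up with a non-zero posterior, which is why almost all alleles in the template list receive some usage probability after inference.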

Restricting the number of alleles used in IGoR.

Tuning gene and allele usage to your dataset.

It has been shown that gene/allele usage frequencies are the most variable components of the recombination machinery across individuals and sequencing technologies (see IGoR's paper for a more detailed discussion).

Before performing any computation on your dataset, it may be worth first running IGoR in inference mode and relearning only the gene usage frequencies for your dataset using --infer_only. Of course, one should be careful about which sequences are used to re-infer those frequencies, as gene usage frequencies might be modified by selection. The choice between non-productive and productive sequences, for instance, should be carefully considered.

Manually restricting the number of gene/alleles available for a dataset.

To restrict the genes/alleles to a limited list (e.g. to generate sequences with a particular VJ combination), the user can supply such a list via the -set_genomic General Command:

If the set of provided genomic templates is already fully contained (same name and same sequence) in the loaded model (default, custom, last_inferred), the missing ones will be set to zero probability keeping the ratios of the others. For instance providing only one already known genomic template will result in a model with the considered gene usage to be 1.0, all others set to 0.0. When using this option and introducing new/modified genomic templates, the user will need to re-infer a model since the genomic templates will no longer correspond to the ones contained in the reference models, the model parameters are thus automatically reset to a uniform distribution.
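The "set to zero, keep the ratios of the others" behaviour described above amounts to a renormalization. The sketch below illustrates it on a flat usage table; it is not IGoR's code and does not read or write IGoR's model files. The `restrict_usage` helper and the allele names are hypothetical, and a real model may store gene choices conditionally on other genes, which this sketch ignores.

```python
def restrict_usage(usage, keep):
    """Zero out every allele not in `keep` and renormalize the rest.

    The relative ratios between kept alleles are preserved.
    """
    total = sum(p for name, p in usage.items() if name in keep)
    if total == 0:
        raise ValueError("none of the kept alleles has non-zero probability")
    return {
        name: (p / total if name in keep else 0.0)
        for name, p in usage.items()
    }


# Hypothetical usage frequencies for three alleles of one gene.
usage = {"geneA*01": 0.5, "geneA*02": 0.3, "geneA*03": 0.2}

# Keeping two alleles preserves their 5:3 ratio.
two_kept = restrict_usage(usage, {"geneA*01", "geneA*02"})

# Keeping a single allele gives it a usage of 1.0.
one_kept = restrict_usage(usage, {"geneA*03"})

print(two_kept)
print(one_kept)
```

This is the same operation Anna proposes doing by hand, so there should be no need to edit the probabilities manually when the restricted templates are already known to the model.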

Thus, supplying a FASTA file containing only the desired V and J alleles will automatically restrict usage to these genes, with no need for the user to re-infer a model, provided these genes/alleles were already contained in the initial gene list.
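One way to build such a restricted FASTA file is to filter the existing templates rather than retyping them, so that names and sequences stay identical to those in the loaded model. The sketch below uses only the Python standard library. The helper names and the example records are hypothetical, and it assumes that each allele is identified by its full header line (check the header format of your own template files).

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            if name is None:
                raise ValueError("sequence data found before any header")
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_fasta(text, keep):
    """Return FASTA text containing only the records whose header is in `keep`.

    Sequences are copied unchanged (wrapped lines are joined into one line).
    """
    kept = [(n, s) for n, s in read_fasta(text) if n in keep]
    missing = set(keep) - {n for n, _ in kept}
    if missing:
        raise KeyError(f"alleles not found in templates: {sorted(missing)}")
    return "".join(f">{n}\n{s}\n" for n, s in kept)


# Hypothetical template file content.
templates = ">geneA*01\nACGT\nAC\n>geneA*02\nACGTAT\n>geneB*01\nGGCC\n"

restricted_fasta = filter_fasta(templates, {"geneA*01", "geneB*01"})
print(restricted_fasta)
```

Raising an error on a missing name is a deliberate choice here: a misspelled allele would otherwise silently drop out of the restricted list.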

In the near future I'd like to cover these notions in a more complete wiki/manual for IGoR, so please tell me if anything in this answer remains unclear!

Best,

Quentin