qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/

Add built-in support for mouse IgH analysis #25

Open adugduzhou opened 5 years ago

adugduzhou commented 5 years ago

Hi Quentin, Could you add a database to IGoR so it can support mouse IgH analysis? In addition, I have some mouse IgH sequencing data; could you show me how to build a customized mouse IgH model step by step? Highly appreciated!

Best Zhou

qmarcou commented 5 years ago

Dear Zhou, Regarding your first question: sadly, I currently do not have access to any mouse IgH dataset and would not have the time to build a model myself...

Regarding building a model for your data: you might already find most of the information of interest in the documentation by carefully reading the Alignments, Inference and Advanced usage sections.

Alignments

The first step is to align your sequences. This has proven to be a critical step in obtaining accurate models, and I strongly encourage you to optimize and double-check it by adjusting the different thresholds and parameters. Most importantly, I strongly recommend using your knowledge of the sequencing process (e.g. primer positions) to constrain the alignment offsets (global offsets or template-specific offsets) via the dedicated command line options.
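
For illustration, such a run could look like the sketch below. The file names, thresholds and offset bounds are placeholders to adapt to your data, and the exact sub-option names should be double-checked against the Alignments section of the documentation:

# Read the input sequences into the working directory (paths are placeholders)
igor -set_wd /path/to/wd -batch mouse_igh -read_seqs my_igh_reads.txt

# Align against custom mouse IgH genomic templates, constraining the V alignment
# offsets with the dedicated sub-option (threshold and bounds below are illustrative)
igor -set_wd /path/to/wd -batch mouse_igh -set_genomic --V mouse_ighv.fasta --D mouse_ighd.fasta --J mouse_ighj.fasta -align --V ---thresh 50 ---offset_bounds -300 0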

Building a new model for an immune chain

Starting from an existing model topology

This is most likely the simplest option. Use the model topology (model_parms.txt) provided along with an existing entry in IGoR's model folder by loading it with the -load_custom_model command line option, providing no path to a marginals file. IGoR will then automatically create the corresponding marginals file with a uniform distribution. See the Advanced usage section of the documentation.
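
A minimal sketch of this workflow, assuming the alignments from the previous step are already available under the same batch name (paths and batch names are placeholders):

# Load an existing topology without marginals: IGoR initializes them uniformly,
# then infer the model from the aligned sequences
igor -set_wd /path/to/wd -batch mouse_igh -load_custom_model /path/to/model_parms.txt -infer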

Building a new model topology from scratch

IGoR's model is encoded as a Bayesian network, for which a C++ API is provided; for this you will need to use the C++ interface. An example of the syntax is given in the -run_demo section of the main.cpp file. Here are the basic steps:

First, create a new recombination event, e.g. the V gene choice:

//Construct the V gene choice event from the list of genomic V templates
Gene_choice v_choice(V_gene,v_genomic);
v_choice.set_nickname("v_choice"); //name used in model files and outputs
v_choice.set_priority(7); //priority controls the order in which events are explored

Add this event as a node in the Bayesian network:

// Create the model graph
Model_Parms parms;

//Add nodes to the graph
parms.add_event(&v_choice);
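
The edge example below refers to a v_3_del event, which must itself be created and added as a node first. A sketch following the same pattern (the Deletion constructor arguments here are illustrative; check the Deletion class and the -run_demo code for the exact signature):

//Create a deletion event acting on the 3' side of the chosen V gene
//(the allowed deletion range below is illustrative)
Deletion v_3_del(V_gene,Three_prime,std::make_pair(-4,16));
v_3_del.set_nickname("v_3_del");
v_3_del.set_priority(5);
parms.add_event(&v_3_del);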

Finally add conditional dependencies (directed edges) between events as desired:

// Add conditional dependency between V choice and number of deletions on 3' side of the V gene
// This will lead IGoR to record P(v_3_del|V)
parms.add_edge(&v_choice,&v_3_del);
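
Once all events and edges are in place, the topology can be written to disk in the model_parms.txt format and reused with -load_custom_model as described above (a sketch, assuming the writing method of the Model_Parms class in IGoR's sources):

//Write the resulting topology to a model_parms.txt-style file
parms.write_model_parms("my_model_parms.txt");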

Self consistency checks

In a nutshell: there is no perfect recipe; the only thing I can advise is to remain cautious at every step (alignment, inference, model selection, etc.).

Inference convergence

First of all, although it may sound obvious, you should make sure that the inference procedure has converged within the number of iterations you specified. For this you can look at the log-likelihood and/or parameter convergence (i.e. how much they keep varying over the last iterations). The log-likelihoods are given in the likelihoods.out file, along with the number of sequences for which at least one scenario has been attributed (and which have thus contributed to the model fitting). You should make sure that this number is not leaving too many sequences out, and/or that it is not systematically leaving out sequences with special features (e.g. sequences with hypermutations), thus introducing a bias, unless this is something controlled and desired. You can find which sequences have been assigned a 0 likelihood at a given iteration in the inference_logs.txt file.
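
As an illustration, a quick way to eyeball convergence from the semicolon-separated likelihoods.out file (format iteration;mean_log_Likelihood;n_seq, as shown further down in this thread) is to print the per-iteration change in mean log-likelihood; a sketch, assuming the file sits in your batch's output folder:

# Print the iteration number and the change in mean log-likelihood vs the previous iteration
awk -F';' 'NR>1 { if (prev != "") print $1, $2-prev; prev=$2 }' likelihoods.out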

Check against simpler methods

Although far from perfect, it is always good to check that the results you obtain at least make sense compared to simpler, alignment-based methods.

Check against synthetic data

It is generally good to make sure that your different alignment and inference parameters make sense for the sequencing dataset you have. You can perform this check by generating synthetic sequences from the model you learned with IGoR. Then cut/edit these sequences in a way that mimics your sequencing data (try to mimic single-end or paired-end sequencing by cutting your sequences in the regions corresponding to your primers, add a C part after the J if your primers lie in the C part, etc.). Then align and re-infer a model from your synthetic sequences, and check the consistency between the originally inferred model and the one inferred on the synthetic data (e.g. by computing the Kullback-Leibler divergence (DKL) between the two distributions using pygor). See the IGoR paper for technical details.
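
A sketch of this round trip on the command line (batch names, counts and file names are placeholders; the cutting/editing step in between is protocol-specific and not shown):

# 1. Generate synthetic sequences from the inferred model
igor -set_wd /path/to/wd -batch synth -load_custom_model my_model_parms.txt my_model_marginals.txt -generate 100000

# 2. Cut/edit the generated sequences to mimic your real reads (protocol specific)

# 3. Re-align and re-infer on the edited synthetic reads, then compare the two models
igor -set_wd /path/to/wd -batch synth_check -read_seqs edited_synthetic_reads.txt
igor -set_wd /path/to/wd -batch synth_check -set_genomic --V mouse_ighv.fasta --D mouse_ighd.fasta --J mouse_ighj.fasta -align --all
igor -set_wd /path/to/wd -batch synth_check -load_custom_model my_model_parms.txt -infer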

Check consistency between datasets

Pretty straightforward: simply make sure, using biological replicates, other donors, or even other sequencing techniques, that your results hold. This is probably the most bulletproof check, and it will also give you information about the limitations of different sequencing technologies.

Check higher order statistics

Another model design strategy is to compute quantities that are not directly encoded in your recombination model. For instance, compute the expected number of shared sequences as done in Pogorelyy et al. and check that prediction of your model against the actual fraction of shared sequences in your dataset. Another example is what we did in IGoR's paper to demonstrate the existence of double Ds. Yet another example is to compute the mutual information between recombination events in the data (e.g. using the scenario counter) versus the conditional dependencies encoded in your recombination model's Bayesian network. On this one, I am afraid the possibilities are only limited by our imagination.
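
For the scenario-based checks, the best scenarios per sequence can be recorded during evaluation with the scenario counter, as in the documentation's evaluation examples (batch name and model paths are placeholders):

# Evaluate the model on the aligned sequences and record the 10 best scenarios per sequence
igor -set_wd /path/to/wd -batch mouse_igh -load_custom_model my_model_parms.txt my_model_marginals.txt -evaluate -output --scenarios 10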

Hope this helps, Best, Quentin

decenwang commented 5 years ago

Hi @qmarcou, in your comments you mentioned:

The log-likelihoods are given in the likelihoods.out file, along with the number of sequences for which at least one scenario has been attributed (and which have thus contributed to the model fitting). You should make sure that this number is not leaving too many sequences out, and/or that it is not systematically leaving out sequences with special features (e.g. sequences with hypermutations), thus introducing a bias, unless this is something controlled and desired.

I tried with my 100 TCRb sequences; in the likelihoods.txt file it is written:

iteration;mean_log_Likelihood;n_seq
1;-17.8226;87
2;-14.4146;88
3;-14.1677;88
4;-14.0537;88
5;-14.0069;88
6;-13.6401;85
7;-13.5891;85
8;-13.7406;85
9;-13.5879;85

It iterated 9 times. Please look at this result.

  1. What is the best number of iterations?
  2. What does "you should make sure that this number is not leaving too many sequences out" mean?
  3. I have 100 sequences, but here the number decreased to 85. Why is that? Thanks!