MobiLLe

A multi-objective based clustering for inferring BCR clonal lineages from high-throughput B-cell repertoire data

MobiLLe is a new method based on multi-objective clustering to detect clonally-related sequences in BCR repertoires. It requires V(D)J annotations to obtain the initial clones and iteratively applies two objective functions that simultaneously optimize cohesion and separation within clonal lineages. MobiLLe is more computationally efficient and faster than state-of-the-art tools and does not require any training process or hyper-parameter optimization. It can easily manage large-scale experimental repertoires and provides plots that help researchers detect clonally-related sequences in high-throughput B cell repertoire data.

REFERENCE
Nika Abdollahi, Lucile Jeusset, Anne de Septenville, Hugues Ripoche, Frederic Davi and Juliana Silva Bernardes. "A multi-objective based clustering for inferring BCR clonal lineages from high-throughput B cell repertoire data." PLoS computational biology 18, no. 8 (2022): e1010411.

CONTACT
E-mail: juliana.silva_bernardes@sorbonne-universite.fr nikaabdollahi@gmail.com

Inputs

The IMGT/HighV-QUEST AIRR file must be provided:
- vquest_airr.tsv

See the example input file.
You can use any V(D)J annotation software, but the input must be formatted as above.
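
Before running MobiLLe on the output of another annotation tool, it can help to check that the file is a tab-delimited AIRR table. The sketch below is only an illustration: the column names are taken from the AIRR Rearrangement schema, and the list of columns MobiLLe actually reads is an assumption, as is the helper name `read_airr`.

```python
import csv
import io

# Columns assumed to be needed. The names come from the AIRR Rearrangement
# schema, not from the MobiLLe source code.
ASSUMED_COLUMNS = ("sequence_id", "v_call", "j_call", "junction_aa", "productive")


def read_airr(handle):
    """Return the rows of an AIRR TSV as dicts, checking the assumed columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in ASSUMED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for vquest_airr.tsv (made-up values).
demo = io.StringIO(
    "sequence_id\tv_call\tj_call\tjunction_aa\tproductive\n"
    "seq1\tIGHV3-23*01\tIGHJ4*02\tCARDYW\tT\n"
)
rows = read_airr(demo)
print(rows[0]["v_call"])
```

For a real file, replace the in-memory demo with `open("vquest_airr.tsv", newline="")`.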

Outputs

MobiLLe returns:
- 9 tab-delimited files:
  - [repertoire_name]_repertoire_two_levels_info.txt : each line contains information about one clonal lineage (example)
  The columns are:
```
Clonal lineage id   abundance   number of reads   Clonotype abundance, functionality   IGHV_and_allele   IGHJ_and_allele   Clonotypes CDR3, sequence_id
```
  - [repertoire_name]_cluster_distribution.txt : clusters and their abundance, sorted from highest to lowest (example)
  The columns are:
```
cluster_Id   abundance
```
  - [repertoire_name]_initial_clusters_Fo.txt : initial clustering output. Sequences with the same IGHV and IGHJ genes, the same CDR3 sequence length, and a CDR3 identity higher than the threshold s are grouped together (example)
  Each line contains the cluster id and sequence ids.
```
cluster_Id   seqid1 seqid2 ...
```
  - [repertoire_name]_final_clusters_Fo.txt : final clustering output, after minimizing intraclonal distances and maximizing interclonal distances (example)
```
cluster_Id   seqid1 seqid2 ...
```
  - [repertoire_name]_clusters_seq_info.txt : each line contains the following information for one sequence (example):
```
Cluster_id__clonotype_id   seq Id  functionality  IGHV_and_allele IGHJ_and_allele CDR3 Junction
```
  - [repertoire_name]_clone_uniformity.txt : uniformity of each cluster (example)
```
cluster_Id   uniformity
```
  - [repertoire_name]_clone_V_CDR3_J.txt (example)
  - [repertoire_name]_sameVJ_noallele_CDR3_0.7.txt (example)
  - [repertoire_name]_seq_Fo_V_CDR3_Jseq.txt (example)
  These 3 files contain information about the sequences of each cluster.
- A PNG file containing four panels (example):
  
  A) Circle representation of the uniformity and abundance of the clonal lineages. Each circle represents a clonal lineage and the circle area its abundance. Only the 20 most abundant groups are shown.
  
  B) Number of sequences in each clonal lineage. All groups are represented; the vertical axis is in log scale.
  
  C) Lorenz curve and Gini index. A Lorenz curve is a graphical representation of clonal inequality. The horizontal axis plots the cumulative fraction of clonal lineages, ordered from the least to the most abundant; the vertical axis shows the cumulative fraction of sequences.
  
  D) Size distribution (percentage) of the 100 most abundant clonal lineages.
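
The Lorenz curve and Gini index of panel C can be approximated from the abundances listed in [repertoire_name]_cluster_distribution.txt. The following is a minimal sketch, not MobiLLe's own plotting code: the function names are hypothetical, and the Gini index is computed here as one minus twice the area under the Lorenz curve, which may differ from the formula MobiLLe uses.

```python
def lorenz_points(sizes):
    """Cumulative (fraction of lineages, fraction of sequences), smallest first."""
    ordered = sorted(sizes)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, size in enumerate(ordered, start=1):
        running += size
        points.append((i / len(ordered), running / total))
    return points


def gini_index(sizes):
    """Gini index as 1 minus twice the trapezoidal area under the Lorenz curve."""
    pts = lorenz_points(sizes)
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


# Made-up lineage abundances: equal sizes give no inequality,
# a single dominant lineage gives a high index.
even = gini_index([5, 5, 5, 5])
skewed = gini_index([0, 0, 0, 10])
print(even, skewed)
```

A repertoire dominated by a few large lineages gives a curve far below the diagonal and an index close to 1.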

Requirements

We strongly recommend using an Anaconda environment.
Python version 3 or later

numpy :

  conda install numpy

  pip install numpy

matplotlib

  conda install -c conda-forge matplotlib

  pip install matplotlib

Palettable :

  conda install -c conda-forge palettable

  pip install palettable

skbio

  conda install -c https://conda.anaconda.org/biocore scikit-bio

  pip install scikit-bio

Levenshtein

  conda install -c conda-forge python-levenshtein

  pip install python-Levenshtein

Using MobiLLe

From the MobiLLe directory, run the following command:

  $ bash Mobille.sh -p [input_repertoire_name] -o [output_repertoire_name] -i [analysis_name] [options]

Alternatively, you can pass a file that contains your parameters (example):

  $ bash Mobille.sh -f [parameters_file]

required arguments

[input_repertoire_name] is the path to the directory containing the input file, for instance the IMGT/HighV-QUEST output folder.

[output_repertoire_name] is the path to the output directory.

[analysis_name] is the prefix of the output file names. Output files will be placed as follows:

~[output_repertoire_name]/[analysis_name]_cluster_distribution.txt
                        [analysis_name]_final_clusters_Fo.txt
                        [analysis_name]_final_clusters_seq_info.txt
                        [analysis_name]_initial_clusters_Fo.txt
                        [analysis_name]_unannotated_seq.txt
                        [analysis_name]_repertoire.png
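
The cluster files can be loaded for downstream analysis with a few lines of Python. This is a sketch under one assumption: each line holds a cluster id followed by whitespace-separated sequence ids, as in the `cluster_Id   seqid1 seqid2 ...` layout described under Outputs. The helper name `read_clusters` is hypothetical.

```python
import io


def read_clusters(handle):
    """Map cluster id -> list of sequence ids from a *_clusters_Fo.txt file."""
    clusters = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        clusters[fields[0]] = fields[1:]
    return clusters


# In-memory stand-in for [analysis_name]_final_clusters_Fo.txt (made-up ids).
demo = io.StringIO("1\tseqA seqB seqC\n2\tseqD\n")
clusters = read_clusters(demo)
sizes = {cid: len(ids) for cid, ids in clusters.items()}
print(sizes)
```

With real output, pass an open file handle on the final or initial clusters file instead of the in-memory demo.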

optional arguments [...options]

- s : CDR3 amino acid identity threshold for the initial clustering step, between 0 and 1 (by default 0.7)
- t : Abundance filter, the minimum count for a sequence to be considered in the analysis. If -t 0, all the sequences will be analysed
- q : Quality filter. If -q 1, sequences containing N will be discarded from the analysis (0 : no, 1 : yes)
- r : Apply the refining step (0 : no, 1 : yes). If -r 0, there is no need to provide the v, j, c, and m parameters.
- v : V-distance (1-binary, 2-Levenshtein, 3-GIANA, 4-K-mers), by default -v 1
- j : J-distance (1-binary, 2-Levenshtein, 3-GIANA, 4-K-mers), by default -j 2
- c : CDR3-distance (1-binary, 2-Levenshtein, 3-GIANA, 4-K-mers), by default -c 2
- d : Combined distance (1-mean, or three weights for IGHV, CDR3, and IGHJ; example: -d 123).
- m : Merging singletons (0 : no, 1 : yes)
Example:
```
$ bash Mobille.sh -p Input/toy_dataset/ -o Output/toy_dataset/ -i toy_dataset -t 0 -s 0.8 -q 1 -m 0 -r 1 -v 2 -j 2 -c 2 -d 123
```
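
To make the role of the -s threshold concrete, the initial clustering step can be sketched as follows. This is an illustration only, not MobiLLe's implementation: it assumes gene names with alleles already removed, measures identity as the fraction of matching positions, and greedily compares each sequence with the first member of each existing group. All function names are hypothetical.

```python
from collections import defaultdict


def identity(a, b):
    """Fraction of identical positions between two equal-length CDR3s."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def initial_clusters(seqs, s=0.7):
    """Group (seq_id, v_gene, j_gene, cdr3_aa) tuples into initial clusters."""
    # Step 1: same IGHV gene, same IGHJ gene, same CDR3 length.
    buckets = defaultdict(list)
    for seq_id, v, j, cdr3 in seqs:
        buckets[(v, j, len(cdr3))].append((seq_id, cdr3))
    # Step 2: within a bucket, join the first group whose representative
    # CDR3 has an identity higher than s (greedy choice, an assumption).
    clusters = []
    for members in buckets.values():
        groups = []  # each entry: (representative CDR3, list of seq ids)
        for seq_id, cdr3 in members:
            for rep, ids in groups:
                if identity(rep, cdr3) > s:
                    ids.append(seq_id)
                    break
            else:
                groups.append((cdr3, [seq_id]))
        clusters.extend(ids for _, ids in groups)
    return clusters


# Made-up sequences; s1 and s2 differ at one CDR3 position out of six.
seqs = [
    ("s1", "IGHV3-23", "IGHJ4", "CARDYW"),
    ("s2", "IGHV3-23", "IGHJ4", "CARDFW"),
    ("s3", "IGHV3-23", "IGHJ4", "CTTTTW"),
    ("s4", "IGHV1-69", "IGHJ6", "CARDYW"),
]
result = initial_clusters(seqs, s=0.8)
print(result)
```

Raising s makes the initial clusters stricter, so more sequences start as separate groups before the refining step.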

License, Patches, and Ongoing Developments

The program is distributed under the CeCILL licence.
Feature requests and open issues.
