A multi-objective based clustering for inferring BCR clonal lineages from high-throughput B-cell repertoire data
MobiLLe is a new method based on multi-objective clustering to detect clonally related sequences in BCR repertoires. It requires V(D)J annotations to obtain the initial clones and iteratively applies two objective functions that simultaneously optimize cohesion and separation within clonal lineages. MobiLLe is computationally more efficient and faster than state-of-the-art tools and does not require any training process or hyper-parameter optimization. It can easily manage large-scale experimental repertoires and provides useful plots to help researchers detect clonally related sequences in high-throughput B cell repertoire data.
REFERENCE
Nika Abdollahi, Lucile Jeusset, Anne de Septenville, Hugues Ripoche, Frederic Davi and Juliana Silva Bernardes. "A multi-objective based clustering for inferring BCR clonal lineages from high-throughput B cell repertoire data." PLoS computational biology 18, no. 8 (2022): e1010411.
CONTACT
E-mail:
juliana.silva_bernardes@sorbonne-universite.fr
nikaabdollahi@gmail.com
MobiLLe returns:
9 tab delimited files:
The columns are:
Clonal lineage id abundance number of reads Clonotype abundance, functionality IGHV_and_allele IGHJ_and_allele Clonotypes CDR3, sequence_id
The columns are:
cluster_Id abundance
Each line contains the cluster id and sequence ids.
cluster_Id seqid1 seqid2 ...
cluster_Id seqid1 seqid2 ...
Cluster_id__clonotype_id seq Id functionality IGHV_and_allele IGHJ_and_allele CDR3 Junction
cluster_Id uniformity
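The cluster files described above (one line per cluster, with the cluster id followed by its sequence ids) can be loaded with a few lines of Python. This is a minimal sketch, assuming the fields are separated by tabs or spaces and that the cluster id is always the first field; `read_clusters` is a hypothetical helper, not part of MobiLLe, and the example data is invented for illustration.

```python
import io


def read_clusters(handle):
    """Map each cluster id to the list of sequence ids on its line."""
    clusters = {}
    for line in handle:
        fields = line.split()  # assumption: tab- or space-separated fields
        if not fields:
            continue  # skip blank lines
        cluster_id, seq_ids = fields[0], fields[1:]
        clusters.setdefault(cluster_id, []).extend(seq_ids)
    return clusters


# Invented two-cluster example in the "cluster_Id seqid1 seqid2 ..." layout.
example = io.StringIO("1\tseqA\tseqB\tseqC\n2\tseqD\n")
clusters = read_clusters(example)
sizes = {cid: len(ids) for cid, ids in clusters.items()}
print(sizes)  # {'1': 3, '2': 1}
```

To read a real file, pass an open file object instead of the `io.StringIO` example, for instance `with open(path) as handle: clusters = read_clusters(handle)`.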
A PNG file (example) containing the following panels:
A) Circle representation of the clonal lineages' uniformity and abundances. Each circle symbolizes a clonal lineage, and the circle area represents its abundance. Only the 20 most abundant groups are shown.
B) Number of sequences in each clonal lineage. All groups are represented; the vertical axis is in log scale.
C) Lorenz curve and Gini index. A Lorenz curve is a graphical representation of clonal inequality. The horizontal axis shows the cumulative fraction of clonal lineages, ordered from the least to the most abundant; the vertical axis shows the cumulative fraction of sequences.
D) Size distribution (percentage) of the 100 most abundant clonal lineages.
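The Lorenz curve and Gini index of panel C can be reproduced from a list of clonal lineage abundances. The sketch below follows the axis definitions given above and uses the standard definition of the Gini index as one minus twice the area under the Lorenz curve; it is an illustration with NumPy, not necessarily the exact computation performed by MobiLLe, and the function names are hypothetical.

```python
import numpy as np


def lorenz_curve(abundances):
    """Return (cumulative fraction of lineages, cumulative fraction of sequences)."""
    sizes = np.sort(np.asarray(abundances, dtype=float))  # least to most abundant
    total = sizes.sum()
    if sizes.size == 0 or total <= 0:
        raise ValueError("abundances must contain at least one positive value")
    cum_sequences = np.concatenate(([0.0], np.cumsum(sizes) / total))
    cum_lineages = np.linspace(0.0, 1.0, sizes.size + 1)
    return cum_lineages, cum_sequences


def gini_index(abundances):
    """Gini index as 1 - 2 * (area under the Lorenz curve), trapezoid rule."""
    x, y = lorenz_curve(abundances)
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area


# Invented abundances: one dominant lineage and several small ones.
print(round(gini_index([120, 5, 3, 2, 1]), 3))
```

A repertoire in which all lineages have the same abundance gives a Gini index of 0, while a repertoire dominated by a single lineage gives a value approaching 1.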
We strongly recommend using an Anaconda environment.
Python version 3 or later
numpy :
conda install numpy
or
pip install numpy
matplotlib
conda install -c conda-forge matplotlib
or
pip install matplotlib
Palettable :
conda install -c conda-forge palettable
or
pip install palettable
skbio
conda install -c https://conda.anaconda.org/biocore scikit-bio
or
pip install scikit-bio
Levenshtein
conda install -c conda-forge python-levenshtein
or
pip install python-Levenshtein
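Once the packages above are installed, a quick check can confirm that they are importable from the active environment. This is a minimal sketch using only the standard library; the import names are assumed from the package list above (scikit-bio is imported as `skbio` and python-Levenshtein as `Levenshtein`), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Import names for the requirements listed above. The import name can
# differ from the package name (scikit-bio -> skbio).
REQUIRED = ["numpy", "matplotlib", "palettable", "skbio", "Levenshtein"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```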
In the MobiLLe directory, run the following command:
$ bash Mobille.sh -p [input_repertoire_name] -o [output_repertoire_name] -i [analysis_name] [options]
Alternatively, you can pass a file that contains your parameters (example):
$ bash Mobille.sh -f [parameters_file]
[output_repertoire_name] is the output directory path. Output files will be placed as such:
~[output_repertoire_name]/[analysis_name]_cluster_distribution.txt
[analysis_name]_final_clusters_Fo.txt
[analysis_name]_final_clusters_seq_info.txt
[analysis_name]_initial_clusters_Fo.txt
[analysis_name]_unannotated_seq.txt
[analysis_name]_repertoire.png
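After a run, it can be useful to verify that all the files listed above were written. The following is a small sketch that builds the expected paths from the output directory and analysis name, using the file name suffixes listed above; `expected_outputs` and `missing_outputs` are hypothetical helpers, not part of MobiLLe.

```python
from pathlib import Path

# File name suffixes taken from the output list above.
SUFFIXES = [
    "cluster_distribution.txt",
    "final_clusters_Fo.txt",
    "final_clusters_seq_info.txt",
    "initial_clusters_Fo.txt",
    "unannotated_seq.txt",
    "repertoire.png",
]


def expected_outputs(output_dir, analysis_name):
    """Build the paths MobiLLe is expected to write for one analysis."""
    base = Path(output_dir)
    return [base / f"{analysis_name}_{suffix}" for suffix in SUFFIXES]


def missing_outputs(output_dir, analysis_name):
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, analysis_name) if not p.exists()]


for path in missing_outputs("Output/toy_dataset", "toy_dataset"):
    print("not found:", path)
```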
s : CDR3 amino acid identity threshold for the initial clustering step, between 0 and 1 (default 0.7)
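To give an intuition for the identity threshold described above, the sketch below compares two CDR3 amino acid sequences. It assumes identity is defined as one minus the edit distance divided by the length of the longer sequence, and that a pair passes when its identity is at least the threshold; both choices are assumptions made for illustration, and the exact definition used by MobiLLe may differ. The function names and the example sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cdr3_identity(a, b):
    """Assumed definition: 1 - edit distance / length of the longer CDR3."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def passes_threshold(a, b, threshold=0.7):
    """True when the assumed identity reaches the threshold (default 0.7)."""
    return cdr3_identity(a, b) >= threshold


# Invented CDR3 sequences differing by one residue out of six.
print(passes_threshold("CARDYW", "CARDFW"))        # identity about 0.83
print(passes_threshold("CARDYW", "CARDFW", 0.9))   # stricter threshold
```

Under this assumed definition, raising the threshold toward 1 makes the initial grouping stricter, since fewer CDR3 pairs reach the required identity.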
Example:
$ bash Mobille.sh -p Input/toy_dataset/ -o Output/toy_dataset/ -i toy_dataset -t 0 -s 0.8 -q 1 -m 0 -r 1 -v 2 -j 2 -c 2 -d 123