So fast! Is retraining possible?

paoslaos commented 6 months ago

Dear developers,

congratulations on this very efficient implementation! I came to ask if it is possible to re-train your model? I did not see any hints about this, am I correct to assume that it is not supported?

Sincerly, P.

jlparkI commented 6 months ago

Thanks, much appreciated! Yes it is possible. The model we used is a very simple categorical mixture, and there is code for fitting it to antibody sequences (or any other multiple sequence alignment) at https://github.com/jlparkI/categorical_mixture, which is a separate repository that can be installed using pip if you're interested in using it -- if you run into any issues or find some obvious improvement we can make let us know. It uses both multithreading and Python's multiprocessing library for parallelization. For fitting a small number of clusters (e.g. 100) to a relatively small number of sequences (e.g. 1 million) it's already quite fast (although there is undoubtedly some room for further optimization). For fitting a large number of clusters to a large number of sequences (e.g. 2,000 clusters to 50 million sequences), it may need some modifications to make it more compatible with some other parallelization approach (depending on how your compute cluster is set up and what kinds of options are available).

paoslaos commented 6 months ago

Thank you @jlparkI, very good answer. If I may, I have another question for data preparation. What are the valid imgt positions you are using? I saw for AbLSTM which uses a similar approach for formatting the input that their data file goes up to length 150. Should it be 128?

Am I correct that depending on the numbering you define, I can extract the CDR sequences with these indices? https://github.com/jlparkI/AntPack/blob/d634a2ad810c6d141428d2d191db54d9a4abcd85/antpack/numbering_tools/constants/imgt_default_params.py#L38 However, but I do not see insertions code for CDR there (111A/B) etc.

Thank you!

Sincerly, P.

jlparkI commented 6 months ago

So there are different definitions of CDRs depending on which numbering scheme you use. If you use the IMGT numbering scheme (default), you can find the CDR definitions here:

https://www.imgt.org/IMGTScientificChart/Nomenclature/IMGT-FRCDRdefinition.html

This paper has a nice spreadsheet that illustrates how the different numbering schemes compare and how the CDR definitions compare. Since CDR1 in the IMGT scheme is from positions 27-38, for example, if there are insertions in 27 - 38, they would also be in the CDR -- so 32A or 33A or 33B would also be in the CDR for example. It is possible to have insertions outside of CDRs although most insertions tend to be in CDRs.

The IMGT scheme technically has 128 positions, but very frequently there are insertions in CDRs. So, you might have some sequences in your dataset that are length 128, you might have a few that are shorter due to some deletions, and some that are significantly longer due to insertions (e.g. length 150). CDR3 is the most variable in length.

Would it be helpful if AntPack were to provide an option in which CDRs are automatically indicated when numbering output?

paoslaos commented 6 months ago

Thanks for your insightful response once more.

I think it would be helpful to retrieve the sequence, similar to what abnumber or anarci do. E.g, in the simplest case, provide a series with FR1, CDR1, FR2 etc as index or alternatively, some properties that allow to do seq_analysis.get_cdr(sequence, numbering).

But if I understood correctly, its just a matter of slicing according to the definitions in the linked code and then remove any gap characters. Some more convenient constant definitions would probably also help with the slicing. Here is a very hacky and not really safe way to demonstrate what I mean:

import abnumber
import pandas as pd
from antpack import SingleChainAnnotator

pd.set_option('display.max_rows', 150)

# random sequence from sabdab with insertions
seq = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSRYGMSWVRQAPGKRLEWVATISIGGTYTYYPDSMKGRFTISRDNSKNTLYLQMNSLRAEDTAMYYCARRGYGQYSYYGLDYWGQGTLVTVSS"
seq_ar = list(seq)

# antpack
aligner = SingleChainAnnotator(chains=["H", "K", "L"], scheme="imgt", compress_init_gaps=False)
numbered, chain, _, _ = aligner.analyze_seq(seq)
antpack_df = pd.DataFrame({"sequence": seq_ar, "numbered": numbered})

# abnumber as comparison
ab_chain = abnumber.Chain(seq, scheme="imgt", cdr_definition="imgt")
cdr1_abnumber = ab_chain.cdr1_seq
cdr2_abnumber = ab_chain.cdr2_seq
cdr3_abnumber = ab_chain.cdr3_seq

# not really safe for frws?
# source: https://www.imgt.org/IMGTScientificChart/Nomenclature/IMGT-FRCDRdefinition.html
cdr_1_regex = "^" + "|^".join("27 28 29 30 31 32 33 34 35 36 37 38".split())
cdr_2_regex = "^" + "|^".join("56 57 58 59 60 61 62 63 64 65".split())
cdr_3_regex = "^" + "|^".join(
    "105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117".replace(",", "").split())

cdr1_antpack_df = antpack_df[antpack_df["numbered"].str.contains(cdr_1_regex)]
cdr1_antpack = "".join(cdr1_antpack_df["sequence"])

cdr2_antpack_df = antpack_df[antpack_df["numbered"].str.contains(cdr_2_regex)]
cdr2_antpack = "".join(cdr2_antpack_df["sequence"])

cdr3_antpack_df = antpack_df[antpack_df["numbered"].str.contains(cdr_3_regex)]
cdr3_antpack = "".join(cdr3_antpack_df["sequence"])

print(cdr1_antpack)
print(cdr1_abnumber)

print(cdr2_antpack)
print(cdr2_abnumber)

print(cdr3_abnumber)
print(cdr3_antpack)

assert cdr1_antpack == cdr1_abnumber
assert cdr2_antpack == cdr2_abnumber
assert cdr3_antpack == cdr3_abnumber

Which prints:

GFTFSRYG
GFTFSRYG
ISIGGTYT
ISIGGTYT
ARRGYGQYSYYGLDY
ARRGYGQYSYYGLDY

Sincerly, P.

jlparkI commented 6 months ago

OK, I've added some functionality that should make this quite straightforward, all included in the new version, 0.2.5, which is now on PyPi.

First: you can now get the CDR and framework assignments when you number a sequence. Also, there's now functionality for sorting a list of position codes that correspond to a given scheme The numbering example in the documentation illustrates how to do both of these things.

For indicating CDR / framework regions, just use:

my_annotator.analyze_seq(seq, get_region_labels = True)

The get region labels argument defaults to False; if you set it to True, in addition to returning the usual things, the SingleChainAnnotator will now also return a list (of the same length as the sequence) where each position is one of "-", "fmwk1", "cdr1", "fmwk2", "cdr2", "fmwk3", or "cdr3". Then you can pull whichever one you are looking for out easily. The CDR definitions used are those which correspond to the selected scheme.

Also, let's say you have a list of sequences and are trying to merge them into an MSA (maybe you want to convert them to a fixed-length numpy array e.g. for input to XGBoost or if you want to fit this categorical mixture model). You can get a list of all the unique position codes that occur as you number your sequences, for example:

all_observed_position_codes = set()

numbering = []

for seq in additional_sequences:
    results = my_annotator.analyze_seq(seq)
    numbering.append(results[0])
    if results[3] != "":
        # If there is an error message indicating a bad alignment, do something here
        pass

    # Add all position codes to the set
    all_observed_position_codes.update(results[0])

all_observed_position_codes = list(all_observed_position_codes)

But this list will of course be out of order. So, you can sort it using the SingleChainAnnotator tool by calling:

# Important caveat: "-" is not a valid position code. So remove it before sorting.
all_observed_position_codes = [a for a in all_observed_position_codes if a != "-"]
all_observed_position_codes = my_annotator.sort_position_codes(all_observed_position_codes, scheme = "imgt")

which will put it in the correct order for that scheme (for IMGT, for example, ["111", "111A", "111B", "112B", "112A", "112"] is correct ordering because 111 is an insertion point in a CDR, while ["80A", "80B", "80C", "81"] is correct ordering because 80 is not an insertion point in a CDR). Then you can easily create a dictionary mapping the position codes in your sequences to positions in the fixed-length array or MSA you are creating, e.g.:

position_dict = {k:i for i, k in enumerate(all_observed_position_codes)}

If the added functionality doesn't do what you need, let me know. Otherwise -- closing this issue. Thanks.

jlparkI commented 4 months ago

In case this is still useful, just a heads-up that building a multiple sequence alignment from a set of sequences is easier with some new features in v0.3. sort_position_codes has been simplified and SingleChainAnnotator / PairedChainAnnotator now both have the method build_msa. See docs .

jlparkI / AntPack

So fast! Is retraining possible? #9