oxpig / ANARCI

Antibody Numbering and Antigen Receptor ClassIfication
BSD 3-Clause "New" or "Revised" License
161 stars 84 forks source link

Can ANARCI support hmm search with a specific chain? #74

Open yzhougnf opened 6 months ago

yzhougnf commented 6 months ago

We often need to explicitly search CDRs for a heavy chain or a light chain. E.g., we have a nanobody sequence (with heavy and light chain fused), we don't want to get heavy chain CDRs or light chain CDRs randomly. We prefer to use anarci to first scan for heavy chain CDRs, and then scan for light chain CDRs.

This means we should modify the following function to consider an optional chain type argument.

_parse_hmmer_query(query, bit_score_threshold=80, hmmer_species=None, chain_type=None):

            if chain_type is None: return hsp_list
            hsp_list=[hsp for hsp in hsp_list if hsp.hit_id.endswith(f"{chain_type}")]

Thanks for the consideration.

ywilke commented 6 months ago
anarci(sequences, scheme="imgt", database="ALL", output=False, outfile=None, csv=False, allow=set(["H","K","L","A","B","G","D"]), 
           hmmerpath="", ncpu=None, assign_germline=False, allowed_species=['human','mouse'], bit_score_threshold=80)

If you want only heavy chain you could do allow=set(["H"])

yzhougnf commented 6 months ago

It seems like this is a bug. the reason is chain_type in number_seqquce_from_alignment is not enforced.

When I set allow to set(["L"]), it returns both "K" and "L" based on the code:

       elif chain_type in "KL":
            return number_chothia_light(state_vector, sequence)

In my case the "K" hit happens to be the one with higher bit_score, so it returns the "K" result instead of the "L" as specified. Could this be fixed? Thanks.

ywilke commented 6 months ago

The K and L stand for the Kappa and Lambda chain which are both light chains. https://en.wikipedia.org/wiki/Immunoglobulin_light_chain

I am not a developer by the way.

yzhougnf commented 6 months ago

I check further and I think I found the root cause. My sequence matches both K and L, K is the best hit and L is the second hit.

In _parse_hmmer_query returns "top_descriptions". However, the chain_type is set to the chain type of the top hit. I.e., in my case, althoug h we have two hits K and L. The chain type is set to "K".

Later in number_sequences_from_alignment, we have ''' if state_vector and details["chain_type"] in allow: ''' But since details["chain_type"] is "K" and the allow list is ("L"), the hit was dropped. So I ended up with no hits, when I set allow=set(["L"]).

The L-chain hit was seconded to the K-chain hit, and it's chain type was not used. So the L-chain hit did not survive in the "allow" filter.

To me, it seems one solution is in my first post, i.e., to consider "allow" logic within _parse_hmmer_query, so that "K" top hit will be filtered out. Then the "L" hit can survive.

yzhougnf commented 6 months ago

Or the document should say "K" and "L" should not be used alone, they must be used together as set(["K","L"]). I am doing that as the workaround. Thanks.