prihoda / AbNumber

Convenience Python APIs for antibody numbering using ANARCI
MIT License

ChainParseError: 2 antibody domains in sequence #7

Open deweihu96 opened 2 years ago

deweihu96 commented 2 years ago

ANARCI supports two antibody domains in one sequence, but AbNumber does not. Parsing the sequence below raises:

abnumber.exceptions.ChainParseError: Found 2 antibody domains in sequence: "DIQLTQSPSFLSASVGDRVTITCSARSSISFMYWYQQKPGKAPKLLIYDTSNLASGVPSRFSGSGSGTEFTLTISSLEAEDAATYYCQQWSSYPLTFGQGTKLEIKGGGSGGGGEVQLVESGGGLVQPGGSLRLSCAASGFTFSTYAMNWVRQAPGKGLEWVGRIRSKYNNYATYYADSVKDRFTISRDDSKNSLYLQMNSLKTEDTAVYYCVRHGNFGNSYVSWFAYWGQGTLVTVSSGGCGGGEVAALEKEVAALEKEVAALEKEVAALEKGGGDKTHTCPPCPAPEAAGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAKTKPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALPAPIEKISKAKGQPREPQVYTLPPSREEMTKNQVSLWCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLSPGK"

prihoda commented 2 years ago

Hi @deweihu96, thanks for reporting this, I would like to support this in the future. A pull request would be welcome.

The current AbNumber Chain object can only hold a single variable domain, with a single CDR3, etc. So this probably cannot be supported through chain = Chain(seq, 'imgt'); it would need a separate call such as chains = Chain.parse_domains(seq, 'imgt').

So if you have a sequence like Var1Const1Var2Const2, you would get two Chain objects, where chain.tail holds whatever sequence immediately follows the variable domain (chain1.tail = "Const1").
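To make the proposed tail semantics concrete, here is a minimal plain-Python sketch. It does not use AbNumber at all: `Chain.parse_domains` is only a proposal in this thread, and `DomainSlice` and `split_with_tails` are hypothetical names invented for illustration. It assumes the variable-domain boundaries are already known as 0-based, end-exclusive (start, end) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DomainSlice:
    """Hypothetical container: one variable domain plus the sequence after it."""
    variable: str
    tail: str


def split_with_tails(seq: str, ranges: List[Tuple[int, int]]) -> List[DomainSlice]:
    """Split seq into variable domains, each with the tail that follows it.

    The tail of a domain runs from the end of that domain to the start of
    the next domain (or to the end of the sequence for the last domain).
    Anything before the first domain is dropped in this sketch.
    """
    ordered = sorted(ranges)
    result = []
    for i, (start, end) in enumerate(ordered):
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else len(seq)
        result.append(DomainSlice(variable=seq[start:end], tail=seq[end:next_start]))
    return result


# Toy sequence from the comment above, with made-up domain boundaries.
toy = "Var1Const1Var2Const2"
parts = split_with_tails(toy, [(0, 4), (10, 14)])
for part in parts:
    print(part.variable, part.tail)
```

With the toy input, the first slice carries "Const1" as its tail and the second carries "Const2", which is the behaviour described for chain1.tail.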

deweihu96 commented 2 years ago

Hi @prihoda ~ Thanks for your reply. The simplest way that I came up with is:

  1. Use anarci to find the two domains, and slice the sequence into two domains;
  2. Use abnumber to do the numbering on the two sequences.

prihoda commented 2 years ago

@deweihu96 sounds good. Can you share the part of the code where you parse the anarci output?

deweihu96 commented 2 years ago

@prihoda

>>> import anarci
>>> seq = 'QIQLVQSGSELKKPGASVKVSCKASGYTFTHYAMNWVRQAPGQGLEWMGWINTNTGEPTYAQGFTGRFVFSLDTSVSTAYLQISSLKAEDTAVYYCAREREPGMDEWGQGTLVTVSSGGGGSSSSSSDVVMTQSPLSLPVTLGQPASISCRSSQSLVHANTNTYLEWYQQRPGQSPRLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCFQGTHVPNTFGQGTKLEIK'
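>>> # Pass the scheme by keyword: the second positional parameter of run_anarci is ncpu,
>>> # so a positional 'kabat' is not used as the scheme and the default 'imgt' applies (see output).
>>> sequences, numbered, alignment_details, hit_tables = anarci.run_anarci(seq, scheme='imgt', allowed_species='human')
>>> alignment_details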
#[[
#{'id': 'human_H', 'description': '', 'evalue': 1.4e-55, 'bitscore': 178.0, 'bias': 1.0, 'query_start': 0, 'query_end': 117, 'species': 'human', 'chain_type': 'H', 'scheme': 'imgt', 'query_name': 'Input sequence'}, 
#{'id': 'human_K', 'description': '', 'evalue': 1.9e-56, 'bitscore': 180.6, 'bias': 0.1, 'query_start': 127, 'query_end': 239, 'species': 'human', 'chain_type': 'K', 'scheme': 'imgt', 'query_name': 'Input sequence'}]]

Once you have the start and end positions, slice the sequence and parse each slice with abnumber :)
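The slicing step could look roughly like the sketch below. It reuses the sequence and the query_start / query_end values from the output above, and treats them as 0-based, end-exclusive indices. That is an assumption, but it is consistent with this example: the heavy domain ends at index 117 and the kappa domain ends at 239, the full sequence length. `slice_domains` and `number_domains` are hypothetical helper names. The abnumber import is deferred so the slicing part runs without abnumber installed.

```python
from typing import Dict, List, Tuple

SEQ = "QIQLVQSGSELKKPGASVKVSCKASGYTFTHYAMNWVRQAPGQGLEWMGWINTNTGEPTYAQGFTGRFVFSLDTSVSTAYLQISSLKAEDTAVYYCAREREPGMDEWGQGTLVTVSSGGGGSSSSSSDVVMTQSPLSLPVTLGQPASISCRSSQSLVHANTNTYLEWYQQRPGQSPRLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCFQGTHVPNTFGQGTKLEIK"

# alignment_details[0] from the session above, reduced to the keys used here.
HITS = [
    {"chain_type": "H", "query_start": 0, "query_end": 117},
    {"chain_type": "K", "query_start": 127, "query_end": 239},
]


def slice_domains(seq: str, hits: List[Dict]) -> List[Tuple[str, str]]:
    """Return (chain_type, subsequence) for every domain hit.

    Assumes query_start/query_end behave like Python slice bounds.
    """
    return [(h["chain_type"], seq[h["query_start"]:h["query_end"]]) for h in hits]


def number_domains(seq: str, hits: List[Dict], scheme: str = "imgt"):
    """Build one abnumber Chain per sliced domain (requires abnumber)."""
    from abnumber import Chain  # imported lazily; only needed for this step
    return [Chain(sub, scheme=scheme) for _, sub in slice_domains(seq, hits)]


domains = slice_domains(SEQ, HITS)
for chain_type, sub in domains:
    print(chain_type, len(sub), sub[:6], sub[-5:])
```

Calling `number_domains(SEQ, HITS)` would then give one Chain object per domain. The linker between the two domains (indices 117 to 126) is simply discarded here.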

I noticed that you're also one of the authors of biophi. I want to say that's a really great job!