nmdp-bioinformatics / imgt2aa

extract aligned amino acid sequences from IMGT/HLA
GNU Lesser General Public License v3.0

performance issues #3

Open mmaiers-nmdp opened 7 years ago

mmaiers-nmdp commented 7 years ago

@pbashyal-nmdp reported slow performance. @mhalagan-nmdp pointed out that Biopython has a parser for the "IMGT variant of the EMBL plain text file format", so one idea is to port this to Python. An advantage of reading from HLA.DAT is that the code could easily be adapted to KIR.DAT, for which there is no comparable KIR.XML.

http://biopython.org/wiki/SeqIO

bmilius-nmdp commented 7 years ago

I played with the Biopython IMGT parser a couple of months ago and had some issues with it, but I don't remember exactly what they were. You might want to make sure it works the way you want it to.

mhalagan commented 7 years ago

It works really well for me.

This snippet parses the dat file and collects the sequence of every annotated feature (exons included) for each record, ready to be translated:

from Bio import SeqIO
seq_list = SeqIO.parse("hla.dat", "imgt")  # yields one SeqRecord per entry in the file
full_sequences = [[record.name, [[feat.type, record.seq[feat.location.start:feat.location.end]] for feat in record.features]] for record in seq_list]
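
For the exon and translation step mentioned above, here is a rough sketch of how it might look (not code from this repository). The names `ExonRow`, `exon_rows`, `protein` and `load_records` are hypothetical. It assumes each HLA.DAT entry has `exon` features with a `/number` qualifier and a `CDS` feature with `/translation` and `/codon_start` qualifiers; if `/translation` is missing it falls back to translating the extracted CDS.

```python
from collections import namedtuple

# Hypothetical container for one exon of one allele record.
ExonRow = namedtuple("ExonRow", "record_name number sequence")


def exon_rows(records):
    """Yield one ExonRow per 'exon' feature of each SeqRecord-like object."""
    for record in records:
        for feat in record.features:
            if feat.type != "exon":
                continue  # skip UTRs, introns, CDS and source features
            # Qualifier values are lists of strings; "?" marks a missing number.
            number = feat.qualifiers.get("number", ["?"])[0]
            # extract() honours the feature location instead of slicing by hand.
            yield ExonRow(record.name, number, str(feat.extract(record.seq)))


def protein(record):
    """Return the protein for the first CDS feature, or None if there is none."""
    for feat in record.features:
        if feat.type != "CDS":
            continue
        # Assumption: the file usually ships the translation as a qualifier.
        annotated = feat.qualifiers.get("translation")
        if annotated:
            return annotated[0]
        # Fallback: translate the joined coding sequence ourselves.
        cds = feat.extract(record.seq)
        offset = int(feat.qualifiers.get("codon_start", ["1"])[0]) - 1
        cds = cds[offset:]
        cds = cds[: len(cds) - len(cds) % 3]  # drop a trailing partial codon
        return str(cds.translate(to_stop=True))
    return None


def load_records(path="hla.dat"):
    """Parse an IMGT-flavoured EMBL file lazily; requires Biopython."""
    from Bio import SeqIO  # imported here so the helpers above stay dependency-free

    return SeqIO.parse(path, "imgt")
```

Something like `for row in exon_rows(load_records("hla.dat")): print(row)` would list the exons. Because the helpers only rely on `features`, `qualifiers` and `extract`, the same code should carry over to KIR.DAT if its entries are annotated the same way, which I have not checked.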