snijderlab / stitch

Template-based assembly of proteomics short reads for de novo antibody sequencing and repertoire profiling
MIT License

Support mmCIF files for sequence loading #207

Closed by douweschulte 1 year ago

douweschulte commented 1 year ago

This would just extract the sequence of each chain and store each one as a separate sequence. It should be possible to load these both as templates and as input reads. The implementation still needs some thought, although parsing only the basic mmCIF structure and then brute-force searching for the relevant pieces of information should suffice. This is needed to load data from ModelAngelo into Stitch.
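As a rough illustration of the "one sequence per chain" idea, the sketch below collapses atom-level records into a one-letter sequence per chain. It is written in Python and is not taken from Stitch; the function name `chain_sequences` and the row layout (chain id, residue number, three-letter residue name) are assumptions made for the example.

```python
from itertools import groupby

# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(atom_rows):
    """Build {chain_id: sequence} from (chain_id, residue_number, residue_name) rows.

    Hypothetical helper: one row per atom, in file order.
    """
    sequences = {}
    # Consecutive atoms sharing a chain id and residue number form one residue.
    for (chain, _number), atoms in groupby(atom_rows, key=lambda r: (r[0], r[1])):
        name = next(atoms)[2]
        letter = THREE_TO_ONE.get(name.upper(), "X")  # unknown residues become X
        sequences[chain] = sequences.get(chain, "") + letter
    return sequences
```

For example, `chain_sequences([("A", "1", "GLU"), ("A", "1", "GLU"), ("A", "2", "VAL")])` gives `{"A": "EV"}`. Because the grouping only looks at consecutive rows, atoms of one residue that are not adjacent in the input would be counted as separate residues.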

douweschulte commented 1 year ago

The approach is to tokenize (lex) the full mmCIF file and then search for the main atomic data loop and retrieve the needed data from it. It cannot handle incorrectly ordered lines in the loop: if a line of one residue is interleaved with the lines of another, both residues will be duplicated in the output. The next step is to analyse these files and see whether the program's handling of ModelAngelo data can be improved.
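The tokenize-then-search approach could look roughly like the Python sketch below. It is not Stitch's implementation: the names `tokenize` and `atom_site_rows` are hypothetical, and the lexer is deliberately simplified (it drops quotes, so a quoted value that starts with `_` would be mistaken for a data name). It assumes the atomic data sits in a `loop_` whose columns all start with `_atom_site.`, as in standard mmCIF files.

```python
import re

# One token per match: quoted string, comment, or bare word.
_TOKEN = re.compile(
    r"""'(?P<sq>.*?)'(?=\s|$)|"(?P<dq>.*?)"(?=\s|$)|(?P<comment>#.*)|(?P<bare>\S+)"""
)


def tokenize(text):
    """Split mmCIF text into a flat list of tokens (simplified lexer)."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].startswith(";"):
            # Multi-line text field: runs until the next line starting with ';'.
            parts = [lines[i][1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                parts.append(lines[i])
                i += 1
            tokens.append("\n".join(parts))
            i += 1  # skip the closing ';' line
            continue
        for match in _TOKEN.finditer(lines[i]):
            if match.lastgroup != "comment":
                tokens.append(match.group(match.lastgroup))
        i += 1
    return tokens


def _is_keyword(token):
    lowered = token.lower()
    return token.startswith("_") or lowered == "loop_" or lowered.startswith(("data_", "save_"))


def atom_site_rows(tokens):
    """Return the rows of the first `_atom_site` loop as dicts keyed by column name."""
    i = 0
    while i < len(tokens):
        if tokens[i].lower() != "loop_":
            i += 1
            continue
        i += 1
        columns = []
        while i < len(tokens) and tokens[i].startswith("_"):
            columns.append(tokens[i])
            i += 1
        if not columns or not all(c.lower().startswith("_atom_site.") for c in columns):
            continue  # some other loop, keep searching
        values = []
        while i < len(tokens) and not _is_keyword(tokens[i]):
            values.append(tokens[i])
            i += 1
        width = len(columns)
        # Cut the flat value list into rows; an incomplete trailing row is dropped.
        return [
            dict(zip(columns, values[j:j + width]))
            for j in range(0, len(values) - width + 1, width)
        ]
    return []
```

Each returned row maps column names such as `_atom_site.label_asym_id`, `_atom_site.label_seq_id` and `_atom_site.label_comp_id` to their string values, which is enough to group atoms by chain and residue afterwards. Lexing the whole file first keeps the loop search independent of line breaks and quoting, at the cost of holding all tokens in memory.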