project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
205 stars 42 forks source link

sequence aligning with align_sequence_to_polymer #302

Closed rimmartin closed 3 months ago

rimmartin commented 4 months ago

Hi @wojdyr & others,

Been trying to use the aligment tool on https://github.com/project-gemmi/gemmi/blob/master/tests/1orc.pdb

        gemmi::Structure st1 = gemmi::read_pdb_file(filePath);
...
                gemmi::AlignmentScoring scoring;
                gemmi::AlignmentResult result = gemmi::align_sequence_to_polymer(st1.entities[entityIndex].full_sequence, st1.models[0].chains[chainIndex].get_polymer(), gemmi::PolymerType::Unknown, &scoring);
                std::cout<<st1.models[0].chains[chainIndex].name<<" "<<st1.entities[entityIndex].name<<" match_count: "<<result.match_count<<" "<<result.score<<std::endl;

and get

A A match_count: 0 1844108996

The match count is zero. Is there something I'm not setting up correctly? align_sequence_to_polymer the right function to use?

wojdyr commented 4 months ago

I couldn't reproduce it:

>>> st = gemmi.read_structure('tests/1orc.pdb')
>>> result = gemmi.align_sequence_to_polymer(st.entities[0].full_sequence, st[0][0].get_polymer(), gemmi.PolymerType.Unknown)
>>> result.match_count
64

I also got the same result from:

gemmi align -v --query=A --target=+A tests/1orc.pdb

Could you post a minimal code that can be compiled?

rimmartin commented 4 months ago

rhel6gemmiapp.tar.gz

I see it toggle between 0 & 4 on repeated runs

This is on a 64 bit Centos 7 and compiled with gcc-10.3.0-04202021

Build and run script in tar

rimmartin commented 4 months ago
                gemmi::AlignmentScoring scoring={ 1, -1, -1, -1, 0, -2, {}, {} };

works better while

                gemmi::AlignmentScoring scoring;

may not be initialized Seemed to be better but still zero match count...

wojdyr commented 4 months ago

Yes, that's it. AlignmentScoring is uninitialized in C++. I'll add member initializers, to make it consistent with Python, but I'll wait with it until I change the minimal standard to C++14. Alternatively, you can use gemmi::AlignmentScoring::simple() as the last arg for align_sequence_to_polymer().

No idea why you still get zero match count, though.

rimmartin commented 4 months ago

Agreed. And low priority for changes for me since there is a way to initialize.

Will keep experimenting to see why I'm not getting a good match count; using partial_model

0 0 A A match_count: 0 score :2097088
1orc 0 aligned: MEQRITLKDYAMRFGQ-TKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVY--AEEVKDGEVKPFP----
1orc        : ................ ...................................  .............     cigar 16M1I35M2I13M4I
1orc alignment completed 
rimmartin commented 4 months ago

From miminal rhel6gemmiapp I now get the match count 64

Looking to replicate in our app. Thanks for the help/conversation today

rimmartin commented 4 months ago

Working in our app; returned to using gemmi::PolymerType::Unknown instead of st1.entities[0].polymer_type as the minimal app was set to gemmi::PolymerType::Unknown :-)

rimmartin commented 4 months ago

one last thing to report: same cigar as the python test "2I64M5I"

rimmartin commented 3 months ago

Hi @wojdyr ,

With the tests/1orc.pdb example when I add

            std::string sequenceRepresentation=result.add_gaps(gemmi::one_letter_code(st.entities[0].full_sequence), 2);

the output shows

 sequenceRepresentation: 
--MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFP-----
 match_string: 
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||      cigar:2I64M5I

comparing to the pdb's SEQRES the one letter code starts at full sequence MET GLU GLN ARG ILE THR... where as the pdb coordinates start at GLN ARG ILE THR...

rimmartin commented 3 months ago

Ah so Lance pointed out you probably increment on the polymer when dashes come out; looking the which =2 is for polymer sequences and 1 is for full sequence

                                    std::string sequenceRespresentation=result.add_gaps(gemmi::one_letter_code((*it_chains).get_polymer().extract_sequence()), 2);

yields

1orc+H 0 aligned: 
--QRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFPSN-----
1orc+H match_string: 
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||      cigar:2I64M5I

an enumeration for the 1 & 2 would help guys like me:-)