steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Alignment length is longer than target/query/both #65

Closed chanqian18 closed 8 months ago

chanqian18 commented 1 year ago

In Foldseek output, qcov and tcov are the aligned parts of the query and target, respectively, divided by the length of the sequence. For my use case, it would be useful to have the overlap of the target and query, as I would like to keep only results where at least a certain fraction (0.6) of the query is covered by the alignment. In some cases, the alignment length appears to be longer than the target length (image 1), making the filter qcov>0.6 miss these results. These alignments give a high alntmscore, but only align to small regions (perhaps one helix, image 2). Is this expected behaviour in Foldseek? How is the alignment length calculated?

image

image

Your Environment

Foldseek parameters: foldseek search temp -s 9 --alignment-type 1 -a

chanqian18 commented 1 year ago

Attached is a result where alignment length exceeds both, which I forgot to attach in the original post. image

martin-steinegger commented 1 year ago

We produce local alignments; alnlen is the total length of the alignment, including gap columns for deletions and insertions. Coverage is the fraction of residues covered by the alignment, in either the query (qcov) or the target (tcov).
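
To make the distinction concrete, here is a minimal Python sketch of how alnlen can exceed both sequence lengths while the coverage values stay at or below 1. It is not Foldseek's implementation: the function name `alignment_stats`, the gapped strings, and the sequence lengths are hypothetical, and it assumes coverage is the number of residues inside the aligned region divided by the full sequence length.

```python
def alignment_stats(qaln: str, taln: str, qlen: int, tlen: int, gap: str = "-") -> dict:
    """Compute alnlen, qcov and tcov from two gapped alignment strings.

    Assumption (not taken from the Foldseek source): coverage is the number
    of residues of a sequence inside the aligned region divided by that
    sequence's full length.
    """
    if len(qaln) != len(taln):
        raise ValueError("gapped alignment strings must have equal length")
    # alnlen counts every alignment column, including gap columns.
    alnlen = len(qaln)
    # Residues covered are the non-gap characters on each side.
    q_residues = sum(1 for ch in qaln if ch != gap)
    t_residues = sum(1 for ch in taln if ch != gap)
    return {"alnlen": alnlen, "qcov": q_residues / qlen, "tcov": t_residues / tlen}


# Hypothetical alignment: 8 columns, 6 query residues, 7 target residues.
stats = alignment_stats("ACD--EFG", "A-DKKEFG", qlen=6, tlen=7)
print(stats)
```

In this toy case the alignment has gaps on both sides, so alnlen (8 columns) is larger than both the query length (6) and the target length (7), even though neither coverage exceeds 1.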

chanqian18 commented 1 year ago

I'm working with Christine Orengo and we're trying to scan AlphaFold domains against PDB chains. I am a bit confused about which coverage settings/parameters (--cov-mode?) to use with Foldseek-TMalign in order to get only hits that cover at least 60% of the domain I use as a query. qcov, as in the original post, doesn't seem appropriate to me (the target is 50 residues and the query is 249 residues, yet qcov is 84.7%).

Sorry if I misunderstood any of the documentation. Thank you for your time!

martin-steinegger commented 1 year ago

@chanqian18 to annotate the AlphaFold domains with CATH domains, I would recommend using:

foldseek search afdb cath afdb_cath_aln tmp --max-seqs 10000 --cov-mode 1 -c 0.6

chanqian18 commented 1 year ago

Thank you for your quick replies!

At the moment we are not trying to match AlphaFold-predicted domains to CATH domains, but rather to PDB domains. We would like to avoid matching small regions in PDB to parts of the query (where the target is much shorter than the query), and would like at least 60% of the query to be covered by the alignment with a target. So, would --cov-mode 2 -c 0.6 be appropriate?

martin-steinegger commented 1 year ago

Yes, --cov-mode 2 -c 0.6 is right. Did this work for you?
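
For readers comparing the two suggestions in this thread, the difference between --cov-mode 1 and --cov-mode 2 can be sketched in Python. This is an illustration only, not Foldseek code. It assumes, based on my reading of the MMseqs2/Foldseek documentation, that mode 0 applies the -c threshold to both query and target, mode 1 to the target only, and mode 2 to the query only; other modes are left out. The function name `passes_coverage` and the example hit are hypothetical.

```python
def passes_coverage(qcov: float, tcov: float, cov_mode: int, c: float) -> bool:
    """Return True if a hit satisfies the coverage threshold c.

    Assumed semantics (see the lead-in): mode 0 checks query and target,
    mode 1 checks the target only, mode 2 checks the query only.
    """
    if cov_mode == 0:
        return qcov >= c and tcov >= c
    if cov_mode == 1:
        return tcov >= c
    if cov_mode == 2:
        return qcov >= c
    raise ValueError(f"cov_mode {cov_mode} is not covered by this sketch")


# Hypothetical hit: a 50-residue target aligned in full, without gaps,
# to part of a 249-residue query.
qcov = 50 / 249  # about 0.2 of the query is covered
tcov = 50 / 50   # the whole target is covered

kept_mode1 = passes_coverage(qcov, tcov, cov_mode=1, c=0.6)
kept_mode2 = passes_coverage(qcov, tcov, cov_mode=2, c=0.6)
print(f"mode 1 keeps hit: {kept_mode1}, mode 2 keeps hit: {kept_mode2}")
```

Under these assumptions, a short target matching a small part of a long query passes a target-coverage filter but is rejected by a query-coverage filter, which is the behaviour asked for above.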