kliment-olechnovic / voronota

Voronota is a software tool for analyzing three-dimensional structures of biological macromolecules using the Voronoi diagram of atomic balls.
https://kliment-olechnovic.github.io/voronota/
MIT License
25 stars 3 forks source link

Voronota version 1.24 command 'query-balls-clashes' exit error: Incomplete bracketing in descriptor string #6

Closed rubenalv closed 1 week ago

rubenalv commented 1 week ago

Hello,

I am trying to use voronota to calculate clashes between chains, as a filtering method to distinguish between potential OK and definitely bad protein:protein predicted models. I am getting an "incomplete bracketing" error.

Example code:

wget https://files.rcsb.org/download/8WMA.pdb
voronota get-balls-from-atoms-file < 8WMA.pdb > balls.txt
head balls.txt
  106.957 112.311 117.737 1.55 # 1 A 188 SER N . . 
  106.457 112.453 119.099 1.7 # 2 A 188 SER CA . .
voronota query-balls-clashes < balls.txt
  Voronota version 1.24 command 'query-balls-clashes' exit error: Incomplete bracketing in descriptor string '106.957'.
wget https://alphafold.ebi.ac.uk/files/AF-Q96SW2-F1-model_v4.pdb
voronota get-balls-from-atoms-file < AF-Q96SW2-F1-model_v4.pdb > balls.txt
head balls.txt
  25.141 30.65 -32.198 1.55 # 1 A 1 MET N . .
  25.429 29.224 -32.431 1.7 # 2 A 1 MET CA . .
 voronota query-balls-clashes  < balls.txt
Voronota version 1.24 command 'query-balls-clashes' exit error: Incomplete bracketing in descriptor string '25.141'.

Other commands yield output as expected

voronota calculate-vertices < balls.txt | head
0 1 2 4 24.745153674976038 29.779698617969835 -30.967695268777849 0.008129821377602
0 1 2 8 23.637478407987444 30.001338658600282 -33.391815049171896 0.476458261377157
voronota calculate-contacts < balls.txt | head
0 0 43.8943
0 1 11.0425

Am I not using the query-balls-clashes command correctly?

kliment-olechnovic commented 1 week ago

Hello,

For the 'query-balls-clashes' command, the input should be produced by 'get-balls-from-atoms-file' with '--annotated' flag. So, the pipeline may look like this:

voronota get-balls-from-atoms-file --annotated < 8WMA.pdb > balls.txt
voronota query-balls-clashes --clash-distance 3.0 < balls.txt > clashes.txt

the 'clashes.txt' will look like this:

c<A>r<190>a<15>R<GLN>A<O> c<A>r<203>a<99>R<LYS>A<N> 2.90538 -0.164624
c<A>r<192>a<29>R<ILE>A<N> c<A>r<201>a<90>R<VAL>A<O> 2.87901 -0.190985
c<A>r<192>a<32>R<ILE>A<O> c<A>r<201>a<87>R<VAL>A<N> 2.88789 -0.18211
c<A>r<194>a<48>R<VAL>A<N> c<A>r<199>a<79>R<ASN>A<O> 2.85436 -0.215637

but it can be made more table-like with the 'expand-descriptors' command, which expands every annotation token into space separated values: chainID resSeq iCode serial altLoc resName name.

cat clashes.txt | voronota expand-descriptors > table.txt

The ' table.txt will look like this:

A 145 . 1 . SER N A 147 . 17 . GLN N 5.43134 2.33134
A 145 . 1 . SER N A 148 . 29 . ALA N 5.95731 2.85731
A 145 . 1 . SER N A 148 . 33 . ALA CB 5.99677 2.74677
A 145 . 1 . SER N A 149 . 40 . GLU CG 5.81173 2.56173
A 145 . 1 . SER N A 149 . 41 . GLU CD 5.81399 2.56399

You could filter it for inter-chain clashes by only selecting rows where values in columns 1 and 8 differ. However, there is a trick, you can specify minimum residue sequence separation to be very large (e.g 99999), so that only inter-chain pairs satisfy it. The the full pipeline can look like this:

cat 8wma.pdb \
| voronota get-balls-from-atoms-file \
  --annotated \
  --include-heteroatoms \
| voronota query-balls-clashes \
  --min-seq-sep 99999 \
  --clash-distance 3.0  \
| voronota expand-descriptors \
> results.txt

Then the 'results.txt' for 8WMA will contain:

A 429 . 1824 . LYS NZ C 446 . 2695 . LYS O 2.68049 -0.389506

rubenalv commented 1 week ago

That is super-useful, many thanks!

Regarding the last two lines: Then the 'results.txt' for 8WMA will contain: A 429 . 1824 . LYS NZ C 446 . 2695 . LYS O 2.68049 -0.389506 From the docs I understood that the last two columns in an annotation are the contact area and the distance between the centers of the two atoms. If it is so, how do I interpret the distance being negative (-0.389)?

I run a few tests:

cat 8WMA.pdb | voronota get-balls-from-atoms-file   --annotated   --include-heteroatoms | voronota query-balls-clashes   --min-seq-sep 99999   --clash-distance 2.6  | voronota expand-descriptors
# none
cat 8WMA.pdb | voronota get-balls-from-atoms-file   --annotated   --include-heteroatoms | voronota query-balls-clashes   --min-seq-sep 99999   --clash-distance 2.68  | voronota expand-descriptors
# none
cat 8WMA.pdb | voronota get-balls-from-atoms-file   --annotated   --include-heteroatoms | voronota query-balls-clashes   --min-seq-sep 99999   --clash-distance 2.6805  | voronota expand-descriptors
A 429 . 1824 . LYS NZ C 446 . 2695 . LYS O 2.68049 -0.389506

It looks like --clash distance filters on the value of what I thought was the area?

Also, does the clashing function work on the centers of atoms, or does it consider the surface of the proteins? The reason I am asking is that I have tried simple python functions to calculate the distance, and they work as long as the protein volumes do not intersect. I have prediction cases where the proteins intersect and are not detected by the distance function: eg say we have two protein rings of identical radius that is longer than the clash threshold -if the surface of each ring crosses the center of the other one (and the ring planes are perpendicular), which would not make sense in a protein:protein interaction, no clash is detected using just the distance.

kliment-olechnovic commented 1 week ago

Hello,

No, 'query-balls-clashes' is just for distance-based clashes, not for tessellation-derived contact areas!

Running

voronota query-balls-clashes --help

gives

--clash-distance                number      clash distance threshold in angstroms, default is 3.0
--min-seq-sep                   number      minimum residue sequence separation, default is 2
--init-radius-for-BSH           number      initial radius for bounding sphere hierarchy
--help                                      flag to print usage help to stdout and exit
stdin   <-  list of balls (line format: 'annotation x y z r')
stdout  ->  list of clashes (line format: 'annotation1 annotation2 distance min-distance-between-balls')

So, the output line format is: 'annotation1 annotation2 distance min-distance-between-balls'. The annotation1 annotation2 are expanded after 'voronota expand-descriptors', but the last two values are in any case 'distance' and 'min-distance-between-balls'.

If you want to use tessellation derived contacts, you will need to first construct them (https://www.voronota.com/#computing-annotated-inter-atom-contacts), then you can filter them (https://www.voronota.com/#querying-annotated-inter-atom-contacts). A relatively comfortable way to define a construct+query pipeline is using https://bioinformatics.lt/wtsam/vorocontacts page, where you can define a query and then download the script.

In general, I think, constructing tessellation-based contacts may be an overkill for just identifying extreme unrealistic clashes, so 'query-balls-clashes' may be enough.

If you still want contact areas, but computed much faster than with the OG Voronota, check out Voronota-LT "https://www.voronota.com/expansion_lt/", that also has somehow more sane interface.

For example:

voronota-lt -i 8wma.pdb --compute-only-inter-chain-contacts --write-contacts-to-file contacts.tsv

will produce a table in 'contacts.tsv' that contains Voronoi tessellation-based contacts and looks like this:

ca_header  ID1_chain  ID1_residue  ID1_atom  ID2_chain  ID2_residue  ID2_atom  ID1_index  ID2_index  area        arc_legth  distance
ca         A          327|TRP      CE3       C          445|LEU      CG        1045       2687       0.00469616  0          5.42041
ca         A          327|TRP      CE3       C          445|LEU      CD2       1045       2689       0.862091    0.561291   5.5653
ca         A          327|TRP      CZ3       C          445|LEU      CG        1047       2687       9.1327      0.645198   4.62657
ca         A          327|TRP      CZ3       C          445|LEU      CD2       1047       2689       2.43678     0.641754   5.03981
ca         A          327|TRP      CZ3       C          445|LEU      CD1       1047       2688       1.0007      0.886454   5.09493
ca         A          327|TRP      CZ3       C          445|LEU      CA        1047       2683       0.356817    0.614441   5.77186
ca         A          327|TRP      CH2       C          445|LEU      CA        1048       2683       0.397446    0.737037   6.05122
ca         A          330|ALA      O         C          444|TRP      CZ2       1071       2679       3.49465     0          4.10385
ca         A          331|ALA      O         C          444|TRP      CZ2       1076       2679       1.79317     0          4.68959
rubenalv commented 1 week ago

I've given myself a scold for not reading the documentation well (twice). I am at the protocol development stage, and since I am new to the field even knowing exactly what I need to solve the problems takes time... sifting through docs backfires then! So I really appreciate the time you have taken to answer. The goal is to determine whether two proteins are likely to interact (when e.g. one protein is a mutant, so it is not registered in a database like IntAct), starting from proteome scale, so I am starting with fast and less accurate methods to progressively reduce search space to use slower and more accurate approaches.

So yes, thank you very much for the help and examples! I can close the (no) issue as solved now.