google-deepmind / alphafold

Illegal character in pdb_seqres.txt database file #623

Closed: doricke closed this issue 1 year ago

doricke commented 1 year ago

Parse failed (sequence file /db/pdb_seqres/pdb_seqres.txt):
Line 1366022: illegal character 0

Errors in file:

>7ooo_B mol:na length:11  DNA (5'-D(*CP*TP*(RWQ)P*TP*CP*TP*TP*TP*G)-3')
CT05ATCTTTG
>7ooo_E mol:na length:11  DNA (5'-D(*CP*TP*(RWQ)P*TP*CP*TP*TP*TP*G)-3')
CT05ATCTTTG
>7ozz_B mol:na length:11  DNA (5'-D(*CP*TP*(RWR)P*TP*CP*TP*TP*TP*G)-3')
CT05HTCTTTG

This is likely a PDB parsing error. For 7ozz, the 05H is likely coming from the HETATM entries:

ATOM 251 O4 DT B 12 -29.103 0.246 5.217 1.00 94.87 O
ATOM 252 C5 DT B 12 -28.572 1.806 6.922 1.00100.00 C
ATOM 253 C7 DT B 12 -29.343 2.919 6.285 1.00 97.49 C
ATOM 254 C6 DT B 12 -27.896 1.952 8.066 1.00102.37 C
HETATM 255 C1 05H B 13 -19.291 0.434 4.896 1.00100.82 C
HETATM 256 C1' 05H B 13 -20.815 -2.819 1.936 1.00 98.67 C
HETATM 257 C11 05H B 13 -25.832 1.973 4.789 1.00 82.44 C
HETATM 258 C2 05H B 13 -22.765 -2.247 0.431 1.00 92.33 C
HETATM 259 C2' 05H B 13 -19.788 -3.180 0.872 1.00 98.03 C
HETATM 260 C21 05H B 13 -23.920 -1.793 5.039 1.00 89.93 C
HETATM 261 C3 05H B 13 -20.565 -0.673 8.783 1.00 98.27 C
HETATM 262 C3' 05H B 13 -18.651 -2.241 1.179 1.00 97.27 C
HETATM 263 C4 05H B 13 -23.763 -0.151 0.487 1.00 89.17 C
HETATM 264 C4' 05H B 13 -18.700 -2.056 2.670 1.00 96.22 C
HETATM 265 C41 05H B 13 -25.360 -0.401 3.891 1.00 84.99 C
HETATM 266 C5 05H B 13 -22.810 0.257 1.532 1.00 92.64 C
HETATM 267 C5' 05H B 13 -18.184 -0.672 2.999 1.00 90.23 C
HETATM 268 C51 05H B 13 -25.154 0.642 4.907 1.00 85.30 C
HETATM 269 C6 05H B 13 -21.878 -0.660 1.967 1.00 90.50 C
HETATM 270 C61 05H B 13 -24.340 0.369 5.978 1.00 96.29 C
HETATM 271 C7 05H B 13 -22.853 1.639 2.117 1.00 86.19 C
HETATM 272 C71 05H B 13 -19.481 0.396 6.382 1.00107.13 C
HETATM 273 N1 05H B 13 -21.833 -1.885 1.435 1.00 88.64 N
HETATM 274 N11 05H B 13 -23.699 -0.813 6.045 1.00 93.76 N
HETATM 275 N3 05H B 13 -23.683 -1.382 -0.001 1.00 91.20 N
HETATM 276 N31 05H B 13 -24.720 -1.558 4.007 1.00 87.08 N
HETATM 277 N5' 05H B 13 -18.512 -0.505 4.395 1.00 99.47 N
HETATM 278 O2 05H B 13 -22.767 -3.380 -0.090 1.00 94.75 O
HETATM 279 O2' 05H B 13 -20.614 -1.724 7.810 1.00107.20 O
HETATM 280 O21 05H B 13 -23.387 -2.920 5.061 1.00102.39 O
HETATM 281 O3 05H B 13 -19.840 1.281 4.218 1.00 97.98 O
HETATM 282 O3' 05H B 13 -17.346 -2.673 0.724 1.00106.63 O
HETATM 283 O4 05H B 13 -24.624 0.643 0.067 1.00 90.61 O
HETATM 284 O4' 05H B 13 -20.087 -2.179 2.993 1.00108.67 O
HETATM 285 O41 05H B 13 -26.111 -0.183 2.926 1.00 91.45 O
HETATM 286 O5' 05H B 13 -21.611 2.197 9.740 1.00102.52 O
HETATM 287 OP1 05H B 13 -21.601 4.692 10.303 1.00137.19 O
HETATM 288 OP2 05H B 13 -22.877 3.934 8.214 1.00 90.96 O
HETATM 289 P 05H B 13 -22.373 3.591 9.699 1.00130.32 P
HETATM 290 C1'1 05H B 13 -22.792 -1.087 7.175 1.00 93.73 C
HETATM 291 C2'1 05H B 13 -21.353 -1.242 6.698 1.00105.90 C
HETATM 292 C3'1 05H B 13 -20.955 0.194 6.627 1.00110.52 C
HETATM 293 C4'1 05H B 13 -21.317 0.436 8.057 1.00108.99 C
HETATM 294 C5'1 05H B 13 -21.020 1.859 8.518 1.00104.57 C
HETATM 295 O4'1 05H B 13 -22.699 0.045 8.059 1.00110.59 O
ATOM 296 P DT B 14 -16.630 -1.995 -0.578 1.00 97.66 P
ATOM 297 OP1 DT B 14 -15.200 -2.365 -0.526 1.00104.42 O
ATOM 298 OP2 DT B 14 -17.060 -0.575 -0.633 1.00 98.23 O
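
To check which non-standard residues a structure carries, the HETATM records can be summarized per chain. The following is a minimal sketch, not part of AlphaFold: the function name hetatm_residues is hypothetical, and it assumes the original fixed-column PDB file (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26), not the whitespace-collapsed paste above.

```python
from typing import Iterable, List, Tuple


def hetatm_residues(pdb_lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return unique (chain_id, residue_number, residue_name) tuples for
    HETATM records, in the order they first appear.

    Hypothetical helper; assumes standard fixed-column PDB records.
    """
    seen = {}  # dict keys keep insertion order (Python 3.7+)
    for line in pdb_lines:
        # Skip everything except HETATM records long enough to hold the fields.
        if not line.startswith("HETATM") or len(line) < 26:
            continue
        res_name = line[17:20].strip()  # columns 18-20
        chain_id = line[21:22]          # column 22
        res_seq = int(line[22:26])      # columns 23-26
        seen.setdefault((chain_id, res_seq, res_name), None)
    return list(seen)
```

For the records quoted above, a correctly column-aligned file would be expected to report residue 05H at position 13 of chain B, which matches the characters that end up in the CT05HTCTTTG sequence line.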

dsclassen commented 1 year ago

I am also getting a similar error during the hmmbuild query step:

I1110 08:13:51.837537 140292054591296 run_docker.py:255] Parse failed (sequence file /mnt/pdb_seqres_database_path/pdb_seqres.txt):
I1110 08:13:51.837775 140292054591296 run_docker.py:255] Line 1366492: illegal character 0

Line 1366491-1366492 of the pdb_seqres.txt looks like this:

>7ooo_B mol:na length:11  DNA (5'-D(*CP*TP*(RWQ)P*TP*CP*TP*TP*TP*G)-3')
CT05ATCTTTG
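
To list every affected entry rather than just the first one the parser trips over, the file can be scanned for sequence lines that contain anything other than letters. This is a rough sketch under assumptions: the name find_suspect_records is made up for illustration, and it treats any non-ASCII-letter character in a sequence line as suspect, which may be stricter than what the HMMER parser actually rejects.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: valid sequence lines consist only of ASCII letters.
_NON_LETTER = re.compile(r"[^A-Za-z]")


def find_suspect_records(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, header, sequence) for each sequence line that
    contains a non-letter character. Line numbers are 1-based.

    Hypothetical helper for inspecting a pdb_seqres.txt-style file.
    """
    header = ""
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            header = line  # remember the most recent header line
        elif line and _NON_LETTER.search(line):
            yield line_number, header, line
```

Passing an open file handle for pdb_seqres.txt should yield the 7ooo_B entry shown above, with the line number pointing at the sequence line, as in the error message.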
Augustin-Zidek commented 1 year ago

You can fix this by filtering out the bad sequences from pdb_seqres.txt. We will submit a fix to the download script for this.

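As a workaround for now, run the following two commands in the directory containing pdb_seqres.txt. They keep only the protein entries (each mol:protein header and the sequence line after it) and overwrite the original file: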

grep --after-context=1 --no-group-separator '>.* mol:protein' "pdb_seqres.txt" > "pdb_seqres_filtered.txt"
mv "pdb_seqres_filtered.txt" "pdb_seqres.txt"
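
Where GNU grep is not available (the --no-group-separator option is a GNU extension), the same filtering could be done in Python. This is only a sketch of the equivalent logic, not the fix that went into the download script; keep_protein_records and filter_file are hypothetical names.

```python
import os
import re
from typing import Iterable, Iterator

# Same pattern as in the grep command above.
_PROTEIN_HEADER = re.compile(r">.* mol:protein")


def keep_protein_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line matching the protein header pattern plus the one
    line that follows it, mirroring grep --after-context=1."""
    keep_next = False
    for line in lines:
        if _PROTEIN_HEADER.search(line):
            yield line
            keep_next = True
        elif keep_next:
            yield line
            keep_next = False


def filter_file(path: str) -> None:
    """Rewrite the file at path in place, keeping only protein records."""
    tmp_path = path + ".filtered"
    with open(path) as src, open(tmp_path, "w") as dst:
        dst.writelines(keep_protein_records(src))
    os.replace(tmp_path, path)  # same effect as the mv step
```

Calling filter_file("pdb_seqres.txt") from the database directory would then play the role of the two commands above.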
dsclassen commented 1 year ago

Thank you. This worked for me. I am now able to run AF2 with --model_preset=multimer.

Augustin-Zidek commented 1 year ago

Thanks, we fixed this in the pdb_seqres download script in v2.3.0.