Building a custom PDB sequence database for HHblits

Hello,

I am trying to follow the steps on the wiki page to construct a custom PDB database, but I have questions about some of them. Please excuse me in advance if my questions are too basic or have already been answered in other posts that I missed.

According to the tutorial, the first step is to download the PDB using rsync:

rsync --progress -rlpt -v -z --port=33444 rsync.wwpdb.org::ftp/data/structures/divided/mmCIF .

This downloads all PDB structures, including those for non-protein entries (e.g., nucleic acids). How are these files handled or eliminated in the subsequent steps? I assume they should be, since we are only interested in protein sequences.
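To make the question concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not taken from cif2fasta.py, and I do not know whether the tool does anything like this. It assumes the rows of the mmCIF `_entity_poly` category have already been read into dictionaries by some mmCIF parser; the function name `protein_entities` and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def protein_entities(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Keep only polypeptide entities from already-parsed _entity_poly rows.

    Assumption: each row is a dict with the keys 'entity_id', 'type' and
    'pdbx_seq_one_letter_code_can', as produced by some mmCIF parser.
    """
    kept = []
    for row in rows:
        # mmCIF entity types include 'polypeptide(L)', 'polypeptide(D)',
        # 'polyribonucleotide', 'polydeoxyribonucleotide', and others.
        if not row.get("type", "").lower().startswith("polypeptide"):
            continue
        # Sequence fields are often wrapped over several lines in mmCIF,
        # so remove all whitespace before using them.
        seq = "".join(row.get("pdbx_seq_one_letter_code_can", "").split())
        if seq:
            kept.append((row["entity_id"], seq))
    return kept


if __name__ == "__main__":
    example = [
        {"entity_id": "1", "type": "polypeptide(L)",
         "pdbx_seq_one_letter_code_can": "MKT\nAYI"},
        {"entity_id": "2", "type": "polydeoxyribonucleotide",
         "pdbx_seq_one_letter_code_can": "ACGT"},
    ]
    for entity_id, seq in protein_entities(example):
        print(f">entity_{entity_id}\n{seq}")
```

With a filter like this, a DNA-only entry would simply contribute no sequences.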

The next step is to generate FASTA sequences of the proteins using cif2fasta.py. Since proteins in the PDB may carry mutations and missing regions, which sequence will this tool output: the one in the PDB entry (the engineered sequence) or the canonical one (as found in UniProt)?
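One way I could check this myself is to compare a sequence from the generated FASTA against the corresponding UniProt sequence. The sketch below is only an illustration: it assumes the two sequences have already been aligned to equal length with '-' as the gap character, and the function name `diff_aligned` is hypothetical.

```python
from typing import List, Tuple


def diff_aligned(pdb_seq: str, ref_seq: str) -> List[Tuple[int, str, str, str]]:
    """List differences between two pre-aligned sequences.

    Returns tuples of (reference position, reference residue, PDB residue,
    kind). Positions are 1-based and count reference residues only; for an
    insertion, the position is that of the last reference residue before it.
    """
    if len(pdb_seq) != len(ref_seq):
        raise ValueError("sequences must be aligned to the same length")
    diffs = []
    ref_pos = 0
    for p, r in zip(pdb_seq, ref_seq):
        if r != "-":
            ref_pos += 1
        if p == r:
            continue
        if p == "-":
            kind = "missing"        # residue absent from the PDB-derived sequence
        elif r == "-":
            kind = "insertion"      # residue absent from the reference (e.g. a tag)
        else:
            kind = "substitution"   # e.g. an engineered mutation
        diffs.append((ref_pos, r, p, kind))
    return diffs


if __name__ == "__main__":
    for d in diff_aligned("MKA-YI", "MKTAYI"):
        print(d)
```

If the output FASTA showed substitutions relative to UniProt, that would suggest the tool reports the sequence as deposited rather than the canonical one.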

Many thanks in advance for your clarifications.

soedinglab / hh-suite, issue #300