kad-ecoli / rna3db

maintain local copy of RNA structure database
0 stars 0 forks source link

Output of cleaning script for some molecules is blank #7

Closed marc-harary closed 3 years ago

marc-harary commented 3 years ago

Some of the cleaned .pdb files are blank except for TER keyword, producing blank FASTA files, as well. One such example is molecule 1h3e. Is this just because all of the pairs in the original file are non-canonical? Here's the script I wrote for reference.

output_dir=/gpfs/ysm/home/mah258/Pyle/evaluate/output/cleaned
script_dir=/gpfs/ysm/home/mah258/Pyle/rna3db/script
data_dir=/gpfs/ysm/home/mah258/Pyle/PDB_dataset

cd $output_dir
files=($data_dir/*_sequences/*)

for file in ${files[*]}; do
    mol=$(basename $file)
    mol=$(echo $mol | cut -d "-" -f 1)
    $script_dir/fetch.py $mol
    $script_dir/clean_pdb.py $mol.pdb clean.pdb -StartIndex=1 -NewChainID=_
    $script_dir/pdb2fasta.py clean.pdb > $mol.fasta
    echo "# ${mol}" > $mol.label
    echo "i             j" >> $mol.label
    $script_dir/x3dna-dssr  -i=clean.pdb --pair-only|grep -P "( WC )|( Wobble )"|cut -c6-35|sed 's/[A-Z]//g' >> $mol.label
done
kad-ecoli commented 3 years ago

The following line is incorrect:

mol=$(echo $mol | cut -d "-" -f 1)

It should be

mol=$(echo $mol | cut -d "-" -f 1,3|sed 's/-//g')

In other words, 1h3e-1-B corresponds to mol=1h3eB. Here, the chain ID B cannot be omitted, because a single PDB could include multiple chains. For example, 1h3e includes two chains, where chain A is a protein while chain B is an RNA.