uw-ipd / RoseTTAFold2NA

RoseTTAFold2 protein/nucleic acid complex prediction
MIT License

Issues with RNA multiple sequence alignment #48

Open TheBil99 opened 1 year ago

TheBil99 commented 1 year ago

Hi, thanks for the amazing work!!

I am having some issues when trying to run the model for protein-RNA docking. I first tried the protein and RNA sequences you provided for the example, and everything worked very well. But with other sequences I get errors.

The stderr files for hhsearch and the protein MSA contain no errors. This is the content of the stderr file for the RNA MSA:

/aplic/noarch/software/RoseTTAFold2NA/0.2-Miniconda3-4.9.2/input_prep/make_rna_msa.sh: line 9: [: missing `]'
rm: cannot remove ‘rfam1.list.split.’: No such file or directory
rm: cannot remove ‘rfam2.list.split.’: No such file or directory
rm: cannot remove ‘blastn1.list.split.*’: No such file or directory
/aplic/noarch/software/RoseTTAFold2NA/0.2-Miniconda3-4.9.2/input_prep/make_rna_msa.sh: line 101: 221255 Aborted (core dumped) cd-hit-est-2d -T $CPU -i $in_fasta -i2 trim.db -c $cut -o cdhitest2d.db -l $throw_away_sequences -M 0 &>/dev/null
grep: db: No such file or directory
rm: cannot remove ‘cdhitest2d.db’: No such file or directory
rm: cannot remove ‘cdhitest2d.db.clstr’: No such file or directory
rm: cannot remove ‘db.clstr’: No such file or directory

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

grep: guide_RNA.afa: No such file or directory

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

Error: Failed to open target sequence database db for reading

Alignment input open failed. couldn't open nhmmer.a2m for reading

rm: cannot remove ‘nhmmer.a2m’: No such file or directory

protein MIRNKAFVVRLYPNAAQTELINRTLGSARFVYNHFLARRIAAYKESGKGLTYGQTSSELTLLKQAEETSWLSEVDKFALQNSLKNLETAYKNFFRTVKQSGKKVGFPRFRKKRTGESYRTQFTNNNIQIGEGRLKLPKLGWVKTKGQQDIQGKILNVTVRRIHEGHYEASVLCEVEIPYLPAAPKFAAGVAVGIKDFAIVTDGVRFKHEQNPKYYRSTLKRLRKAQQTLSRRKKGSARYGKAKTKLARIHKRIVNKRQDFLHKLTTSLVREYEIIGTEHLKPDNMRKNRRLALSISDAGWGEFIRQLEYKAAWYGRLVSKVSPYFPSSQLCHDCGFKNPEVKNLAVRTWTCPNCGETHDRDENAALNIRREALVAAGISDTLNAHGGYVRPASAGNGLRSENHATLVVSRADPKKK

RNA TCGGCGTGAAGCGTTGGTGGCTGCGGGAATCTCAGACACCTTAAACGCTCATGGAGGCTATGTCAGACCTGCTTCGGGGGCAATGGTCTGCGAAGTGAGAATCACGCGACTTTAGTCGTGTGAGGTTCAAGAGTCCCTTGGCGCCC

P.S.: I get the same issue when using this RNA sequence with the protein you provided in the example, and I really do not understand what is wrong with the RNA sequence I am using.

Thanks!!

bifxcore commented 1 year ago

@TheBil99 have you found what the error was? I have the same issue. Example works fine but with my own input files, I get a CD-HIT error and the run aborts. The script make_rna_msa.sh contains this:

overwrite=true
if [ -f $out_dir/$out_tag.afa -a $overwrite = false] # line 9
then
    exit 0
fi

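Inside [ ... ], '-a' between two expressions is a logical AND (as a unary operator it tests whether a file exists), so the test reads "the output file exists and $overwrite is false". The missing `]' error comes from 'false]', which has no space before the closing bracket. With overwrite=true hard-coded, I'm assuming the flag was used for testing and that by default the script overwrites previous output if re-run (?!).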

So I changed line 9 to:

if [[ -f $out_dir/$out_tag.afa && $overwrite = false ]]
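A minimal sketch of why the original test fails and how the check behaves once it is written correctly. `[` is an ordinary command that expects `]` as its own final argument, so `false]` without a space is read as a single word. The helper name `should_skip` and the example paths are hypothetical and not part of make_rna_msa.sh; the portable `[ ... ] && [ ... ]` form is used here in place of `[[ ... && ... ]]`, and the two are meant to be equivalent for this check.

```shell
# Hypothetical helper mirroring the check on line 9 of make_rna_msa.sh.
# Broken original:  [ -f $out_dir/$out_tag.afa -a $overwrite = false]
#   -> "false]" is one word, so `[` never receives its closing "]".
# Succeeds (exit status 0) only when the output file already exists
# and overwrite is the literal string "false".
should_skip() {
    [ -f "$1" ] && [ "$2" = false ]
}

# Example call with made-up paths; the variable names follow the script.
out_dir=/tmp/rf2na_example
out_tag=rna
overwrite=true
if should_skip "$out_dir/$out_tag.afa" "$overwrite"; then
    echo "existing alignment found, skipping"
else
    echo "running the RNA MSA step"
fi
```

With overwrite=true the helper never succeeds, so the MSA step always runs, which matches the script's apparent default of overwriting earlier output.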

That gets rid of the line 9: [: missing `]' error, but I still get the error with CD-HIT not finding the databases:

make_rna_msa.sh: line 101: 205786 Aborted (core dumped) cd-hit-est-2d -T $CPU -i $in_fasta -i2 trim.db -c $cut -o cdhitest2d.db -l $throw_away_sequences -M 0 &>/dev/null
grep: db: No such file or directory
rm: cannot remove ‘cdhitest2d.db’: No such file or directory
rm: cannot remove ‘cdhitest2d.db.clstr’: No such file or directory
rm: cannot remove ‘db.clstr’: No such file or directory

Error: Failed to open target sequence database db for reading

Alignment input open failed.
   couldn't open nhmmer.a2m for reading
etc...

The part that fails is these two lines:

    cd-hit-est -T $CPU -i cdhitest2d.db -c $cut -o db -l $throw_away_sequences -M 0 &> /dev/null
    nhits=`grep '^>' db | wc -l` # db output is missing!

It looks like cd-hit-est-2d fails to write cdhitest2d.db, so the following cd-hit-est call has no input and fails to write db. My RNA sequence is 76 bases long, which means $throw_away_sequences=30. Maybe the 46 bases remaining somehow don't align with anything in trim.db? That is hard to believe: my RNA sequence is a tRNA from Pseudomonas, so it should get plenty of hits in GenBank?!
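Because both cd-hit calls redirect their output to /dev/null, the first visible symptom is the later `grep` on a file that was never written. A minimal sketch of a guard that fails loudly instead; the helper name `count_fasta_records` is hypothetical and does not exist in make_rna_msa.sh.

```shell
# Hypothetical helper: count FASTA records in a file, but stop with a
# clear message when the file is missing or empty (for example because
# the preceding cd-hit step aborted).
count_fasta_records() {
    if [ ! -s "$1" ]; then
        echo "error: '$1' is missing or empty; the previous step probably failed" >&2
        return 1
    fi
    # grep -c exits non-zero when it counts 0 lines; "|| true" keeps the
    # helper's exit status at 0 in that case while still printing 0.
    grep -c '^>' "$1" || true
}
```

In the script this could stand in for the backtick `grep '^>' db | wc -l` call, e.g. `nhits=$(count_fasta_records db) || exit 1`. Temporarily removing the `&> /dev/null` redirections while debugging should also surface cd-hit's own error message.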

After days of installing and downloading > 2.5TB of data, I was really hoping to get this to work. Grateful for any hints.

bifxcore commented 1 year ago

The CD-HIT error turns out to be a known issue: https://github.com/weizhongli/cdhit/issues/26

This worked for me: uninstall the cd-hit conda package (and wait...wait some more... takes forever to complete), build from source with MAX_SEQ=10000000 and add the bin dir to my PATH: cd-hit-est-2d now runs fine.
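A rough sketch of those steps, wrapped in a function so that nothing runs until it is called. The function name `build_cdhit`, the default source directory, and the assumption that the build leaves the binaries in the top of the source tree are mine, not from the thread; adjust them to your system. Uninstalling the conda package (for example with `conda remove cd-hit`, assuming that is the package name in your environment) is left as a separate manual step.

```shell
# Sketch only: defines a function; nothing is downloaded or built
# until build_cdhit is called explicitly.
build_cdhit() {
    # Hypothetical default location for the source checkout.
    src_dir=${1:-"$HOME/src/cdhit"}

    # Fetch the sources and build with a larger maximum sequence length.
    git clone https://github.com/weizhongli/cdhit.git "$src_dir" &&
    make -C "$src_dir" MAX_SEQ=10000000 &&
    # Assumption: the binaries (cd-hit-est, cd-hit-est-2d, ...) end up in
    # the top-level source directory, so that directory goes on PATH.
    export PATH="$src_dir:$PATH"
}
```

After calling `build_cdhit`, `command -v cd-hit-est-2d` should point at the freshly built binary rather than the conda one.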

huangtinglin commented 1 year ago

@bifxcore Thanks for sharing! I was wondering how many RNA sequences your MSAs generally contain. With my data, the RNA MSA search finds hardly any sequences, a few at most.

bifxcore commented 12 months ago

@huangtinglin my MSA contained over 6K RNA sequences. Note that I filtered out all those that contained non-standard bases, to avoid the msa parser AssertionError (see https://github.com/uw-ipd/RoseTTAFold2NA/issues/27).
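One possible way to do that filtering, as a minimal sketch. It assumes the MSA is a FASTA-style file (such as the .afa output) and that "standard" means A, C, G, U, T in either case plus '-' for gaps; both the alphabet and the helper name `filter_standard_bases` are assumptions, and the parser in the repository may accept a different set of characters.

```shell
# Hypothetical helper: print only the records of a FASTA/AFA file whose
# sequence consists solely of the assumed alphabet A, C, G, U, T
# (either case) and '-' gap characters.
filter_standard_bases() {
    awk '
        # Print the buffered record if its sequence has no other characters.
        function emit() {
            if (header != "" && seq !~ /[^ACGUTacgut-]/) {
                printf "%s\n%s\n", header, seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # join wrapped lines
        END  { emit() }
    ' "$1"
}
```

Usage would look like `filter_standard_bases rna.afa > rna.filtered.afa` (file names made up). Note that wrapped sequences are written back out on a single line.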

huangtinglin commented 12 months ago

> @huangtinglin my MSA contained over 6K RNA sequences. Note that I filtered out all those that contained non-standard bases, to avoid the msa parser AssertionError (see #27).

Interesting... I have no idea why my RNA sequences retrieve only a few aligned sequences. Could you provide an example of an RNA sequence you use? Thanks!

suvrajitm commented 9 months ago

Hi RoseTTAFold2NA team, I have a related question. Could you please point me to the code module where protein-RNA binding/docking is implemented, and tell me which parameters (input or output, if any) determine the binding strength? Thank you.

MrGao1997 commented 7 months ago

Hi, I have run into the same problem and do not know how to fix it. Have you managed to resolve it? Should I re-download the dataset or reset the environment?