Closed ealdraed closed 5 years ago
I read about this CM database update and was wondering before where to find the .gff3 that is mentioned in the README (https://github.com/tseemann/prokka/blob/f7f819b8b78ac61eb831775214e609f0917bc11a/db/cm/README). As @tseemann wrote above, this file seems not to be supplied by Rfam anymore.
Since you did not specify what exact criteria you used to get the Rfam family accessions, I took a shot at it. Here is my approach, which should be automatable.
Thanks to the public read-only Rfam MySQL DB, we can get a list for Bacteria, Viruses AND Archaea:
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < query.sql > result.tab
query.sql will look like this:
SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE (f.type LIKE 'Gene;'
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Bacteria%';
Replace Bacteria with Viruses and Archaea and you have three files that could be used with cmfetch to get the corresponding CMs.
Note: I did not include tRNA (predicted by aragorn), rRNA (predicted by barrnap, rnammer) and lncRNA (Eukaryotes only).
The number of models (Rfam 12.2) is:
Bacteria: 761 Viruses: 222 Archaea: 109 As a side note: It seems, that the current CMs also include eukaryotic ones (e.g. RNaseP_nuc (RF00009)).
Let me know what you think and if you consider an update.
Regards
Hey @ealdraed , thanks for the hint! Your solution definitely sounds like a more solid approach. For the status quo I searched for bacterial Rfam families via the web interface. Maybe this led to some erroneous Rfam entries.
@tseemann, aren't the taxid the numbers appearing in the first column of the taxonomy.txt files from Rfam? I checked some and they match the NCBI taxonomy. ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.3/database_files/taxonomy.txt.gz
PS: I need to annotate a bunch of Archaea and it would be great to have a way around before the next release. I tried the mysql approach but there might be a typo or something in the query.sql because nothing happens
@xvazquezc
I just executed the command again. I will attach the tab-separated results for Archaea, Bacteria and Viruses for your convenience. Please note that the files contain a header line.
Rfam was recently updated to 12.3. Here are the changes in family numbers (tRNA, rRNA and lncRNA entry types not included):
Rfam | 12.2 | 12.3 | Removed | Added | Renamed |
---|---|---|---|---|---|
Archaea | 109 | 109 | 0 | 0 | 0 |
Bacteria | 761 | 829 | 1 | 69 | 1 |
Viruses | 222 | 225 | 1 | 4 | 0 |
Rfam_archaea_12.2.txt Rfam_archaea_12.3.txt (same as above) Rfam_bacteria_12.2.txt Rfam_bacteria_12.3.txt Rfam_viruses_12.2.txt Rfam_viruses_12.3.txt
Hello @tseemann, @oschwengers, @xvazquezc, hello all!
The current Rfam version is now 13.0. I noticed that the above SQL query included tmRNA
which I think should also be omitted since aragorn
takes care of that. The following is the updated query:
SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE ((f.type LIKE 'Gene;' AND f.description NOT LIKE '%transfer-messenger RNA')
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Archaea%';
Here are the numbers of families (excluding tRNA, tmRNA, rRNA and lncRNA):
Rfam | 13.0 |
---|---|
Archaea | 115 |
Bacteria | 850 |
Viruses | 210 |
I also like to give a small outline of the process of getting the latest Rfam covariance models (CMs) based on the above SQL.
Save the above query into three text files (e.g. Rfam_archaea.sql
, Rfam_bacteria.sql
, Rfam_viruses.sql
) and make sure to replace Archaea
in the last line with Bacteria
and Viruses
, respectively. With these three files you query the Rfam database:
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_archaea.sql > Rfam_archaea_13.0.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_bacteria.sql > Rfam_bacteria_13.0.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_viruses.sql > Rfam_viruses_13.0.txt
For your convenience, here is the output of the above commands: Rfam_archaea_13.0.txt Rfam_bacteria_13.0.txt Rfam_viruses_13.0.txt
Remove header in *.txt
files (you can also use tail -n +2
or do it manually):
sed -i '1d' Rfam_archaea_13.0.txt
sed -i '1d' Rfam_bacteria_13.0.txt
sed -i '1d' Rfam_viruses_13.0.txt
Download Rfam CMs and extract families:
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/13.0/Rfam.cm.gz
gunzip Rfam.cm.gz
cmfetch -o Rfam_archaea.cm -f Rfam.cm Rfam_archaea_13.0.txt
cmfetch -o Rfam_bacteria.cm -f Rfam.cm Rfam_bacteria_13.0.txt
cmfetch -o Rfam_viruses.cm -f Rfam.cm Rfam_viruses_13.0.txt
Finally convert to binary format and remove intermediate files:
cmconvert -o Archaea -b Rfam_archaea.cm
cmconvert -o Bacteria -b Rfam_bacteria.cm
cmconvert -o Viruses -b Rfam_viruses.cm
rm Rfam.cm
rm Rfam_archaea{.cm,_13.0.txt}
rm Rfam_bacteria{.cm,_13.0.txt}
rm Rfam_viruses{.cm,_13.0.txt}
Make sure to copy the files to prokka's db/cm
folder. If you are working with Archaea, you must edit the Prokka source code, since the --setupdb
command only cmpress
es Bacteria
and Viruses
at the moment. Edit this line to contain ,Archaea
. Then run prokka --setupdb
and you should be good to start your next annotation.
Hello @ealdraed
I had the same issue with CM database setup so I followed your solution. Because the current Rfam version is 14.1, I tried to get Rfamarchaea14.1.txt, Rfambacteria14.1.txt, Rfamviruses14.1.txt applying the following command. However, it says no such file or directory
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_archaea.sql > Rfam_archaea_14.1.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_bacteria.sql > Rfam_bacteria_14.1.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_viruses.sql > Rfam_viruses_14.1.txt
I also tried to find the files from their website (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.1/) but couldn't find the files.
Do you know how to get those up-to-date files? if so, could you tell me how to get them?
Thanks very much
Hello @djung0928 and sorry for the delayed response.
Do you have prepared the SQL files mentioned above?
E.g. Rfam_bacteria.sql:
SELECT DISTINCT f.rfam_acc, f.type, f.description FROM taxonomy tx INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc INNER JOIN family f ON f.rfam_acc = fr.rfam_acc WHERE ((f.type LIKE 'Gene;' AND f.description NOT LIKE '%transfer-messenger RNA') OR f.type LIKE '%CRISPR;' OR f.type LIKE '%antisense;' OR f.type LIKE '%antitoxin;' OR f.type LIKE '%miRNA;' OR f.type LIKE '%ribozyme;' OR f.type LIKE '%sRNA;' OR f.type LIKE '%snRNA%' OR f.type LIKE 'Intron;' OR f.type LIKE 'Cis-reg;' OR f.type LIKE '%IRES;' OR f.type LIKE '%frameshift_element;' OR f.type LIKE '%leader;' OR f.type LIKE '%riboswitch;' OR f.type LIKE '%thermoregulator;') AND tx.tax_string LIKE 'Bacteria%';
The mentioned text files are then output after the mysql
command has finished.
I was investigating some discrepancies between Prokka ncRNA annotations and annotation made directly by Infernal+Rfam using the recommended procedure. That's what led me to this thread.
@tseemann, @ealdraed's post on Nov 3 2017 looks great, although at this point it needs to be updated for Rfam 14.x. I hope to submit a pull request with updated Viral, Bacterial, and Archaeal CMs in the next couple of days.
@standage Prokka uses nhmmer
for speed. Ideally one would use cmscan
.
@tseemann Really? It looks like the latest master invokes cmscan
: https://github.com/tseemann/prokka/blob/17d953461c2175347d49531b4bdcbed87ccd04eb/bin/prokka#L624.
@standage Whoops I got confused! barrnap
my rRNA predictor uses nhmmer
.
Hello @tseemann !
I commented on #207, which is a closed issue and was wondering if it actually caught your attention. I open this new issue as I think the current covariance models might include eukaryotic sequences. Additionally, we could include Archaea!