tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation
833 stars 226 forks source link

Infernal (CM) database update (Bacteria/Archaea/Viruses) #243

Closed ealdraed closed 5 years ago

ealdraed commented 7 years ago

Hello @tseemann !

I commented on #207, which is a closed issue and was wondering if it actually caught your attention. I open this new issue as I think the current covariance models might include eukaryotic sequences. Additionally, we could include Archaea!

tseemann commented 7 years ago

Thanks for re-opening it - i did somehow miss i, sorry! I paste it below:

I read about this CM database update and was wondering before where to find the .gff3 that is mentioned in the README (https://github.com/tseemann/prokka/blob/f7f819b8b78ac61eb831775214e609f0917bc11a/db/cm/README). As @tseemann wrote above, this file seems not to be supplied by Rfam anymore.

Since you did not specify what exact criteria you used to get the Rfam family accessions, I took a shot at it. Here is my approach, which should be automatable.

Thanks to the public read-only Rfam MySQL DB, we can get a list for Bacteria, Viruses AND Archaea:

mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < query.sql > result.tab
query.sql will look like this:

SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE (f.type LIKE 'Gene;'
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Bacteria%';

Replace Bacteria with Viruses and Archaea and you have three files that could be used with cmfetch to get the corresponding CMs.

Note: I did not include tRNA (predicted by aragorn), rRNA (predicted by barrnap, rnammer) and lncRNA (Eukaryotes only).

The number of models (Rfam 12.2) is:

Bacteria: 761 Viruses: 222 Archaea: 109 As a side note: It seems, that the current CMs also include eukaryotic ones (e.g. RNaseP_nuc (RF00009)).

Let me know what you think and if you consider an update.

Regards

oschwengers commented 7 years ago

Hey @ealdraed , thanks for the hint! Your solution definitely sounds like a more solid approach. For the status quo I searched for bacterial Rfam families via the web interface. Maybe this led to some erroneous Rfam entries.

xvazquezc commented 7 years ago

@tseemann, aren't the taxid the numbers appearing in the first column of the taxonomy.txt files from Rfam? I checked some and they match the NCBI taxonomy. ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.3/database_files/taxonomy.txt.gz

PS: I need to annotate a bunch of Archaea and it would be great to have a way around before the next release. I tried the mysql approach but there might be a typo or something in the query.sql because nothing happens

ealdraed commented 7 years ago

@xvazquezc

I just executed the command again. I will attach the tab-separated results for Archaea, Bacteria and Viruses for your convenience. Please note that the files contain a header line.

Rfam was recently updated to 12.3. Here are the changes in family numbers (tRNA, rRNA and lncRNA entry types not included):

Rfam 12.2 12.3 Removed Added Renamed
Archaea 109 109 0 0 0
Bacteria 761 829 1 69 1
Viruses 222 225 1 4 0

Rfam_archaea_12.2.txt Rfam_archaea_12.3.txt (same as above) Rfam_bacteria_12.2.txt Rfam_bacteria_12.3.txt Rfam_viruses_12.2.txt Rfam_viruses_12.3.txt

ealdraed commented 7 years ago

Hello @tseemann, @oschwengers, @xvazquezc, hello all!

The current Rfam version is now 13.0. I noticed that the above SQL query included tmRNA which I think should also be omitted since aragorn takes care of that. The following is the updated query:

SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE ((f.type LIKE 'Gene;' AND f.description NOT LIKE '%transfer-messenger RNA')
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Archaea%';

Here are the numbers of families (excluding tRNA, tmRNA, rRNA and lncRNA):

Rfam 13.0
Archaea 115
Bacteria 850
Viruses 210

I also like to give a small outline of the process of getting the latest Rfam covariance models (CMs) based on the above SQL.

Save the above query into three text files (e.g. Rfam_archaea.sql, Rfam_bacteria.sql, Rfam_viruses.sql) and make sure to replace Archaea in the last line with Bacteria and Viruses, respectively. With these three files you query the Rfam database:

mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_archaea.sql > Rfam_archaea_13.0.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_bacteria.sql > Rfam_bacteria_13.0.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_viruses.sql > Rfam_viruses_13.0.txt

For your convenience, here is the output of the above commands: Rfam_archaea_13.0.txt Rfam_bacteria_13.0.txt Rfam_viruses_13.0.txt

Remove header in *.txt files (you can also use tail -n +2 or do it manually):

sed -i '1d' Rfam_archaea_13.0.txt
sed -i '1d' Rfam_bacteria_13.0.txt
sed -i '1d' Rfam_viruses_13.0.txt

Download Rfam CMs and extract families:

wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/13.0/Rfam.cm.gz
gunzip Rfam.cm.gz
cmfetch -o Rfam_archaea.cm -f Rfam.cm Rfam_archaea_13.0.txt
cmfetch -o Rfam_bacteria.cm -f Rfam.cm Rfam_bacteria_13.0.txt
cmfetch -o Rfam_viruses.cm -f Rfam.cm Rfam_viruses_13.0.txt

Finally convert to binary format and remove intermediate files:

cmconvert -o Archaea -b Rfam_archaea.cm
cmconvert -o Bacteria -b Rfam_bacteria.cm
cmconvert -o Viruses -b Rfam_viruses.cm
rm Rfam.cm
rm Rfam_archaea{.cm,_13.0.txt}
rm Rfam_bacteria{.cm,_13.0.txt}
rm Rfam_viruses{.cm,_13.0.txt}

Make sure to copy the files to prokka's db/cm folder. If you are working with Archaea, you must edit the Prokka source code, since the --setupdb command only cmpresses Bacteria and Viruses at the moment. Edit this line to contain ,Archaea. Then run prokka --setupdb and you should be good to start your next annotation.

djung0928 commented 5 years ago

Hello @ealdraed

I had the same issue with CM database setup so I followed your solution. Because the current Rfam version is 14.1, I tried to get Rfamarchaea14.1.txt, Rfambacteria14.1.txt, Rfamviruses14.1.txt applying the following command. However, it says no such file or directory

mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_archaea.sql > Rfam_archaea_14.1.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_bacteria.sql > Rfam_bacteria_14.1.txt
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < Rfam_viruses.sql > Rfam_viruses_14.1.txt

I also tried to find the files from their website (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.1/) but couldn't find the files.

Do you know how to get those up-to-date files? if so, could you tell me how to get them?

Thanks very much

ealdraed commented 5 years ago

Hello @djung0928 and sorry for the delayed response.

Do you have prepared the SQL files mentioned above?

E.g. Rfam_bacteria.sql:

SELECT DISTINCT f.rfam_acc, f.type, f.description FROM taxonomy tx INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc INNER JOIN family f ON f.rfam_acc = fr.rfam_acc WHERE ((f.type LIKE 'Gene;' AND f.description NOT LIKE '%transfer-messenger RNA') OR f.type LIKE '%CRISPR;' OR f.type LIKE '%antisense;' OR f.type LIKE '%antitoxin;' OR f.type LIKE '%miRNA;' OR f.type LIKE '%ribozyme;' OR f.type LIKE '%sRNA;' OR f.type LIKE '%snRNA%' OR f.type LIKE 'Intron;' OR f.type LIKE 'Cis-reg;' OR f.type LIKE '%IRES;' OR f.type LIKE '%frameshift_element;' OR f.type LIKE '%leader;' OR f.type LIKE '%riboswitch;' OR f.type LIKE '%thermoregulator;') AND tx.tax_string LIKE 'Bacteria%';

The mentioned text files are then output after the mysql command has finished.

standage commented 5 years ago

I was investigating some discrepancies between Prokka ncRNA annotations and annotation made directly by Infernal+Rfam using the recommended procedure. That's what led me to this thread.

@tseemann, @ealdraed's post on Nov 3 2017 looks great, although at this point it needs to be updated for Rfam 14.x. I hope to submit a pull request with updated Viral, Bacterial, and Archaeal CMs in the next couple of days.

tseemann commented 5 years ago

@standage Prokka uses nhmmer for speed. Ideally one would use cmscan.

standage commented 5 years ago

@tseemann Really? It looks like the latest master invokes cmscan: https://github.com/tseemann/prokka/blob/17d953461c2175347d49531b4bdcbed87ccd04eb/bin/prokka#L624.

tseemann commented 5 years ago

@standage Whoops I got confused! barrnap my rRNA predictor uses nhmmer.