tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

upgrade Infernal to 1.1.2 #207

Closed oschwengers closed 7 years ago

oschwengers commented 7 years ago

Hi, according to the release notes, Infernal 1.1.2 no longer has its earlier sequence length restriction:

Infernal 1.1.2 is the second update release for Infernal 1.1. Notable changes from 1.1.1: cmscan is significantly improved, and can now be used for genome annotation; enhancements include:

  • speed improvements due to storage of models in memory instead of rereading from disk for each query sequence
  • overlapping hits are annotated in tabular output files with the --tblout --fmt 2 option combination
  • clan membership (a la Rfam) is annotated in tabular output files with the --tblout --fmt 2 --clanin option combination
  • there is no longer a maximum query sequence length

I successfully ran a short, promising test with v1.1.2. Would it be possible to upgrade Infernal to 1.1.2 and then update the bacterial cm databases to include lncRNAs? They were filtered out before (imho because of this restriction). This would add >200 lncRNAs to the cm databases...

Best regards, Oliver

tseemann commented 7 years ago

@oschwengers I am happy to upgrade Infernal - I hadn't noticed the new version yet - thanks!

I don't recall filtering lncRNAs originally? (but my memory is bad)

I haven't updated the database because Rfam originally provided a GFF file, which I used to extract only the bacterial RNAs. However, that file is gone, and Rfam said they don't know how to do what I need. I don't want to use the whole Rfam database.

oschwengers commented 7 years ago

I've done a manual taxonomic search for bacteria via their web search and found 663 families.

As an alternative, I've built a subset covering the following entry types based on v12.1 (18.01.17):

@tseemann if desired I could open a pull request to share it

tseemann commented 7 years ago

@oschwengers My current set from 10.x has 564 families, i.e. RFxxxxx models.

cmconvert prokka/db/cm/Bacteria | grep -c ^ACC
564

It would be great if you could provide the accessions for the bacterial subsets! Maybe just email me at (my github username) at (unimelb dot edu dot au) ?

Also, I have updated the binaries to # INFERNAL 1.1.2 (July 2016)

oschwengers commented 7 years ago

@tseemann Great! I've sent you an email with all details and Rfam model ids

tseemann commented 7 years ago

Thanks @oschwengers!

I used cmfetch with your list on the latest Rfam 12.2 and have now doubled the number of ncRNA models it will find. I tested it on an E. coli genome and it went from 153 to 218 ncRNA features!
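
For anyone wanting to repeat this step, here is a minimal sketch. The exact commands used above were not posted, so this is only one plausible way to do it. It assumes a plain-text list of Rfam accessions, one per line (`bacteria.acc`, a made-up name), and a local copy of the full Rfam covariance model file (`Rfam.cm`, also an assumed name). `clean_ids` is a hypothetical helper, not part of Infernal or Prokka.

```shell
#!/bin/sh
# Hypothetical helper: normalise an accession list. Strips carriage returns
# and surrounding whitespace, keeps only RFxxxxx-style IDs, drops duplicates.
clean_ids() {
    tr -d '\r' < "$1" | awk '$1 ~ /^RF[0-9]+$/ { print $1 }' | sort -u
}

# Assumed file names: bacteria.acc (accession list) and Rfam.cm (all Rfam CMs).
# The block is skipped when the inputs or cmfetch are not available.
if [ -f bacteria.acc ] && [ -f Rfam.cm ] && command -v cmfetch >/dev/null 2>&1; then
    clean_ids bacteria.acc > bacteria.clean.acc
    # Build an SSI index once so that lookups do not rescan the whole file.
    [ -f Rfam.cm.ssi ] || cmfetch --index Rfam.cm
    # -f: the second argument is a file of keys; -o: write the CMs to a file.
    cmfetch -o Bacteria.cm -f Rfam.cm bacteria.clean.acc
fi
```

The output name `Bacteria.cm` is illustrative only; where the file has to live inside a Prokka installation is not covered here.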

ealdraed commented 7 years ago

Hello @tseemann and @oschwengers!

I read about this CM database update and had been wondering where to find the .gff3 that is mentioned in the README (https://github.com/tseemann/prokka/blob/f7f819b8b78ac61eb831775214e609f0917bc11a/db/cm/README). As @tseemann wrote above, this file no longer seems to be supplied by Rfam.

Since you did not specify what exact criteria you used to get the Rfam family accessions, I took a shot at it. Here is my approach, which should be automatable.

Thanks to the public read-only Rfam MySQL DB, we can get a list for Bacteria, Viruses AND Archaea:

mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < query.sql > result.tab

query.sql will look like this:

SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE (f.type LIKE 'Gene;'
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Bacteria%';

Replace Bacteria with Viruses and Archaea and you have three files that could be used with cmfetch to get the corresponding CMs.
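
A minimal sketch of how that could be automated, assuming the query above is saved as `query.sql` with `Bacteria` in its last line. `render_query` and `extract_accessions` are hypothetical helper names, and the per-domain file names are made up for illustration. The accession extraction assumes the mysql client writes tab-separated batch output with a header row.

```shell
#!/bin/sh
# Hypothetical helper: print a per-domain copy of the query by swapping the
# taxonomy prefix in the final LIKE clause. $1 = template SQL, $2 = domain.
render_query() {
    sed "s/LIKE 'Bacteria%'/LIKE '$2%'/" "$1"
}

# Hypothetical helper: unique accessions from the first column of the
# tab-separated result; the header row is skipped by the RFxxxxx pattern.
extract_accessions() {
    awk -F '\t' '$1 ~ /^RF[0-9]+$/ { print $1 }' "$1" | sort -u
}

# Runs only if query.sql and the mysql client are present.
if [ -f query.sql ] && command -v mysql >/dev/null 2>&1; then
    for domain in Bacteria Viruses Archaea; do
        render_query query.sql "$domain" > "query.$domain.sql"
        mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 \
              --database Rfam < "query.$domain.sql" > "result.$domain.tab"
        extract_accessions "result.$domain.tab" > "$domain.acc"
    done
fi
```

Each resulting `.acc` file holds one accession per line, which is the key-file layout that `cmfetch -f` expects.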

Note: I did not include tRNA (predicted by aragorn), rRNA (predicted by barrnap, rnammer) and lncRNA (Eukaryotes only).

The number of models (Rfam 12.2) is:

As a side note: it seems that the current CMs also include eukaryotic ones (e.g. RNaseP_nuc (RF00009)).

Let me know what you think and if you consider an update.

Regards