Closed oschwengers closed 7 years ago
@oschwengers I am happy to upgrade Infernal - I hadn't noticed the new version yet - thanks!
I don't recall filtering lncRNAs originally? (but my memory is bad)
I haven't updated the database because Rfam originally provided a GFF file, which I used to extract only the bacterial RNAs. However, that file is gone, and Rfam said they don't know how to do what I need. I don't want to use the whole Rfam database.
I've done a manual taxonomic search for bacteria via their web search and found 663 families.
As an alternative, I've built a subset covering the following entry types based on v12.1 (18.01.17):
@tseemann if desired I could open a pull request to share it
@oschwengers My current set from 10.x has 564 families, i.e. RFxxxxx models.
cmconvert prokka/db/cm/Bacteria | grep -c ^ACC
564
It would be great if you could provide the accessions for the bacterial subsets! Maybe just email me at (my github username) at (unimelb dot edu dot au) ?
Also, I have updated the binaries to # INFERNAL 1.1.2 (July 2016)
@tseemann Great! I've sent you an email with all details and Rfam model ids
Thanks @oschwengers!
I used cmfetch with your list on the latest Rfam 12.2 and have now doubled the number of ncRNA models it will find. I tested it on an E. coli genome and it went from 153 to 218 ncRNA features!
Hello @tseemann and @oschwengers!
I read about this CM database update and had been wondering where to find the .gff3 file mentioned in the README (https://github.com/tseemann/prokka/blob/f7f819b8b78ac61eb831775214e609f0917bc11a/db/cm/README). As @tseemann wrote above, Rfam no longer seems to supply this file.
Since you did not specify what exact criteria you used to get the Rfam family accessions, I took a shot at it. Here is my approach, which should be automatable.
Thanks to the public read-only Rfam MySQL DB, we can get a list for Bacteria, Viruses AND Archaea:
mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk --port 4497 --database Rfam < query.sql > result.tab
query.sql will look like this:
SELECT DISTINCT f.rfam_acc, f.type, f.description
FROM taxonomy tx
INNER JOIN rfamseq rf ON rf.ncbi_id = tx.ncbi_id
INNER JOIN full_region fr ON fr.rfamseq_acc = rf.rfamseq_acc
INNER JOIN family f ON f.rfam_acc = fr.rfam_acc
WHERE (f.type LIKE 'Gene;'
OR f.type LIKE '%CRISPR;'
OR f.type LIKE '%antisense;'
OR f.type LIKE '%antitoxin;'
OR f.type LIKE '%miRNA;'
OR f.type LIKE '%ribozyme;'
OR f.type LIKE '%sRNA;'
OR f.type LIKE '%snRNA%'
OR f.type LIKE 'Intron;'
OR f.type LIKE 'Cis-reg;'
OR f.type LIKE '%IRES;'
OR f.type LIKE '%frameshift_element;'
OR f.type LIKE '%leader;'
OR f.type LIKE '%riboswitch;'
OR f.type LIKE '%thermoregulator;')
AND tx.tax_string LIKE 'Bacteria%';
Replace Bacteria with Viruses and Archaea and you have three files that could be used with cmfetch to get the corresponding CMs.
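The steps above could be automated roughly as follows. This is only a sketch under a few assumptions that are not stated in this thread: the query template contains the literal pattern 'Bacteria%', mysql prints a header row followed by tab-separated rows, a full Rfam CM file has been downloaded as Rfam.cm, and the function names (domain_query, rfam_keys, build_subset, build_all_domains) are made up for illustration. Indexing the CM file with cmfetch --index before fetching is also a choice of this sketch.

```shell
#!/bin/sh
# Rewrite the tax_string filter of the Bacteria query for another domain.
# $1 = query template (assumed to contain 'Bacteria%'), $2 = domain name.
domain_query() {
    sed "s/'Bacteria%'/'$2%'/" "$1"
}

# Print the unique Rfam accessions found in the first tab-separated column
# of a mysql result file; the header row and non-RF lines are skipped.
rfam_keys() {
    awk -F '\t' '$1 ~ /^RF[0-9]+$/ { print $1 }' "$1" | sort -u
}

# Fetch the listed models from a full Rfam CM file.
# $1 = result table, $2 = full CM file (hypothetical path), $3 = output CM.
build_subset() {
    rfam_keys "$1" > "$3.keys"
    # Create the SSI index once if it is missing (assumption of this sketch).
    [ -e "$2.ssi" ] || cmfetch --index "$2"
    cmfetch -f "$2" "$3.keys" > "$3"
}

# Run the whole pipeline for the three domains (needs mysql and Infernal).
build_all_domains() {
    for d in Bacteria Viruses Archaea; do
        domain_query query.sql "$d" |
            mysql --user rfamro --host mysql-rfam-public.ebi.ac.uk \
                  --port 4497 --database Rfam > "$d.tab"
        build_subset "$d.tab" Rfam.cm "$d"
    done
}
```

Calling build_all_domains in a directory that holds query.sql and Rfam.cm would leave one CM file per domain, which could then be prepared with cmpress in the same way as the existing files in db/cm.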
Note: I did not include tRNA (predicted by aragorn), rRNA (predicted by barrnap, rnammer) and lncRNA (Eukaryotes only).
The number of models (Rfam 12.2) is:
As a side note: it seems that the current CMs also include eukaryotic ones (e.g. RNaseP_nuc (RF00009)).
Let me know what you think and if you consider an update.
Regards
Hi, according to the release notes of Infernal 1.1.2, it has lost its earlier sequence length restriction:
I successfully ran a short, promising test with v1.1.2. So, would it be possible to upgrade Infernal to 1.1.2 and then update the bacterial CM databases to incorporate lncRNAs, which were filtered out before (imho due to this restriction)? This would add >200 lncRNAs to the CM databases...
Best regards, Oliver