hoelzer-lab / hypro

Extend hypothetical prokka protein annotations using additional homology searches against larger databases
GNU General Public License v3.0

additional databases #6

Closed marlt closed 4 years ago

hoelzer commented 4 years ago

We leave this issue open for later reference. At the moment, using only uniprot-swissprot is fine.

marlt commented 4 years ago

Candidates: TrEMBL and UniRef, see https://www.uniprot.org/downloads#uniprotkblink

marlt commented 4 years ago

UniRef100/90/50 differ in size by several GB.

marlt commented 4 years ago

Since uniref50 did not yield the desired extension of the prokka hyprots, is the RCSB protein DB worth a try? Also, we could try a different test genome, so that we see results for more than just Chlamydia ...

marlt commented 4 years ago

Another idea: make the sensitivity of mmseqs adjustable - the default is 5.7, and it can be set anywhere between 1.0 (fast and less sensitive) and 7.5 (slow and most sensitive).
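A minimal sketch of how an adjustable sensitivity could be passed through to `mmseqs search`, assuming the search is launched from Python as a subprocess. The function name `build_mmseqs_search_cmd` and the database paths are hypothetical and not taken from hypro; only the `-s` flag and the 1.0 to 7.5 range come from the discussion above.

```python
import shlex

# Range and default as quoted in this thread for the mmseqs `-s` option.
MIN_SENSITIVITY = 1.0
MAX_SENSITIVITY = 7.5
DEFAULT_SENSITIVITY = 5.7


def build_mmseqs_search_cmd(query_db, target_db, result_db, tmp_dir,
                            sensitivity=DEFAULT_SENSITIVITY):
    """Return the argument list for an `mmseqs search` call.

    All paths are placeholders; the caller decides how to run the command,
    e.g. with subprocess.run(cmd, check=True).
    """
    if not MIN_SENSITIVITY <= sensitivity <= MAX_SENSITIVITY:
        raise ValueError(
            f"sensitivity must be between {MIN_SENSITIVITY} and "
            f"{MAX_SENSITIVITY}, got {sensitivity}"
        )
    return [
        "mmseqs", "search",
        query_db, target_db, result_db, tmp_dir,
        "-s", str(sensitivity),
    ]


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    cmd = build_mmseqs_search_cmd("queryDB", "targetDB", "resultDB", "tmp",
                                  sensitivity=7.5)
    print(shlex.join(cmd))
```

Validating the value before building the command keeps an out-of-range setting from reaching mmseqs at all, which makes it easier to tell a bad parameter apart from a crash inside the tool.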

hoelzer commented 4 years ago

@rcsb db: I don't know this database. The question is, how easily can you switch to another FASTA format?
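To illustrate what switching FASTA formats might involve, here is a rough sketch of a header parser that handles several layouts. The header patterns are assumptions about how UniProtKB, UniRef and the PDB sequence file are usually distributed and should be checked against the actual downloads; `parse_header` is a hypothetical helper, not part of hypro, and the example headers are made up.

```python
import re

# Assumed header layouts (verify against the real files):
#   UniProtKB: >sp|ACCESSION|ENTRY_NAME Description OS=... OX=...
#   UniRef:    >UniRef50_ACCESSION Description n=... Tax=... TaxID=...
#   PDB:       >101m_A mol:protein length:154  DESCRIPTION
UNIPROT_RE = re.compile(r"^(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")
UNIREF_RE = re.compile(r"^(UniRef\d+)_(\S+)\s*(.*)$")
PDB_RE = re.compile(r"^(\w+_\w+)\s+mol:(\S+)\s+length:\d+\s*(.*)$")


def parse_header(header):
    """Return (database, accession, description) for one FASTA header line."""
    line = header.lstrip(">").strip()

    match = UNIPROT_RE.match(line)
    if match:
        database = "swissprot" if match.group(1) == "sp" else "trembl"
        # Drop the trailing key=value fields that start with OS=.
        description = re.split(r"\s+OS=", match.group(4))[0]
        return database, match.group(2), description

    match = UNIREF_RE.match(line)
    if match:
        # Drop the trailing key=value fields that start with n=<count>.
        description = re.split(r"\s+n=\d+", match.group(3))[0]
        return match.group(1).lower(), match.group(2), description

    match = PDB_RE.match(line)
    if match:
        return "pdb", match.group(1), match.group(3).strip()

    raise ValueError(f"unrecognised FASTA header: {header!r}")


if __name__ == "__main__":
    # Made-up headers in the assumed layouts.
    for example in (
        ">sp|P12345|EXMPL_ECOLI Example protein OS=Escherichia coli OX=562",
        ">UniRef50_P12345 Example protein n=3 Tax=Escherichia TaxID=561",
        ">101m_A mol:protein length:154  MYOGLOBIN",
    ):
        print(parse_header(example))
```

With the format-specific logic confined to one function, adding a further database would mostly mean adding one more pattern.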

@test genome: also a good idea, so far we are only testing Chlamydia. Another bacterium would also be worth a try. Maybe Mycoplasma bovis? But note that the genetic code for prokka must then be switched from 11 (the standard bacterial table) to 4. Or Mycobacterium paratuberculosis. You can download a genome we assembled years ago here: https://www.rna.uni-jena.de/supplements/mycobacterium/genomes/MycAviPar386.final.fasta
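As a sketch of the genetic code switch, the prokka call for each test genome could be assembled like this. The helper `build_prokka_cmd`, the lookup table and the M. bovis file name are hypothetical; `--gcode`, `--outdir` and `--prefix` are existing prokka options. Table 4 is the Mycoplasma/Spiroplasma code and table 11 the standard bacterial code.

```python
# Translation tables per test organism; 11 is the standard bacterial table,
# 4 is the Mycoplasma/Spiroplasma table.
GENETIC_CODES = {
    "Mycoplasma bovis": 4,
    "Mycobacterium paratuberculosis": 11,
}


def build_prokka_cmd(genome_fasta, outdir, prefix, organism):
    """Return the argument list for a prokka run with the matching --gcode."""
    gcode = GENETIC_CODES[organism]
    return [
        "prokka",
        "--outdir", outdir,
        "--prefix", prefix,
        "--gcode", str(gcode),
        genome_fasta,
    ]


if __name__ == "__main__":
    # The M. bovis input file name is a placeholder.
    print(build_prokka_cmd("m_bovis.fasta", "prokka_mbovis", "mbovis",
                           "Mycoplasma bovis"))
    print(build_prokka_cmd("MycAviPar386.final.fasta", "prokka_mpara", "mpara",
                           "Mycobacterium paratuberculosis"))
```

Looking the code up per organism avoids silently annotating a Mycoplasma genome with the wrong translation table.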

I think the M. para would be a good test case.

@sensitivity 5.7 - what does this number tell me? I have the feeling we can simply go with high sensitivity and set this to 7.5, but maybe this will not change much.

marlt commented 4 years ago

I tried to set the sensitivity to something other than 5.7 - it always crashes with a segfault. I left this out for now, since it isn't that promising anyway...

The pdb is a protein structure database that uses sequences from UniProt, NCBI gen, Pfam, ExPASy, PubMed and SCOP. According to the homepage it is curated, FASTA sequences are available, and the FASTA file is small, just 30 MB.

marlt commented 4 years ago

So uniref50 works properly now, I would say. It improves the hyprot annotations somewhat compared to uniprotkb. pdb is still included, but does not really add more information.