soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

How can I use MetaEuk to annotate a genome without a reference? #46

Open Nana7m1 opened 2 years ago

Nana7m1 commented 2 years ago

Dear developers and other users, as the title says, I want to use MetaEuk to annotate a genome without a reference, but I cannot find how to do this in the manual.

Best, Nana7m1

elileka commented 2 years ago

Hello,

The way to do it is to download or construct a reference database to run against. What do you know about your genome? What taxonomic group is it? I could try to provide further advice based on your answer :) Once you have the reference database at hand, you could use easy-predict to find similar genes in your input genome.
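As a rough sketch of that last step, the call could be wrapped in Python as shown here. The positional order (contigs, reference, output prefix, temporary folder) follows the usage shown in the MetaEuk README; the file names and both helper functions are hypothetical and only serve as an illustration.

```python
import shutil
import subprocess
from typing import List


def build_easy_predict_cmd(contigs: str, reference: str,
                           out_prefix: str, tmp_dir: str) -> List[str]:
    """Assemble the argument list for `metaeuk easy-predict`.

    contigs    : input genome/contigs (FASTA or an existing sequence DB)
    reference  : protein FASTA or protein/profile DB to search against
    out_prefix : prefix for the prediction output files
    tmp_dir    : scratch folder for intermediate files
    """
    return ["metaeuk", "easy-predict", contigs, reference, out_prefix, tmp_dir]


def run_if_available(cmd: List[str]) -> bool:
    """Run cmd only if its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        print("executable not found, would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return True


# Hypothetical file names, chosen only for this example:
# cmd = build_easy_predict_cmd("contigs.fna", "reference_proteins.faa",
#                              "preds", "tmp")
# run_if_available(cmd)
```

Building the argument list separately from running it makes it easy to print and inspect the command before committing to a long search.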

Best, Eli

tiantianlili commented 11 months ago

Hello, thank you for developing this software. I would like to follow up on this question. I obtained contigs longer than 1 kbp from metagenome data of soil contaminated with heavy metals. I noticed that many of the MMseqs2 reference databases you recommend are available for download, some of which are nucleotide databases (https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). May I ask which database is most suitable for my data? Would SILVA be a good choice?

elileka commented 10 months ago

Hi,

As a reference DB, MetaEuk takes in either proteins or protein profiles. Therefore the nucleotide DBs available through the databases command, including SILVA, are not relevant.
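A minimal sketch of fetching a protein database with the databases module, assuming MMseqs2 is installed as `mmseqs` and that the module takes a database name, an output path, and a temporary directory, as described on the MMseqs2 wiki. The helper name, the output paths, and the small "nucleotide" set are assumptions made for illustration; the set lists only SILVA because that is the one nucleotide DB named in this thread, so it is not exhaustive.

```python
from typing import List

# Nucleotide databases that are not usable as a MetaEuk reference.
# Only SILVA is listed because it is the one mentioned in this thread.
NUCLEOTIDE_SELECTIONS = {"SILVA"}


def build_databases_cmd(selection: str, out_db: str, tmp_dir: str,
                        binary: str = "mmseqs") -> List[str]:
    """Assemble a `databases` download command for a protein database.

    Raises ValueError for selections known to be nucleotide databases,
    since MetaEuk expects proteins or protein profiles as the reference.
    """
    if selection in NUCLEOTIDE_SELECTIONS:
        raise ValueError(
            f"{selection} is a nucleotide database and cannot serve as a "
            "MetaEuk reference; choose a protein or profile database."
        )
    return [binary, "databases", selection, out_db, tmp_dir]


# Hypothetical output path and temporary folder:
# build_databases_cmd("UniRef50", "uniref50", "tmp")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the download target and available disk space have been checked.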

Choosing the right protein/protein profile DB depends on your scientific goal. Here are two ideas I have, based on the details you provided:

  • UniRef50 can be a good starting point for finding homologs of proteins that were mostly not discovered through metagenomic experiments. This DB can be downloaded through the databases command and it has taxonomic and other info, which can be used to annotate your sample.
  • If you are mainly interested in discovering homologs of rare, environmental proteins and less in annotation, you can download one of these DBs. Specifically, SRC (soil) and BFD seem most suitable for your sample. However, note that (1) environmental DBs like these are generally not annotated and that (2) these DBs are large: 200-300 Gb, which means higher requirements (storage, runtime, etc.), so I would first test on smaller scales.

You can also have a look at BUSCO if you are interested in estimating the genomic completeness of specific organisms via single-copy marker genes of various phylogenetic groups. BUSCO uses MetaEuk internally.

Best, Eli

tiantianlili commented 10 months ago

Thank you very much for your detailed reply. I'll try the UniRef50 and SRC databases first, hopefully with good results.

Best, li tian