soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

[Question] Advice on setting up MetaEuk for 140 genomes? #89

Closed jolespin closed 5 months ago

jolespin commented 6 months ago

I'm wondering what the best implementation would be here. My MetaEuk database is ~6.1GB with ~30M proteins.

If you were to run MetaEuk on 140 genomes, would you:

A) Concatenate all the genomes together and do one large run
B) Split it out into batches (10 batches running 14 each)
C) One at a time
D) Something else?

I'm currently running A), but it has written about 1.3TB of temporary data to disk, and I can't tell from the log file how much longer MetaEuk still needs to run. It has been running for >2 days and I'm trying to figure out whether the approach I took is best practice.

elileka commented 5 months ago

Hi,

Theoretically, joining several queries saves on indexing, but I don't think the runtime saved is significant compared to the time the search step will take. Therefore, I would probably not join all of them together, so that if something fails during a run, I don't need to re-run everything. You can also post the log of a single-genome run to see how much time merging would save and whether it makes sense to merge.

If you DO NOT intend to run taxtocontig later, you could also set the flag --remove-tmp-files to 1 to clean up after each genome is finished.

Best, Eli
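
The per-genome approach suggested above could be sketched roughly as follows. This is a minimal illustration, not an official MetaEuk workflow. It assumes the `metaeuk easy-predict` positional order (contigs, protein database, output prefix, temporary folder), which should be checked against `metaeuk easy-predict -h`. The paths, the function names `build_commands` and `run_all`, and the `.done` marker files are all hypothetical. Each genome gets its own run, so a failure only costs that one genome.

```python
from pathlib import Path
import subprocess


def build_commands(genomes, protein_db, out_dir, tmp_root, remove_tmp=True):
    """Return one `metaeuk easy-predict` argument list per genome FASTA.

    Assumed argument order: contigs, protein DB, output prefix, tmp folder.
    """
    commands = []
    for genome in map(Path, genomes):
        name = genome.stem  # e.g. "g1" for "genomes/g1.fa"
        cmd = [
            "metaeuk", "easy-predict",
            str(genome),
            str(protein_db),
            str(Path(out_dir) / name),   # per-genome output prefix
            str(Path(tmp_root) / name),  # per-genome temporary folder
        ]
        if remove_tmp:
            # Only safe if taxtocontig will NOT be run afterwards.
            cmd += ["--remove-tmp-files", "1"]
        commands.append(cmd)
    return commands


def run_all(commands, done_dir):
    """Run each command once, skipping genomes that already finished.

    Returns the output prefixes of the runs that failed.
    """
    done_dir = Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for cmd in commands:
        marker = done_dir / (Path(cmd[4]).name + ".done")
        if marker.exists():
            continue  # finished in an earlier invocation
        try:
            subprocess.run(cmd, check=True)
        except subprocess.CalledProcessError:
            failed.append(cmd[4])  # keep going with the remaining genomes
            continue
        marker.touch()
    return failed
```

Calling `run_all(build_commands(sorted(Path("genomes").glob("*.fa")), "proteinsDB", "preds", "tmp"), "done")` would then process the genomes one at a time, and re-running it after a crash would only repeat the unfinished ones. If taxtocontig is planned for later, pass `remove_tmp=False` so the temporary files are kept.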