metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

eggNOG_annotation takes very long #351

Closed SilasK closed 3 years ago

SilasK commented 3 years ago

[Sun Jan 3 13:44:22 2021]
rule eggNOG_annotation:
    input: /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog.db, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog_proteins.dmnd, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/checksum_checked, Genecatalog/subsets/genes/subset3.emapper.seed_orthologs
    output: Genecatalog/subsets/genes/subset3.emapper.annotations
    log: Genecatalog/subsets/genes/logs/subset3/eggNOG_annotate_hits_table.log
    jobid: 1075
    wildcards: folder=Genecatalog/subsets/genes, prefix=subset3
    threads: 36
    resources: mem=20, time=5

SilasK commented 3 years ago

Hey @Sofie8 I put this question online as it might also help others.

So yes, you can use virtual memory for the eggNOG-mapper step. Scratch should be fine, but the default /dev/shm is even better if it exists on your server.

If you set

eggNOG_use_virtual_disk: true

the eggNOG database gets copied to the virtual disk, which speeds up access.
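A rough Python sketch of what this staging step amounts to (the real rule does it in shell; the paths and the function name here are illustrative only): copy the sqlite database onto a memory-backed filesystem such as /dev/shm so lookups are served from RAM instead of slow network storage.

```python
import shutil
import tempfile
from pathlib import Path

def stage_on_ramdisk(eggnog_dir, ramdisk="/dev/shm"):
    """Copy eggnog.db onto a memory-backed filesystem and return the new path."""
    src = Path(eggnog_dir) / "eggnog.db"
    dst = Path(ramdisk) / "eggnog.db"
    shutil.copyfile(src, dst)
    return dst

# Demo with temporary directories standing in for the real locations.
eggnog_dir = Path(tempfile.mkdtemp())
ramdisk = Path(tempfile.mkdtemp())  # on a real server this would be /dev/shm
(eggnog_dir / "eggnog.db").write_text("dummy database contents")

staged = stage_on_ramdisk(eggnog_dir, ramdisk)
print(staged.read_text())  # dummy database contents
```

emapper.py is then pointed at the ramdisk via --data_dir, so every database access hits memory.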

SilasK commented 3 years ago

Why does it use 20 GB of RAM? Because I hardcoded the memory, which is not good. I'll try to fix this.

SilasK commented 3 years ago

Response to https://github.com/eggnogdb/eggnog-mapper/issues/249#issuecomment-759486005. Sorry @Sofie8 that it takes so long.

What you can do:

  1. Check the filter options for the gene catalog.
  2. There is an option in atlas to split the gene catalog into smaller subsets and annotate them separately. (Yes, this means you would need to restart.)
  3. If annotate_hits_table takes so long: are you using the virtual-disk (ramdisk) option we discussed above? It really improves the speed of this step.
  4. See the code I use to combine the annotations produced by several eggNOG runs: rule combine_egg_nogg_annotations:
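That combining step boils down to: concatenate the per-subset, header-less tables and prepend a single header row. A stdlib-only sketch (not the atlas implementation; the two-column header here is a placeholder for emapper v2's real 22-column header):

```python
import tempfile
from pathlib import Path

def combine_annotations(files, out_path, header):
    """Concatenate header-less TSV annotation files and prepend one header row."""
    with open(out_path, "w") as out:
        out.write("\t".join(header) + "\n")
        for f in files:
            out.write(Path(f).read_text())

# Demo with two tiny fake subsets (2 columns instead of emapper's 22).
tmp = Path(tempfile.mkdtemp())
(tmp / "subset1.emapper.annotations").write_text("Gene1\tK00001\n")
(tmp / "subset2.emapper.annotations").write_text("Gene2\tK00002\n")

combined = tmp / "eggNog.tsv"
combine_annotations(sorted(tmp.glob("subset*.emapper.annotations")),
                    combined, header=["query", "KO"])
print(combined.read_text())
```

Because the inputs carry no header lines (--no_file_comments), a plain concatenation is safe as long as every row is well-formed.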
Sofie8 commented 3 years ago

Hi Silas,

Thanks for your fast reply!

1) I will check the filter options. I am running the annotation on contigs, not a single genome, so it can take more time.
2) Split the gene catalog into smaller subsets: OK, will do.
3) Virtual memory: I think we need to additionally specify --usemem for this to take effect? See below.
4) OK, how can I run this rule separately? Where is it saved?

rule eggNOG_annotation:
    input:
        eggnog_db_files=get_eggnog_db_file(),
        seed=rules.eggNOG_homology_search.output
    output:
        temp("{folder}/{prefix}.emapper.annotations")
    params:
        data_dir=config['virtual_disk'] if config['eggNOG_use_virtual_disk'] else EGGNOG_DIR,
        prefix="{folder}/{prefix}",
        copyto_shm="t" if config['eggNOG_use_virtual_disk'] else 'f'
    threads: config.get("threads", 1)
    resources:
        mem=20
    shadow: "minimal"
    conda: "%s/eggNOG.yaml" % CONDAENV
    log: "{folder}/logs/{prefix}/eggNOG_annotate_hits_table.log"
    shell:
        """
        if [ {params.copyto_shm} == "t" ] ;
        then
            cp {EGGNOG_DIR}/eggnog.db {params.data_dir}/eggnog.db 2> {log}
        fi

        emapper.py --annotate_hits_table {input.seed} --no_file_comments \
          --override -o {params.prefix} --cpu {threads} --data_dir {params.data_dir} 2>> {log}
        """

- The line mem=20 I will change to config["mem"], or perhaps config["mem"] - 40 (or whatever it takes to load the db in memory).
- --data_dir: is it pointing to EGGNOG_DIR or to the virtual disk path? Is the virtual disk path the same as EGGNOG_DIR? Because I see it downloaded an exact duplicate of the db into my other virtual disk path..
- --data_dir: I think we need to additionally specify --usemem (if a local hmmpressed database is provided as target using --db, this flag will allocate the whole database in memory using hmmpgmd; the database will be unloaded after execution).
- --database option: specify the target database for sequence searches; here we can choose euk, bact, arch.. (I don't know what the default is; all?).
- --qtype: (hmm, seq, MmSeqs?)
- The version implemented is emapper-2.0.1; perhaps we need to update to a newer version to be able to use the mmseqs option?

Sofie
SilasK commented 3 years ago
  1. By default we take all genes from contigs, but also a lot of partial genes.
  2. I followed the instructions here: I copy the eggnog.db onto a disk that is in memory, so I guess the --usemem flag is not necessary. But I could use it if eggNOG_use_virtual_disk: false. Did you use this option?
  3. Don't use the rule; use the code if you want to hack it yourself. Otherwise run atlas genecatalog with smaller subsets from the beginning. If you use eggNOG_use_virtual_disk it should be quite fast.

I don't think the mmseqs version has been officially released yet?

Sofie8 commented 3 years ago

Hi Silas,

It finally finished the annotate_hits_table step. I have several partial subsets now (next time I will specify smaller subsets from the beginning).

Now I did:

# run emapper.py annotate_hits_table
source activate /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/conda_envs/13ec4a8d

emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining3_seed_orthologs \
--no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 --usemem --database /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/eggnogSofie \
--data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset1/eggNOG_annotate_hits_table.log

# set the ones that are finished already aside
cd /ddn1/vol1/site_scratch/leuven/314/vsc31426/NGS_oct/StiemerOct/Genecatalog/subsets/genes
join -v 1 -t $'\t' <(grep -v "^#" remaining2_seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining3_seed_orthologs
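The join -v 1 trick above is a set difference: keep only the seed rows whose query id does not yet appear in the partial .emapper.annotations output. The same bookkeeping in Python (a stdlib-only sketch with made-up rows; real seed-ortholog files have more columns):

```python
def remaining_seeds(seed_rows, annotated_rows):
    """Return seed-ortholog rows whose query id (first field) has no annotation yet."""
    done = {row.split("\t")[0] for row in annotated_rows
            if row and not row.startswith("#")}
    return [row for row in seed_rows
            if row and not row.startswith("#")
            and row.split("\t")[0] not in done]

seeds = ["Gene1\thitA\t1e-30\t200",
         "Gene2\thitB\t1e-25\t180",
         "Gene3\thitC\t1e-10\t90"]
annotated = ["Gene2\tCOG1234\t..."]  # Gene2 was already annotated

print(remaining_seeds(seeds, annotated))  # keeps Gene1 and Gene3
```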

This gives me partial outputs like subset21.emapper.annotations, subset22.emapper.annotations, subset23.emapper.annotations, which need to be combined into subset2.emapper.annotations (I can use cat, but it's doing something with the header lines, or with the lines where it wants to merge them..)

4) I ran the rule combine_egg_nogg_annotations, but it does not see subset21, because it directly combines subset1.emapper.annotations, subset2.emapper.annotations, etc. up to 4, and it also exits with the error:

[Sun Jan 31 03:20:22 2021]
localrule combine_egg_nogg_annotations:
    input: Genecatalog/subsets/genes/subset4.emapper.annotations, Genecatalog/subsets/genes/subset1.emapper.annotations, Genecatalog/subsets/genes/subset2.emapper.annotations, Genecatalog/subsets/genes/subset3.emapper.annotations
    output: Genecatalog/annotations/eggNog.tsv.gz
    jobid: 335
    resources: mem=150, time=5

Job counts:
        count   jobs
        1       combine_egg_nogg_annotations
        1
[Sun Jan 31 03:20:40 2021]
Error in rule combine_egg_nogg_annotations:
    jobid: 334
    output: Genecatalog/annotations/eggNog.tsv.gz

RuleException:
ParserError in line 592 of /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile:
Error tokenizing data. C error: Expected 22 fields in line 23052, saw 28

I will share the files with you via Drive, because they probably have an error where it wants to merge: a header line, or more items on one line?

Thanks, Sofie

SilasK commented 3 years ago

In theory you can just cat them. The header gets added during combine_egg_nogg_annotations.

Unfortunately, subset3 (and maybe the others as well) contains some malformed lines that cause the error in atlas.

E.g. in line 23052 of subset3 there is an unfinished line fused with the next one: GO:1901360,GO:1901363,GO:Gene0021212',

I don't know a way to make sure that you got all annotations. I fear you would need to rerun the annotation step for subset3. The others are fine.
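One way to catch such rows before combining (a sketch, not part of atlas): count tab-separated fields per line and flag anything that deviates from emapper v2's 22 columns; this is essentially the check pandas performs when it raises the ParserError.

```python
def bad_lines(rows, expected_fields=22):
    """Return (line_number, field_count) for rows with a wrong column count."""
    return [(i, row.count("\t") + 1)
            for i, row in enumerate(rows, start=1)
            if row and not row.startswith("#")
            and row.count("\t") + 1 != expected_fields]

rows = ["\t".join(["a"] * 22),   # well-formed row
        "\t".join(["b"] * 28)]   # truncated row fused with the next one
print(bad_lines(rows))  # [(2, 28)]
```

This only detects the damage; the flagged queries still need to be re-annotated.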

I'm sorry you had to take this detour and run eggNOG-mapper yourself. But if I understand correctly, everything could have been done inside atlas, couldn't it?

atlas allows you to define smaller subsets and also to use the ramdisk to speed up the process.

Sofie8 commented 3 years ago

Ok, I deleted that one line for now, and I could merge the others.

atlas allows to define smaller subsets and also to access the ramdisk to speed up the process.

Yes, but with subsets of 500,000 it runs for 3 days, reaches 90% completion, and then deletes everything. So I have to try with 100,000 or fewer. It might still handle subsets 1, 2 and 3 fine, get halfway through the 4th, and then delete everything from subset 4, but at least I wouldn't lose all the computing credits: I'd only restart subset 4 from scratch.

Let me know when you have a new release that perhaps also fixes this mem usage. But you don't need to release a new version for only this, because I changed it for now in the rule; or do I need to change it somewhere else, and where? :-) ~/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile

Thanks, Sofie

SilasK commented 3 years ago

I scanned the file and found other problematic lines, but if you managed to merge, maybe it's OK.

What could be optimized?

Sofie8 commented 3 years ago

Yes, in the end it was 2 lines in my subset3, I think, that threw the error; the others passed without error.

What could be optimized? ==> The hardcoded 20 GB mem, which I changed to read the mem from the config file. Beyond that, no really good solutions. Perhaps preventing emapper.py --annotate_hits_table from deleting its output file, but then if wall time is exceeded and it writes an incomplete line, you get the hassle I had above. Cluster submission would be a solution, but qsub from within a qsub-submitted PBS script is not allowed for us. So, ugh, now I 'schedule' myself: I run atlas assembly and atlas genome on the normal-mem nodes and, when they are complete, submit atlas genecatalog to the bigmem partition. Scheduling myself a bit, one step after the other.

SilasK commented 3 years ago

@Sofie8 Thank you for your suggestions.

For your cluster problem, try this out https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-can-i-run-snakemake-on-a-cluster-where-its-main-process-is-not-allowed-to-run-on-the-head-node

MalbertR commented 3 years ago

So, when the pipeline is running other tools I see something like this:

submit command: sbatch --parsable --job-name=bam_2_sam_contigs -n 4 --mem=10g --time=30 -N 1 /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/tmp.xschlvv0/snakejob.bam_2_sam_contigs.431.sh

Whereas when running EggNogmapper, this is the command I see:

 /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/lib/python2.7/site-packages
#  emapper-2.0.1
# ./emapper.py  -m diamond --no_annot --no_file_comments --data_dir /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2 --cpu 1 -i Genecatalog/subsets/genes/subset2.faa -o Genecatalog/subsets/genes/subset2 --override
  /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond blastp -d /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2/eggnog_proteins.dmnd -q /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/Genecatalog/subsets/genes/subset2.faa --more-sensitive --threads 1 -e 0.001000 -o /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/shadow/tmpq8x2y2_6/emappertmp_dmdn_eWFYon/e02ee5ee476241f69a07767e97bc0acc --top 3 --query-cover 0 --subject-cover 0

No sbatch anywhere. Also, it does not output a slurm output file like I see for the rest of the tools (the file isn't created only when the job finishes, right? It should be created as soon as the job starts).

SilasK commented 3 years ago

@MalbertR I get your point. Seems eggNOG_homology_search is not submitted to the cluster. I don't know why. You don't have an old version of atlas, do you?

SilasK commented 3 years ago

From the code, there is no reason why this rule (eggNOG_homology_search) should be executed locally and not the others.

MalbertR commented 3 years ago

@MalbertR I get your point. Seems eggNOG_homology_search is not submitted to the cluster. I don't know why. You don't have an old version of atlas, do you?

I will check my version and update to the latest (just in case), then re-run and come back to you. Thanks!

MalbertR commented 3 years ago

From the code, there is no reason why this rule (eggNOG_homology_search) should be executed locally and not the others.

Hi Silas,

The issue seems to be solved now. At first I thought it was thanks to the updated version of the pipeline, but then I looked at how I was running the pipeline before compared to this last time and realized that I wasn't adding the cluster profile before. So that was probably the issue... Oops!

SilasK commented 3 years ago

@MalbertR No problem. By the way, have a look at the discussion in this thread #351 for optimal results with the atlas genecatalog.