SilasK closed this issue 3 years ago.
Hey @Sofie8 I put this question online as it might also help others.
So yes, you can use the virtual memory for the eggNOG-mapper step. scratch should be fine, but the default /dev/shm is even better if it exists on your server.
If you set
eggNOG_use_virtual_disk: true
the eggNOG db gets copied to the virtual disk, which accelerates access.
Why does it use 20 GB of RAM? Because I hardcoded the memory, which is not good. I'll try to fix this.
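A sketch of what reading the memory from the config instead of hardcoding it could look like. The helper name `annotation_mem` is hypothetical; the config keys `mem` and `eggNOG_use_virtual_disk` come from the discussion above, and the 40 GB reserve follows the `config["mem"] - 40` idea mentioned below:

```python
# Sketch: derive the rule's memory request from the user config instead of
# hardcoding 20 GB. "annotation_mem" is a hypothetical helper name.
def annotation_mem(config, default_gb=20, db_reserve_gb=40):
    """Return the memory (in GB) to request for the eggNOG annotation rule."""
    total = config.get("mem", default_gb)
    # When the db is copied to the virtual disk, leave room for it,
    # but never request less than the old default.
    if config.get("eggNOG_use_virtual_disk", False):
        return max(default_gb, total - db_reserve_gb)
    return total

print(annotation_mem({"mem": 160, "eggNOG_use_virtual_disk": True}))  # 120
```

The rule's `resources: mem=20` would then become `resources: mem=annotation_mem(config)`.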
Response to: https://github.com/eggnogdb/eggnog-mapper/issues/249#issuecomment-759486005 Sorry @Sofie8 that it takes so long.
What you can do:
Hi Silas,
Thanks for your fast reply!
1) I will check the filter options. I am running the annotation on contigs, not one genome, so it can take more time.
2) Split gene catalogs into smaller sizes. OK, will do.
3) Virtual memory: I think we need to additionally specify --usemem for this to take effect? See below.
4) OK, how can I run this rule separately, and where is it saved?
```
rule eggNOG_annotation:
    input:
        eggnog_db_files=get_eggnog_db_file(),
        seed=rules.eggNOG_homology_search.output
    output:
        temp("{folder}/{prefix}.emapper.annotations")
    params:
        data_dir=config['virtual_disk'] if config['eggNOG_use_virtual_disk'] else EGGNOG_DIR,
        prefix="{folder}/{prefix}",
        copyto_shm="t" if config['eggNOG_use_virtual_disk'] else 'f'
    threads: config.get("threads", 1)
    resources:
        mem=20
    shadow: "minimal"
    conda: "%s/eggNOG.yaml" % CONDAENV
    log: "{folder}/logs/{prefix}/eggNOG_annotate_hits_table.log"
    shell:
        """
        if [ {params.copyto_shm} == "t" ];
        then
            cp {EGGNOG_DIR}/eggnog.db {params.data_dir}/eggnog.db 2> {log}
        fi
        emapper.py --annotate_hits_table {input.seed} --no_file_comments \
            --override -o {params.prefix} --cpu {threads} --data_dir {params.data_dir} 2>> {log}
        """
```
- The line mem=20 I will change to config["mem"], or perhaps config["mem"] - 40 (or whatever it takes to load the db in memory).
- --data_dir: is it pointing to EGGNOG_DIR or to the virtual disk path? So the virtual disk path is the same as EGGNOG_DIR, no? Because I see it downloaded an exact duplicate of the db into my other virtual disk path.
- --data_dir: I think we need to additionally specify --usemem (if a local hmmpressed database is provided as target using --db, this flag will allocate the whole database in memory using hmmpgmd; the database will be unloaded after execution).
- --database option: specifies the target database for sequence searches; here we can choose euk, bact, arch, etc. (I don't know what the default is; all of them?).
- --qtype: (hmm, seq, MmSeqs?)
- The version implemented is emapper-2.0.1; perhaps we need to update to a newer version to be able to use the mmseqs option?
Sofie
I don't think the mmseqs version is officially released yet.
Hi Silas,
It finally finished the annotate_hits_table step. I now have several partial subsets (next time I will specify smaller subsets from the beginning).
Now I did:
```
# run emapper.py annotate_hits_table
source activate /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/conda_envs/13ec4a8d
emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining3_seed_orthologs \
    --no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 --usemem \
    --database /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/eggnogSofie \
    --data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset1/eggNOG_annotate_hits_table.log

# set the ones that are finished already aside
cd /ddn1/vol1/site_scratch/leuven/314/vsc31426/NGS_oct/StiemerOct/Genecatalog/subsets/genes
join -v 1 -t $'\t' <(grep -v "^#" remaining2_seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining3_seed_orthologs
```
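The `join -v 1` step above keeps every seed-ortholog row whose query ID does not yet appear in the partial annotation file. A Python equivalent of that set difference (a sketch; the assumption is that the query ID is the first tab-separated field in both files, and the example rows are made up):

```python
def remaining_orthologs(seed_lines, annotation_lines):
    """Return seed-ortholog rows whose query ID has no annotation yet.

    Both inputs are iterables of tab-separated lines; '#' lines are comments.
    Assumes the query ID is the first field in both files.
    """
    annotated = {
        line.split("\t")[0]
        for line in annotation_lines
        if line.strip() and not line.startswith("#")
    }
    return [
        line
        for line in seed_lines
        if line.strip()
        and not line.startswith("#")
        and line.split("\t")[0] not in annotated
    ]

seeds = ["Gene1\thit1\t1e-5\t50", "Gene2\thit2\t1e-6\t60", "# comment"]
done = ["Gene1\thit1\tCOG123"]
print(remaining_orthologs(seeds, done))  # only the Gene2 row is left
```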
This gives me partial outputs like subset21.emapper.annotations, subset22.emapper.annotations, subset23.emapper.annotations ==> these need to be combined into subset2.emapper.annotations (I can use cat, but it's doing something with the header lines, or with the lines where it wants to merge them...).
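If the partial files carry `#` comment or header lines, plain `cat` will interleave them into the merged file. A minimal sketch of concatenation that drops comment lines, assuming (as in emapper output with comments enabled) that they start with `#`:

```python
def concat_annotations(paths, out_path):
    """Concatenate emapper annotation files, dropping '#' comment lines."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                for line in fh:
                    if not line.startswith("#"):
                        out.write(line)
```

Usage would be something like `concat_annotations(["subset21.emapper.annotations", "subset22.emapper.annotations"], "subset2.emapper.annotations")`.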
4) I ran the rule combine_egg_nogg_annotations, but it does not see subset21, because it directly combines subset1.emapper.annotations, subset2.emapper.annotations, etc. up to 4, and it also exits with the error:
[Sun Jan 31 03:20:22 2021]
localrule combine_egg_nogg_annotations:
input: Genecatalog/subsets/genes/subset4.emapper.annotations, Genecatalog/subsets/genes/subset1.emapper.annotations, Genecatalog/subsets/genes/subset2.emapper.annotations, Genecatalog/subsets/genes/subset3.emapper.annotations
output: Genecatalog/annotations/eggNog.tsv.gz
jobid: 335
resources: mem=150, time=5
Job counts:
count jobs
1 combine_egg_nogg_annotations
1
[Sun Jan 31 03:20:40 2021]
Error in rule combine_egg_nogg_annotations:
jobid: 334
output: Genecatalog/annotations/eggNog.tsv.gz
RuleException:
ParserError in line 592 of /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile:
Error tokenizing data. C error: Expected 22 fields in line 23052, saw 28
I will share the files with you via Drive, because they probably have an error where it wants to merge: a header line? Or more items put on one line?
Thanks, Sofie
In theory you can just cat them. The header gets added during combine_egg_nogg_annotations.
Unfortunately, subset3 (and maybe the others as well) contains some errors that cause the failure in atlas.
E.g. in line 23052 of subset3 you have an unfinished line combined with the next one:
GO:1901360,GO:1901363,GO:Gene0021212',
I don't know a way to make sure that you got all annotations. I fear you would need to rerun the annotation step for subset3. The others are fine.
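One way to at least spot truncated or fused rows before atlas chokes on them is to check the field count per line. A sketch matching the "Expected 22 fields" pandas error above (the expected count, tab separator, and function name are assumptions):

```python
def malformed_lines(path, expected_fields=22, sep="\t"):
    """Return (line_number, n_fields) for rows without the expected field count.

    Comment lines starting with '#' and blank lines are skipped.
    """
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.startswith("#") or not line.strip():
                continue
            n = len(line.rstrip("\n").split(sep))
            if n != expected_fields:
                bad.append((lineno, n))
    return bad
```

Running this over each subset file would flag lines like 23052 (e.g. showing 28 fields instead of 22) so only the affected rows need to be re-annotated or removed.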
I'm sorry you had to make this excursion and run eggNOG-mapper yourself. But if I understand correctly, everything could be done inside atlas, couldn't it?
atlas allows to define smaller subsets and also to access the ramdisk to speed up the process.
Ok, I deleted that one line for now, and I could merge the others.
> atlas allows to define smaller subsets and also to access the ramdisk to speed up the process.

Yes, for subsets of 500,000 it runs 3 days until 90% completion and then deletes everything. So I have to try with 100,000 or fewer, but it might also be that it finishes subsets 1, 2, 3 fine and half of the 4th, and then deletes everything from subset 4... but at least I wouldn't lose all the computing credits, restarting only subset 4 from scratch.
Let me know when you have a new release that perhaps also fixes this mem usage. But you don't need to release a new version for only this, because I changed it for now in the rule. Or do I need to change it somewhere else, and where? :-) ~/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile
Thanks, Sofie
I scanned the file and found other problematic lines, but if you managed to merge, maybe it's ok.
What could be optimized?
Yes, in the end it was 2 lines in my subset3, I think, which threw the error; the others passed without error.
What could be optimized? ==> The hardcoded 20 GB mem, which I changed to reading the mem from the config file. Beyond that, no really good solutions; perhaps preventing emapper.py --annotate_hits_table from deleting its output file, but then if wall time is exceeded and it writes an incomplete line, you get the hassle I had above. Cluster submission is a solution, but qsub from within a qsub-submitted PBS script is not allowed for us. So, ugh, now I 'schedule' myself: I run atlas assembly and atlas genome on the normal-mem nodes, and when they are completed I submit atlas genecatalog to the bigmem partition. Scheduling myself a bit, one after the other.
@Sofie8 Thank you for your suggestions.
For your cluster problem, try this out https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-can-i-run-snakemake-on-a-cluster-where-its-main-process-is-not-allowed-to-run-on-the-head-node
So, when the pipeline is running other tools I see something like this:
submit command: sbatch --parsable --job-name=bam_2_sam_contigs -n 4 --mem=10g --time=30 -N 1 /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/tmp.xschlvv0/snakejob.bam_2_sam_contigs.431.sh
Whereas when running eggNOG-mapper, this is the command I see:
```
/hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/lib/python2.7/site-packages
# emapper-2.0.1
# ./emapper.py -m diamond --no_annot --no_file_comments --data_dir /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2 --cpu 1 -i Genecatalog/subsets/genes/subset2.faa -o Genecatalog/subsets/genes/subset2 --override
/hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond blastp -d /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2/eggnog_proteins.dmnd -q /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/Genecatalog/subsets/genes/subset2.faa --more-sensitive --threads 1 -e 0.001000 -o /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/shadow/tmpq8x2y2_6/emappertmp_dmdn_eWFYon/e02ee5ee476241f69a07767e97bc0acc --top 3 --query-cover 0 --subject-cover 0
```
No sbatch anywhere. Also, it does not produce a slurm output file like I see for the rest of the tools (I don't think this file is created only when the job is finished, right? It should be created as soon as the job starts).
@MalbertR I get your point. Seems eggNOG_homology_search is not submitted to the cluster. I don't know why. You don't have an old version of atlas, do you?
From the code of eggNOG_homology_search there is no reason why this rule should be executed locally and not the others.
> @MalbertR I get your point. Seems eggNOG_homology_search is not submitted to the cluster. I don't know why. You don't have an old version of atlas, do you?

I will check my version and update to the latest (just in case). Then re-run and come back to you. Thanks!

> From the code there is no reason why this rule should be executed locally and not the others. eggNOG_homology_search
Hi Silas,
The issue seems to be solved now. At first I thought it was probably thanks to the updated version of the pipeline, but then I looked at whether there were differences in how I was running the pipeline before compared to this last time, and realized that I wasn't adding the cluster profile before. So that was probably the issue... Oops!
@MalbertR No problem. By the way, have a look at the discussion in this thread #351 for optimal results with the atlas genecatalog.
Below is how it runs in the log. Do you know why the mem is only 20, even though I specified 160 in the config?
```
[Sun Jan 3 13:44:22 2021]
rule eggNOG_annotation:
    input: /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog.db, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog_proteins.dmnd, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/checksum_checked, Genecatalog/subsets/genes/subset3.emapper.seed_orthologs
    output: Genecatalog/subsets/genes/subset3.emapper.annotations
    log: Genecatalog/subsets/genes/logs/subset3/eggNOG_annotate_hits_table.log
    jobid: 1075
    wildcards: folder=Genecatalog/subsets/genes, prefix=subset3
    threads: 36
    resources: mem=20, time=5
```