Open adswafford opened 4 years ago
@adswafford Perhaps there are multiple mapping files with conflicting hierarchies. For example, protein-to-gene and protein-to-enzrxn cannot be supplied simultaneously (at least not for now). Can you please provide the full command?
yep, sorry should have done that before
62 input_dir=/projects/cmi_proj/blood_microbiome/three_studies 63 taxonomy=/projects/cmi_proj/blood_microbiome/three_studies/taxonomy 64 function=/projects/wol/20170307/release/annotation 65 66 #do gotus 67 echo 'starting woltk' 68 conda activate woltk 69 70 echo 'woltk function for plasma' 71 72 #set the alignment directory 73 align_dir=$input_dir/alignment_files 74 75 #make a directory for the output 76 out_dir=$input_dir/woltka 77 mkdir -p $out_dir 78 79 map_dir=$out_dir/mapdir 80 func_dir=$out_dir/taxfunc
82 if [ ! -f $out_dir/plasma.woltk.fin ] 83 then 84 $(which time) woltka classify \ 85 -i $align_dir \ 86 --map $taxonomy/g2tid.txt \ 87 --nodes $taxonomy/nodes.dmp \ 88 --names $taxonomy/names.dmp \ 89 --rank phylum,genus,species,free,none \ 90 --name-as-id \ 91 --outmap $map_dir \ 92 -o $out_dir/taxonomy/ > $out_dir/output_taxonomy.log 93 94 $(which time) woltka classify \ 95 -i $align_dir \ 96 --coords $function/coords.txt.xz \ 97 --map $function/uniref.map.xz \ 98 --map $function/metacyc/protein.map \ 99 --map $function/metacyc/protein2enzrxn.map \ 100 --map $function/metacyc/enzrxn2reaction.map \ 101 --map $function/metacyc/reaction2pathway.map \ 102 --map $function/metacyc/pathway2class.map \ 103 --map-as-rank \ 104 --rank enzrxn2reaction,reaction2pathway,pathway2class \ 105 --stratify $map_dir/genus \ 106 --outmap $map_dir \ 107 -o $func_dir/ > $out_dir/output_function.log 108 109 echo 'plasma woltka finished' > $out_dir/plasma.woltk.fin 110 fi
Expanded the command to:
And this is the output log: (shogun-1.0.6) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome 04:55 PM]$ cat three_studies/woltka/output_function.log Input directory: /projects/cmi_proj/blood_microbiome/three_studies/alignment_files. Number of alignment files to read: 1252. Number of alignment files to read: 1252. Demultiplexing: off. Stratification file directory: /projects/cmi_proj/blood_microbiome/three_studies/woltka/mapdir/genus. Constructing classification system... Parsing simple map file: /projects/wol/20170307/release/annotation/uniref.map.xz... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/protein.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/protein2enzrxn.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/enzrxn2reaction.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/reaction2pathway.map...
with the same error message:
Traceback (most recent call last):
File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in
I grepped the reaction2pathway file and found that RXN-12373 was duplicated: $ grep RXN-12373 /projects/wol/20170307/release/annotation/metacyc/reaction2pathway.map RXN-12369 RXN-12373 RXN-12372 RXN-12373
but so are many others in that file. Any suggestions @qiyunzhu and @swandro have you run into the same issue?
@adswafford I figured out what was the problem. This ID RXN-12373
is found both as a reaction and a pathway, which is confusing to the program, and it doesn't make sense (apparently RXN is a reaction not a pathway). I will change the way MetaCyc information is extracted with more consideration. Will get this fixed very soon!
@adswafford I have fixed the file reaction2pathway.map
. I hope this is going to work now!
Great, I'll give it a try and let you know. I remember when I was looking in the reaction2pathway.map file that I saw several other RXNs as values; should all of these be removed, or are some of them valid pathways?
On Fri, Apr 10, 2020 at 4:51 PM Qiyun Zhu notifications@github.com wrote:
@adswafford https://github.com/adswafford I have fixed the file reaction2pathway.map. I hope this is going to work now!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/qiyunzhu/woltka/issues/30#issuecomment-612269988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGOEDBXKZJUNW26LFP7ITQDRL6WILANCNFSM4MDL4FLA .
I also noticed and simply removed all these "RXN" values. Hope this time the file is good to go!
Got it, thanks!
On Mon, Apr 13, 2020, 8:57 AM Qiyun Zhu notifications@github.com wrote:
I also noticed and simply removed all these "RXN" values. Hope this time the file is good to go!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/qiyunzhu/woltka/issues/30#issuecomment-612960896, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGOEDBTFYYFXVYMB44OBPCLRMMY75ANCNFSM4MDL4FLA .
Hi, I have a similar issue using the WoLdb + its associated newick tree. If I dont use the newick tree in the commandline, the woltka classify processing goes through fine, but if added the WoL tree, I get "AssertionError: Conflicting values found for "G000005825", which is the very first entree of the GenID to taxID mapping file.
I have checked for this GenomeID and its associated taxID, but I only have them present 1 time in each of the files (tree file, geneomeID-2-taxID, and taxID-2-lineage mapping files). Is there more information that can be provided that would help me troubleshoot this error?
My command line is DB=/WoL_db linMAP=${DB}/WoLdb_TAXid2LINid.map.txt genMAP=${DB}/WoLdb_GENid2TAXid.map.txt treMAP=${DB}/WoLdb_genomes/WoLdb_tree.nwk
woltka classify -i ${sams} -o woltka_raw.tsv --add-lineage \ --to-tsv --unassigned --lineage ${linMAP} \ --map ${genMAP} --name-as-id --newick ${treMAP}
And the actual error:
Input directory: sams_WoLdbwBT2/.
Number of alignment files to read: 4.
Demultiplexing: off.
Constructing classification system...
Parsing Newick tree file: WoLdb_tree.nwk... Done.
Parsing lineage file: WoLdb_TAXid2LINid.map.txt... Done.
Parsing simple map file: WoLdb_GENid2TAXid.map.txt...Traceback (most recent call last):
File "/data/conda/envs/woltka/bin/woltka", line 8, in
Hello @angelovaag thanks for your interest in our program. I think the issue is most likely due to the simultaneous use of taxonomy and phylogeny. Woltka allows combining heterogeneous classification systems in one run, but it will complain when one lower-level unit is mapped to multiple high-level units. In your case, I suspect that G000005825
, as well as most other genome IDs, are simultaneously mapped to taxonomy IDs defined in ${genMAP}
, and tree nodes defined in ${treMAP}
. This will cause the trouble. I would recommend separating taxonomic profiling and phylogenetic profiling into two runs.
By the way, Woltka does support one-to-multiple mappings outside the main workflow, in woltka tools collapse
. The typical application is to collapse functional units along a hierarchy, like GO or KEGG.
@qiyunzhu thank you for the response. I ran Woltka is in phylogenic classification mode and it was successful (thank yoU!), but I am not sure I understand the output, and how to use it for phylogeny-aware microbime analysis. Is there a way to produce a newick tree for each sample from Woltka?
@angelovaag This "phylogenetic profiling" is something we are still studying. The basic rationale is that we can use nodes instead of tips to represent components of a microbiome. Whether this brings some merit to real studies is currently under investigation. Technically, you can get a newick tree for each sample. We don't have a command built-in in Woltka for this. But with scikit-bio, you can do something like:
import pandas as pd
from skbio import TreeNode
tree = TreeNode.read('wol.nwk')
taxa = pd.read_table('output.tsv', index_col=0).index.tolist()
tree = tree.shear(taxa)
tree.write('output.nwk')
Thank you @qiyunzhu
I am seeing the same issue when I try to run the following command (as suggested by the documentation):
woltka classify --input file.sam \
--coords function/coords.txt.xz \
--map function/metacyc/protein.map.xz \
--map function/metacyc/protein-to-enzrxn.txt \
--map function/metacyc/enzrxn-to-reaction.txt \
--map function/metacyc/reaction-to-pathway.txt \
--map function/metacyc/pathway-to-super_pathway.txt \
--rank gene,protein,enzrxn,reaction,pathway,super \
--output outdir
This raises the following exception:
Input alignment file: file.sam.
Demultiplexing: on.
Constructing classification system...
Will extract rank name from map filename.
Parsing simple map file: protein.map.xz... Done.
Parsing simple map file: protein-to-enzrxn.txt... Done.
Parsing simple map file: enzrxn-to-reaction.txt... Done.
Parsing simple map file: reaction-to-pathway.txt...Traceback (most recent call last):
File "/Users/yoshiki/miniconda3/envs/mg/bin/woltka", line 8, in <module>
sys.exit(cli())
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/cli.py", line 179, in classify_cmd
workflow(**kwargs)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/workflow.py", line 120, in workflow
map_rank, zippers)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/workflow.py", line 766, in build_hierarchy
update_dict(rankdic, {k: rank for k in set(map_.values())})
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/util.py", line 75, in update_dict
add_dict(dic, key, value)
File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/util.py", line 44, in add_dict
assert dic[key] == value, f'Conflicting values found for "{key}".'
AssertionError: Conflicting values found for "RXN-12373".
If I understood correctly from the conversation above, deleting the repeated entries for whatever key is found to be conflicting should fix the error. However, it seems like there's a fair number of them, so I can keep deleting these until the error goes away but that seems wrong.
For reference, I'm using the metacyc files I accessed from the Globus endpoint that were last updated on 4/15/2021.
Hello @ElDeveloper Thanks for reporting this issue. I check the data files and found that it is because some MetaCyc entries have conflicting rank assignments. For example, "RXN-12373" is defined as a reaction (in enzrxn-to-reaction.txt
) and a pathway (in reaction-to-pathway.txt
), which triggers the error.
I think that a work around (and a better approach, because it accounts for multiple assignment) is the second method in the link you pointed to:
woltka classify -i indir -c coords.txt -o gene.biom
woltka tools collapse -i gene.biom -m gene-to-protein.map -o protein.biom
woltka tools collapse -i protein.biom -m protein-to-enzrxn.map -o enzrxn.biom
woltka tools collapse -i enzrxn.biom -m enzrxn-to-reaction.map -o reaction.biom
woltka tools collapse -i reaction.biom -m reaction-to-pathway.map -o pathway.biom
woltka tools collapse -i pathway.biom -m pathway-to-super.map -o super.biom
Alternatively, you may use the first method with the KEGG hierachy, which does not have conflicting assignments.
Thanks @qiyunzhu! I can't close the issue but that solved the problem for MetaCyc. Thanks so much! 👍 Maybe worth making a note in the docs about this as a workaround? If you think that makes sense I can submit a PR.
@ElDeveloper That would be great. Thank you!
I envision that Woltka's classification system needs an upgrade to make things more flexible (i.e., one classification unit identifier can be assigned to multiple levels). This isn't hard to implement, but it could reduce performance significantly. I plan to get to it during around the winter break.
Running woltka classify as "Step 2: stratified functional profiling" with preserved --outmap in first woltka classify for taxonomy and got this error:
cat ~/plasma_woltka_strat-1.e1315562 Traceback (most recent call last): File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in
sys.exit(cli())
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 764, in call
return self.main(args, kwargs)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, ctx.params)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 555, in invoke
return callback(args, kwargs)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/cli.py", line 181, in classify
workflow(kwargs)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 98, in workflow
tree, rankdic, namedic, root = build_hierarchy(
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 559, in build_hierarchy
updatedict(rankdic, {k: rank for k in set(map.values())})
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 75, in update_dict
add_dict(dic, key, value)
File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 44, in add_dict
assert dic[key] == value, f'Conflicting values found for "{key}".'
AssertionError: Conflicting values found for "RXN-12373".
762.56user 112.71system 14:45.71elapsed 98%CPU (0avgtext+0avgdata 67045276maxresident)k
1568120inputs+32outputs (270major+93314494minor)pagefaults 0swaps
@qiyunzhu any ideas?