qiyunzhu / woltka

Woltka: a versatile meta'omic data classifier
BSD 3-Clause "New" or "Revised" License
71 stars 25 forks source link

AssertionError: Conflicting values found for "RXN-12373". #30

Open adswafford opened 4 years ago

adswafford commented 4 years ago

Running woltka classify as "Step 2: stratified functional profiling" with preserved --outmap in first woltka classify for taxonomy and got this error:

cat ~/plasma_woltka_strat-1.e1315562 Traceback (most recent call last): File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in sys.exit(cli()) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 764, in call return self.main(args, kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 717, in main rv = self.invoke(ctx) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 1137, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 956, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 555, in invoke return callback(args, kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/cli.py", line 181, in classify workflow(kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 98, in workflow tree, rankdic, namedic, root = build_hierarchy( File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 559, in build_hierarchy updatedict(rankdic, {k: rank for k in set(map.values())}) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 75, in update_dict add_dict(dic, key, value) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 44, in add_dict assert dic[key] == value, f'Conflicting values found for "{key}".' AssertionError: Conflicting values found for "RXN-12373". 762.56user 112.71system 14:45.71elapsed 98%CPU (0avgtext+0avgdata 67045276maxresident)k 1568120inputs+32outputs (270major+93314494minor)pagefaults 0swaps

@qiyunzhu any ideas?

qiyunzhu commented 4 years ago

@adswafford Perhaps there are multiple mapping files with conflicting hierarchies. For example, protein-to-gene and protein-to-enzrxn cannot be supplied simultaneously (at least not for now). Can you please provide the full command?

adswafford commented 4 years ago

yep, sorry should have done that before

62 input_dir=/projects/cmi_proj/blood_microbiome/three_studies 63 taxonomy=/projects/cmi_proj/blood_microbiome/three_studies/taxonomy 64 function=/projects/wol/20170307/release/annotation 65 66 #do gotus 67 echo 'starting woltk' 68 conda activate woltk 69 70 echo 'woltk function for plasma' 71 72 #set the alignment directory 73 align_dir=$input_dir/alignment_files 74 75 #make a directory for the output 76 out_dir=$input_dir/woltka 77 mkdir -p $out_dir 78 79 map_dir=$out_dir/mapdir 80 func_dir=$out_dir/taxfunc

82 if [ ! -f $out_dir/plasma.woltk.fin ] 83 then 84 $(which time) woltka classify \ 85 -i $align_dir \ 86 --map $taxonomy/g2tid.txt \ 87 --nodes $taxonomy/nodes.dmp \ 88 --names $taxonomy/names.dmp \ 89 --rank phylum,genus,species,free,none \ 90 --name-as-id \ 91 --outmap $map_dir \ 92 -o $out_dir/taxonomy/ > $out_dir/output_taxonomy.log 93 94 $(which time) woltka classify \ 95 -i $align_dir \ 96 --coords $function/coords.txt.xz \ 97 --map $function/uniref.map.xz \ 98 --map $function/metacyc/protein.map \ 99 --map $function/metacyc/protein2enzrxn.map \ 100 --map $function/metacyc/enzrxn2reaction.map \ 101 --map $function/metacyc/reaction2pathway.map \ 102 --map $function/metacyc/pathway2class.map \ 103 --map-as-rank \ 104 --rank enzrxn2reaction,reaction2pathway,pathway2class \ 105 --stratify $map_dir/genus \ 106 --outmap $map_dir \ 107 -o $func_dir/ > $out_dir/output_function.log 108 109 echo 'plasma woltka finished' > $out_dir/plasma.woltk.fin 110 fi

adswafford commented 4 years ago

Expanded the command to:

And this is the output log: (shogun-1.0.6) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome 04:55 PM]$ cat three_studies/woltka/output_function.log Input directory: /projects/cmi_proj/blood_microbiome/three_studies/alignment_files. Number of alignment files to read: 1252. Number of alignment files to read: 1252. Demultiplexing: off. Stratification file directory: /projects/cmi_proj/blood_microbiome/three_studies/woltka/mapdir/genus. Constructing classification system... Parsing simple map file: /projects/wol/20170307/release/annotation/uniref.map.xz... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/protein.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/protein2enzrxn.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/enzrxn2reaction.map... Done. Parsing simple map file: /projects/wol/20170307/release/annotation/metacyc/reaction2pathway.map...

with the same error message:

Traceback (most recent call last): File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in sys.exit(cli()) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 764, in call return self.main(args, kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 717, in main rv = self.invoke(ctx) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 1137, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 956, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 555, in invoke return callback(args, kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/cli.py", line 181, in classify workflow(kwargs) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 98, in workflow tree, rankdic, namedic, root = build_hierarchy( File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 559, in build_hierarchy updatedict(rankdic, {k: rank for k in set(map.values())}) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 75, in update_dict add_dict(dic, key, value) File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/util.py", line 44, in add_dict assert dic[key] == value, f'Conflicting values found for "{key}".' AssertionError: Conflicting values found for "RXN-12373". 762.56user 112.71system 14:45.71elapsed 98%CPU (0avgtext+0avgdata 67045276maxresident)k 1568120inputs+32outputs (270major+93314494minor)pagefaults 0swaps

I grepped the reaction2pathway file and found that RXN-12373 was duplicated: $ grep RXN-12373 /projects/wol/20170307/release/annotation/metacyc/reaction2pathway.map RXN-12369 RXN-12373 RXN-12372 RXN-12373

but so are many others in that file. Any suggestions @qiyunzhu and @swandro have you run into the same issue?

qiyunzhu commented 4 years ago

@adswafford I figured out what was the problem. This ID RXN-12373 is found both as a reaction and a pathway, which is confusing to the program, and it doesn't make sense (apparently RXN is a reaction not a pathway). I will change the way MetaCyc information is extracted with more consideration. Will get this fixed very soon!

qiyunzhu commented 4 years ago

@adswafford I have fixed the file reaction2pathway.map. I hope this is going to work now!

adswafford commented 4 years ago

Great, I'll give it a try and let you know. I remember when I was looking in the reaction2pathway.map file that I saw several other RXNs as values; should all of these be removed, or are some of them valid pathways?

On Fri, Apr 10, 2020 at 4:51 PM Qiyun Zhu notifications@github.com wrote:

@adswafford https://github.com/adswafford I have fixed the file reaction2pathway.map. I hope this is going to work now!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/qiyunzhu/woltka/issues/30#issuecomment-612269988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGOEDBXKZJUNW26LFP7ITQDRL6WILANCNFSM4MDL4FLA .

qiyunzhu commented 4 years ago

I also noticed and simply removed all these "RXN" values. Hope this time the file is good to go!

adswafford commented 4 years ago

Got it, thanks!

On Mon, Apr 13, 2020, 8:57 AM Qiyun Zhu notifications@github.com wrote:

I also noticed and simply removed all these "RXN" values. Hope this time the file is good to go!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/qiyunzhu/woltka/issues/30#issuecomment-612960896, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGOEDBTFYYFXVYMB44OBPCLRMMY75ANCNFSM4MDL4FLA .

angelovaag commented 2 years ago

Hi, I have a similar issue using the WoLdb + its associated newick tree. If I dont use the newick tree in the commandline, the woltka classify processing goes through fine, but if added the WoL tree, I get "AssertionError: Conflicting values found for "G000005825", which is the very first entree of the GenID to taxID mapping file.

I have checked for this GenomeID and its associated taxID, but I only have them present 1 time in each of the files (tree file, geneomeID-2-taxID, and taxID-2-lineage mapping files). Is there more information that can be provided that would help me troubleshoot this error?

My command line is DB=/WoL_db linMAP=${DB}/WoLdb_TAXid2LINid.map.txt genMAP=${DB}/WoLdb_GENid2TAXid.map.txt treMAP=${DB}/WoLdb_genomes/WoLdb_tree.nwk

woltka classify -i ${sams} -o woltka_raw.tsv --add-lineage \ --to-tsv --unassigned --lineage ${linMAP} \ --map ${genMAP} --name-as-id --newick ${treMAP}

And the actual error: Input directory: sams_WoLdbwBT2/. Number of alignment files to read: 4. Demultiplexing: off. Constructing classification system... Parsing Newick tree file: WoLdb_tree.nwk... Done. Parsing lineage file: WoLdb_TAXid2LINid.map.txt... Done. Parsing simple map file: WoLdb_GENid2TAXid.map.txt...Traceback (most recent call last): File "/data/conda/envs/woltka/bin/woltka", line 8, in sys.exit(cli()) File "/data/conda/envs/woltka/lib/python3.10/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/data/conda/envs/woltka/lib/python3.10/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/data/conda/envs/woltka/lib/python3.10/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/data/conda/envs/woltka/lib/python3.10/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/data/conda/envs/woltka/lib/python3.10/site-packages/click/core.py", line 760, in invoke return __callback(args, kwargs) File "/data/conda/envs/woltka/lib/python3.10/site-packages/woltka/cli.py", line 175, in classify_cmd workflow(kwargs) File "/data/conda/envs/woltka/lib/python3.10/site-packages/woltka/workflow.py", line 116, in workflow tree, rankdic, namedic, root = build_hierarchy( File "/data/conda/envs/woltka/lib/python3.10/site-packages/woltka/workflow.py", line 713, in build_hierarchy updatedict(tree, map) File "/data/conda/envs/woltka/lib/python3.10/site-packages/woltka/util.py", line 75, in update_dict add_dict(dic, key, value) File "/data/conda/envs/woltka/lib/python3.10/site-packages/woltka/util.py", line 44, in add_dict assert dic[key] == value, f'Conflicting values found for "{key}".' AssertionError: Conflicting values found for "G000005825".

qiyunzhu commented 2 years ago

Hello @angelovaag thanks for your interest in our program. I think the issue is most likely due to the simultaneous use of taxonomy and phylogeny. Woltka allows combining heterogeneous classification systems in one run, but it will complain when one lower-level unit is mapped to multiple high-level units. In your case, I suspect that G000005825, as well as most other genome IDs, are simultaneously mapped to taxonomy IDs defined in ${genMAP}, and tree nodes defined in ${treMAP}. This will cause the trouble. I would recommend separating taxonomic profiling and phylogenetic profiling into two runs.

By the way, Woltka does support one-to-multiple mappings outside the main workflow, in woltka tools collapse. The typical application is to collapse functional units along a hierarchy, like GO or KEGG.

angelovaag commented 2 years ago

@qiyunzhu thank you for the response. I ran Woltka is in phylogenic classification mode and it was successful (thank yoU!), but I am not sure I understand the output, and how to use it for phylogeny-aware microbime analysis. Is there a way to produce a newick tree for each sample from Woltka?

qiyunzhu commented 2 years ago

@angelovaag This "phylogenetic profiling" is something we are still studying. The basic rationale is that we can use nodes instead of tips to represent components of a microbiome. Whether this brings some merit to real studies is currently under investigation. Technically, you can get a newick tree for each sample. We don't have a command built-in in Woltka for this. But with scikit-bio, you can do something like:

import pandas as pd
from skbio import TreeNode
tree = TreeNode.read('wol.nwk')
taxa = pd.read_table('output.tsv', index_col=0).index.tolist()
tree = tree.shear(taxa)
tree.write('output.nwk')
angelovaag commented 2 years ago

Thank you @qiyunzhu

ElDeveloper commented 2 years ago

I am seeing the same issue when I try to run the following command (as suggested by the documentation):

woltka classify --input  file.sam \
--coords function/coords.txt.xz \
--map    function/metacyc/protein.map.xz \
--map    function/metacyc/protein-to-enzrxn.txt \
--map    function/metacyc/enzrxn-to-reaction.txt \
--map    function/metacyc/reaction-to-pathway.txt \
--map    function/metacyc/pathway-to-super_pathway.txt \
--rank   gene,protein,enzrxn,reaction,pathway,super \
--output outdir

This raises the following exception:

Input alignment file: file.sam.
Demultiplexing: on.
Constructing classification system...
  Will extract rank name from map filename.
  Parsing simple map file: protein.map.xz... Done.
  Parsing simple map file: protein-to-enzrxn.txt... Done.
  Parsing simple map file: enzrxn-to-reaction.txt... Done.
  Parsing simple map file: reaction-to-pathway.txt...Traceback (most recent call last):
  File "/Users/yoshiki/miniconda3/envs/mg/bin/woltka", line 8, in <module>
    sys.exit(cli())
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/cli.py", line 179, in classify_cmd
    workflow(**kwargs)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/workflow.py", line 120, in workflow
    map_rank, zippers)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/workflow.py", line 766, in build_hierarchy
    update_dict(rankdic, {k: rank for k in set(map_.values())})
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/util.py", line 75, in update_dict
    add_dict(dic, key, value)
  File "/Users/yoshiki/miniconda3/envs/mg/lib/python3.6/site-packages/woltka/util.py", line 44, in add_dict
    assert dic[key] == value, f'Conflicting values found for "{key}".'
AssertionError: Conflicting values found for "RXN-12373".

If I understood correctly from the conversation above, deleting the repeated entries for whatever key is found to be conflicting should fix the error. However, it seems like there's a fair number of them, so I can keep deleting these until the error goes away but that seems wrong.

For reference, I'm using the metacyc files I accessed from the Globus endpoint that were last updated on 4/15/2021.

qiyunzhu commented 2 years ago

Hello @ElDeveloper Thanks for reporting this issue. I check the data files and found that it is because some MetaCyc entries have conflicting rank assignments. For example, "RXN-12373" is defined as a reaction (in enzrxn-to-reaction.txt) and a pathway (in reaction-to-pathway.txt), which triggers the error.

I think that a work around (and a better approach, because it accounts for multiple assignment) is the second method in the link you pointed to:

woltka classify -i indir -c coords.txt -o gene.biom
woltka tools collapse -i gene.biom     -m gene-to-protein.map     -o protein.biom
woltka tools collapse -i protein.biom  -m protein-to-enzrxn.map   -o enzrxn.biom
woltka tools collapse -i enzrxn.biom   -m enzrxn-to-reaction.map  -o reaction.biom
woltka tools collapse -i reaction.biom -m reaction-to-pathway.map -o pathway.biom
woltka tools collapse -i pathway.biom  -m pathway-to-super.map    -o super.biom

Alternatively, you may use the first method with the KEGG hierachy, which does not have conflicting assignments.

ElDeveloper commented 2 years ago

Thanks @qiyunzhu! I can't close the issue but that solved the problem for MetaCyc. Thanks so much! 👍 Maybe worth making a note in the docs about this as a workaround? If you think that makes sense I can submit a PR.

qiyunzhu commented 2 years ago

@ElDeveloper That would be great. Thank you!

I envision that Woltka's classification system needs an upgrade to make things more flexible (i.e., one classification unit identifier can be assigned to multiple levels). This isn't hard to implement, but it could reduce performance significantly. I plan to get to it during around the winter break.