metagentools / MetaCoAG

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
https://metacoag.readthedocs.io/en/stable/
GNU General Public License v3.0
49 stars, 4 forks

KeyError contig_4488 #42

Open ZarulHanifah opened 3 months ago

ZarulHanifah commented 3 months ago

Hello Vini,

I have been using MetaCoAG for a while, and it works well most of the time, but I have now run into a KeyError: 'contig_4488'. The dataset I've been working on is ONT, assembled with metaFlye.

This contig_4488 is not present in my Flye assembly. An edge_4488 is present in the assembly graph, though (could this be the issue?).

grep -w "contig_4488\|edge_4488" /fs03/jm41/Zarul/C002_D1_results/flye/assembly.fasta /fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa /fs03/jm41/Zarul/C002_D1_results/flye/assembly_info.txt /fs03/jm41/Zarul/C002_D1_results/binning_medaka/metacoag/coverm_abundance.tsv /fs03/jm41/Zarul/C002_D1_results/binning_medaka/metacoag 
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:S  edge_4488   GGCATGACGCCCAGTACCACCACGTACGGGACAGGCATCAATAGCAACACGGGCCTCGGCGCTACTAACAATGGCAATGCCGGCACGACACCCGGCACCGGCGTCTCCGGGGCCGGCAGCAGCGGCGCGACGATGGGCACCAGCGGCACCACAGGCCTCGGCAGTACCTACAATGGCACCACCGGTACGACGCTCGGCACCGGCACCGGTACAACCGGCGTCGGCGCCAATGGCCTCGGCACCGGCGGCGCCACGGGCCTCGGCGGCACCGACAACGGCGCCACCGGCGCGACGCCGGGCACCGGCGGCACTGGAGCGGGGACCGGCGGCACTGGCGGTACTGGCGGCCGGTAAGGCACCCGGAGTACGCCGCTAACGGCGACGGGCGGGGCGAGGAGGCGTCACCCCGTCCGTCGGCGCGCCCGCGCCGGAAAGCTGACCCGTTCCTCGATGCCGGCCGGGTCCTCGTGAATGATGATCTCGGCGTGCGGAAAGGCGCGCTGCAGCTGCGCCTCGACCGCGTCGGAAATCTGGTGCGCGCGCGACAGGCTCATCGCGCCGTCCATCTCGATATGCAGCTGAATAAACGCGGTCGGCCCGGCGATGCGGGTGCGGATGTCATGCACCGCGGTGACTTCGGGATGGCTTTCGGCGATCGCGCGGACCCGGGCGCGCTCCGAATCGGGCAATTCGCGGTCCATCAGCTGGGTCAGCGACAATCGCGCGATCTTGAATGCCCCGCGGATGAGCCACAGCCCGACCGCAGCGCCGAACAGCGGGTCGAGCAGCGGCATCGGAAAGGAGCTGCCGATCGCCAGCGTCGCGATGACGCCGAGGTTCAGGATCAGGTCGCCGCGATAGTGCAATTCATCGGCGCCGATCGCCAACGAGCCGGTGCGTTTGACGACGTAGCGCTGGTAGAGAACCAGGCCGAGCGTCATGGCGATCGCCACCAGCATGACCGCGATCCCCGCCGGCGGGTGCGCCACCGGGCGCGGCTCGGCCAGGCGGCGGATCGCCTCGAACATCAACAAGGCAGCGCTGCCGACGAGAAAGGCGGACTGGGCGAGCGCCGCCAACGGCTCGGCCTTGCCGTGGCCGAAGCGGTGCTGGCGGTCGGGCGGCGTCGCGGCGCGCCGCACGGCGAACAGATTGACCAGCGAGGCGACGGCATCGACCAGCGAATCGACGAGGCTCGACAACAGGGCGACCGAGCCGGTGCCGATCCAGGCGGCGAGCTTGGCGACAATCAGCACCGTCGCGATCGCCAGCGAGGCGGCGGTCGCGCGCCGCCGCAGCATCTGCGCGGCGCCGCGCTCGCTCGTTACCTCGCTCACGGATAGAGGCGCTGTTTGCGCCATCCCTCGCCGTCGCGGACGAACGCCACGCGGTCGTGCAGACGGAACGGCCGCTCCTGCCAAAACTCGACGCTGTCCGGCCATATCCGAAAACCCGACCAGTAGGCGGGTCGCGGCACGGCGGGTTGCTCGGCATAGCGCTGCGAGTACAGCGCGAAGCGGCGCTCCAGCTCGGCGCGCTCGGCGAGCGGGCGCGACTGGTCGGAGGCCCAGGCGCCGATCTGGCTGTCGCGCGGCCGGGTCGCGAAATAGGCGTCGGCCTCGGCCGGCGAGACCGCTCTCGCCTCGCCCTCGATGCGCACCTGGCGGGCCAGCGACTTCCAGTAGAGGCACAGCGCGGCCCGCGGATTGGCCGCCAGCTCCGCGCCCTTGCGGCTGTCGAGATTGGTGTAAAACACGAAGCCGCGCTGGTCGGCGCCCTTGAGCAGCACCGCGCGCAACGACGGCCGCCCGTCCGCTGTCGCGGTCGCCAGCATCGTCGCCTCGGGGATCGGCTCGCACTGCGCGGCCAGCGCGAACCAGCGCGCGAACGGCGCGAACGGTTCGTTCTCGGCGATCTCGTCGGTCATT
GCGTGAGGTGGCTCCGCTTTGGTTGTGCGCGCCGGAGCCTTCCCTACTCCGCCCCGCGATCCTCGGCAACCGCCCTGCTCGACACGATCGCGGCCGCCGGCGCCGAAGAAGGGCCGCGGCCGCGGATCTCCGCCAGCAGCGCCGCCAAGGTCACTCGCATCGCCGCCGCCTCGGCCTTGACGATCCGCTCCATCGCCGGCGCGACCTGGCGCTGCCACGACGCCAGCGGCCGCGCCAGCCAGCTGCCGGCGAGCGGCAGACCGAGCGCCAGATAAAGATCGTGCAGCGTCGCGCTATGCGGGTCCCAGGCGAGCACCCAGGCGCCGTCCTGGGTCGGCGCGGTGAACCCGGCCTCGGCGAGGATCTGCAGATGCTCGTCGGCGACCGAGGTCGGCACGCCGAGTTCGCTCGCCAGCATCGCGGTGCGGCAGCGCAGGCCGTGCTGCTGCGCCCGCGCCAGCGCGGCAATCAGCGCCAGCGCGAAACCGAGCCTCACGCCGCCGCTGCTCAGATGCGACAATCGCTCATCGACCCGCCAGGTCGGCAGGTTGGCGGCGACCACGGCGCCGAGCAATACCGCATTCCAGGTGACGTACATCCACAACAGAAAGATCGGGATCGCCGCGAGCGCGCCATAGACGGTCTGATAGAACGACGAGGCGGCGATGTAGATGGAAAATCCAACCTTCAGGATCTCGATGGCGGCCGCGGCGACCGCGGCGCCGAGGAGGCCGTCGCGCCAGCGCACCGCACAATTCGGAATGAGGCAATAGAGCAGTGTGCAGGCGATCAACTCCAACACGAACGGGACAAGGCGCGCGACGACATGCGGCCAGCCGCTCGTCAGCTCCGTCACCAGCGCCGGGTTGAGGCCGGCATGGCGGGCCGCCGTGTCGAGATAGGTCGACAGGGTCAGGCTCATGCCGACCAGCAGCGGGCCCAACGTGATCAGCGTCCAATAGGCGAGCACCCGCTGCACCCAGGGCCGCGGCGTCGTGACCCGCCACAGCGCATTGAGGCGGTCCTCGACCGTAACCAGCAGCAGGACGCCGGTGGCGGCGATGCCGACGAGACCGATCGCGGTCGCCTGCGCCGCCGAACCGGCGAAATACTGGAACCACTGCGCCGCCTGCTCGCTGATCGCCGGCACGAAATTACGAAACAACAGCGCCGGCAGGTCCTGCCGCGCCGGCGCGAAACTCGGGAAGACCGACAGGACGCCGAGCCCGACGACGCCAAGCGGCACCAGCGACACCAGGGTCGTGTAGCTGAGCGCGCCCGAGGCGGCAAAGCAGCCGTCATGGTTGAACCGGTGCAGCGCATAGCGGCAGAAGGTCAGCACCGCCCTGAGCCGGCGGCGCAGCACGCCGTGGCCAGAGTCTCGGCGGCTGAACTTGGCGCGGCCGGGCGACGGAGGACCGCGATGTCG   dp:i:32
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4485   -   edge_4488   +   0M  RC:i:7
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4487   -   edge_4488   +   0M  RC:i:5
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4488   +   edge_265255 +   0M  RC:i:7
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4488   +   edge_265254 +   0M  RC:i:16
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4488   -   edge_265252 -   0M  RC:i:2
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4488   -   edge_100100 -   0M  RC:i:13
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:L  edge_4488   -   edge_265253 +   0M  RC:i:40
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:P  contig_4485 edge_265255-,edge_4488-,edge_4485+,edge_8112-   *
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:P  contig_4487 edge_265254-,edge_4488-,edge_4487+  *
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:P  contig_68975    edge_4488-,edge_265253+,edge_24317-,edge_68975+ *
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:P  contig_100100   edge_277711+,edge_100100+,edge_4488+,edge_265254+   *
/fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa:P  contig_265252   edge_4474-,edge_265252+,edge_4488+  *

As you can see, "contig_4488" does not appear in any of the input files given to MetaCoAG; only "edge_4488" does, and only in the assembly graph.
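As a cross-check on the grep above, the FASTA headers can also be compared as exact IDs rather than word matches. This is only a minimal Python sketch: the helper name `fasta_ids` and the in-memory demo records are hypothetical, and it assumes the contig ID is the first whitespace-delimited token of each header line.

```python
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


# Tiny in-memory stand-in for assembly.fasta (hypothetical records).
demo = io.StringIO(">contig_4485\nACGT\n>contig_4487 extra description\nGGCC\n")
ids = set(fasta_ids(demo))

# Exact membership test for the contig named in the KeyError.
print("contig_4488" in ids)
```

With the real file, the same function could be called on `open(".../flye/assembly.fasta")` in place of the `io.StringIO` object.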

The command executed:

metacoag --assembler flye \
    --graph /fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa \
    --contigs /fs03/jm41/Zarul/C002_D1_results/flye/assembly.fasta \
    --paths /fs03/jm41/Zarul/C002_D1_results/flye/assembly_info.txt \
    --abundance /fs03/jm41/Zarul/C002_D1_results/binning_medaka/metacoag/coverm_abundance.tsv \
    --output $outdir &> /fs03/jm41/Zarul/C002_D1_results/log/metacoag_medaka/log.log

Here is the error message:

2024-03-27 02:39:34,410 - INFO - Welcome to MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs.
2024-03-27 02:39:34,429 - INFO - Input arguments: 
2024-03-27 02:39:34,430 - INFO - Assembler used: flye
2024-03-27 02:39:34,430 - INFO - Contigs file: /fs03/jm41/Zarul/C002_D1_results/flye/assembly.fasta
2024-03-27 02:39:34,430 - INFO - Assembly graph file: /fs03/jm41/Zarul/C002_D1_results/flye/assembly_graph.gfa
2024-03-27 02:39:34,430 - INFO - Contig paths file: /fs03/jm41/Zarul/C002_D1_results/flye/assembly_info.txt
2024-03-27 02:39:34,430 - INFO - Abundance file: /fs03/jm41/Zarul/C002_D1_results/binning_medaka/metacoag/coverm_abundance.tsv
2024-03-27 02:39:34,430 - INFO - Final binning output file: /fs03/jm41/Zarul/C002_D1_results/binning_medaka/metacoag
2024-03-27 02:39:34,430 - INFO - Marker gene file hmm: auxiliary/marker.hmm
2024-03-27 02:39:34,430 - INFO - Minimum length of contigs to consider: 1000
2024-03-27 02:39:34,430 - INFO - Depth to consider for label propagation: 10
2024-03-27 02:39:34,431 - INFO - p_intra: 0.1
2024-03-27 02:39:34,431 - INFO - p_inter: 0.01
2024-03-27 02:39:34,431 - INFO - Do not use --cut_tc: False
2024-03-27 02:39:34,431 - INFO - mg_threshold: 0.5
2024-03-27 02:39:34,431 - INFO - bin_mg_threshold: 0.33333
2024-03-27 02:39:34,431 - INFO - min_bin_size: 200000 base pairs
2024-03-27 02:39:34,431 - INFO - d_limit: 20
2024-03-27 02:39:34,431 - INFO - Number of threads: 8
2024-03-27 02:39:34,431 - INFO - MetaCoAG started
2024-03-27 02:39:53,232 - INFO - Total number of contigs available: 269678
2024-03-27 02:39:58,801 - INFO - Total number of edges in the assembly graph: 77552
2024-03-27 02:39:58,928 - INFO - Total isolated contigs in the assembly graph: 244283
2024-03-27 02:39:58,929 - INFO - Obtaining lengths and coverage values of contigs
2024-03-27 02:40:18,190 - INFO - Total long contigs: 267613
2024-03-27 02:40:18,190 - INFO - Total isolated long contigs in the assembly graph: 243244
2024-03-27 02:40:18,191 - INFO - Obtaining tetranucleotide frequencies of contigs
2024-03-27 02:47:08,567 - INFO - Scanning for single-copy marker genes
2024-03-27 02:47:08,636 - INFO - .hmmout file already exists
2024-03-27 02:47:08,636 - INFO - Obtaining contigs with single-copy marker genes
Traceback (most recent call last):
  File "/home/mzar0002/miniconda3/envs/metacoag_/bin/metacoag", line 1260, in <module>
    main()
  File "/fs03/jm41/Zarul/envs/metacoag/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs03/jm41/Zarul/envs/metacoag/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/fs03/jm41/Zarul/envs/metacoag/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs03/jm41/Zarul/envs/metacoag/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mzar0002/miniconda3/envs/metacoag_/bin/metacoag", line 613, in main
    ) = marker_gene_utils.get_contigs_with_marker_genes(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs03/jm41/Zarul/envs/metacoag/lib/python3.12/site-packages/metacoag_utils/marker_gene_utils.py", line 147, in get_contigs_with_marker_genes
    contig_num = contig_names_rev[contig_name]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'contig_4488'
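
The traceback ends in a dictionary lookup (`contig_names_rev[contig_name]`), and the log notes that the .hmmout file already existed. One possibility worth ruling out, which is an assumption and not something confirmed in this thread, is that the reused .hmmout refers to contigs from a different version of the assembly. Below is a rough Python sketch of such a check. The function `missing_contigs`, the regex used to pull a contig ID out of a target name, and the sample lines are all hypothetical; the real target names depend on how the gene-prediction step labels sequences.

```python
import re


def missing_contigs(hmmout_lines, known_contigs, pattern=r"^(contig_\d+)"):
    """Return sorted contig IDs seen in hmmout-style lines but not in known_contigs.

    Assumes the first column is a target name that starts with the contig ID.
    """
    prefix = re.compile(pattern)
    missing = set()
    for line in hmmout_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        target = line.split()[0]
        match = prefix.match(target)
        if match and match.group(1) not in known_contigs:
            missing.add(match.group(1))
    return sorted(missing)


# Hypothetical inputs: IDs taken from the contigs file, and a few table lines.
known = {"contig_4485", "contig_4487"}
demo_lines = [
    "# target name  accession  query name",
    "contig_4485_10_400_+  -  markerA",
    "contig_4488_1_900_-   -  markerB",
]
print(missing_contigs(demo_lines, known))  # ['contig_4488']
```

If a check like this reports IDs that are absent from assembly.fasta, regenerating the .hmmout from the current assembly would be one thing to try.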

Thank you :pray:

Vini2 commented 1 month ago

Hi @ZarulHanifah,

Sorry about the delay in getting back to you. Were you able to sort out this error?

Thanks, Vijini

ZarulHanifah commented 1 month ago

I haven't been able to sort this out. I'm preparing a GDrive link with the relevant files...