metagentools / MetaCoAG

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
https://metacoag.readthedocs.io/en/stable/
GNU General Public License v3.0

keyError in get_cov_len_megahit #3

Open jamesabbott opened 2 years ago

jamesabbott commented 2 years ago

I'm getting the following error with a megahit assembly:

  File "/cluster/gjb_lab/jabbott/miniconda3/envs/metacoag/MetaCoAG/src/metacoag_main.py", line 269, in <module>
    abundance_file=abundance_file)
  File "/cluster/gjb_lab/jabbott/miniconda3/envs/metacoag/MetaCoAG/src/metacoag_utils/feature_utils.py", line 237, in get_cov_len_megahit
    n_samples = len(coverages[0])
KeyError: 0

Looking at the code, coverages is a dict, but in my case it has no entry with a key of 0, so the lookup fails. If my interpretation is correct, a key of 0 would hold the coverage of contig 0. Changing feature_utils.py line 237 to:

     n_samples=len(coverages)

allows it to proceed, although I'm not convinced the returned value is actually what was intended.

Does this sound like the correct approach?

Vini2 commented 2 years ago

Hello @jamesabbott,

Thanks for posting this issue. I'm currently testing MetaCoAG on MEGAHIT assemblies as they have a different format than metaSPAdes.

Setting n_samples=len(coverages) would not be correct as this will set the number of samples to the length of the coverages dictionary.
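
To make the distinction concrete, here is a minimal sketch. It assumes, as the traceback suggests, that coverages maps a contig key to a list holding one coverage value per sample. The helper name count_samples is hypothetical, and this is not necessarily how MetaCoAG handles it.

```python
def count_samples(coverages):
    """Return the number of per-sample coverage values stored per contig.

    Assumes `coverages` maps a contig key to a list of coverage values,
    one per sample (an assumption based on the traceback, not on the
    MetaCoAG source).
    """
    if not coverages:
        raise ValueError("no coverage values were loaded")
    # Take any entry rather than assuming that a key of 0 exists.
    first_values = next(iter(coverages.values()))
    return len(first_values)


# Pooled reads: one coverage value per contig, with keys that do not start at 0.
coverages = {5: [9.41], 7: [1.4], 9: [3.8376]}

print(len(coverages))            # number of contigs
print(count_samples(coverages))  # number of samples
```

With this toy input, len(coverages) counts the three contigs, whereas the per-contig list length gives the single pooled sample.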

Can you share with me the format of the abundance file you are using?

Thank you!

Best regards, Vijini

jamesabbott commented 2 years ago

Hi Vijini,

I thought this approach was probably wrong! My abundance file looks like:

k127_21251076 9.4100
k127_17649206 1.4000
k127_1080564 3.8376
k127_5402820 5.3981
k127_9364888 1.8202
k127_7203760 0.5843
k127_10805640 3.5836
k127_1 13.0039

Having reread the documentation this could be my problem:

“Abundance file (in .tsv format) with a contig in a line and its coverage in each sample.”

Previous binning tools I’ve used have worked only on pooled reads so I mapped the full set of reads to the megahit assembly – is the intention here that I do this separately for the reads in each sample?

Many thanks James


Vini2 commented 2 years ago

Hello @jamesabbott,

Thanks for sharing the format of your abundance file.

MetaCoAG is designed to support coverages from both pooled reads and reads from individual samples, so that should not be an issue.

I think the issue is with the format of the abundance file you are using. Can you please check whether there is a space or a tab between the contig ID and the coverage value in each line? MetaCoAG expects a .tsv (tab-separated) file, so the values should be separated by a tab, not by a space. If possible, can you also attach the abundance file here?
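
One way to check the delimiter programmatically is sketched below. The function name find_delimiter_problems is hypothetical and the checks are illustrative only; MetaCoAG's own parser may accept or reject different things.

```python
def find_delimiter_problems(lines):
    """Return (line_number, message) pairs for lines that are not tab-separated.

    Each line is expected to hold a contig ID followed by one or more
    numeric coverage values, all separated by tab characters.
    """
    problems = []
    expected_columns = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            problems.append((number, "no tab found between contig ID and coverage"))
            continue
        if expected_columns is None:
            expected_columns = len(fields)
        elif len(fields) != expected_columns:
            problems.append((number, "column count differs from the first line"))
        for value in fields[1:]:
            try:
                float(value)
            except ValueError:
                problems.append((number, f"coverage value {value!r} is not a number"))
    return problems


good = ["k127_21251076\t9.4100\n", "k127_17649206\t1.4000\n"]
bad = ["k127_21251076 9.4100\n"]

print(find_delimiter_problems(good))
print(find_delimiter_problems(bad))
```

To run it on a real file, pass an open file handle, for example `with open(path) as handle: find_delimiter_problems(handle)`.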

Thank you! Vijini

jamesabbott commented 2 years ago

Hi Vijini,

I’ve just attached the head of the file since the total size is >400Mb. As far as I can see this is correctly tab-delimited:

M-000582:~ jabbott $ cat -vet coverage_top.txt
k127_21251076^I9.4100$
k127_17649206^I1.4000$
k127_1080564^I3.8376$
k127_5402820^I5.3981$
k127_9364888^I1.8202$
k127_7203760^I0.5843$
k127_10805640^I3.5836$
k127_1^I13.0039$
k127_21611263^I1.9070$
k127_15127896^I3.6311$

Best Regards James


k127_21251076 9.4100
k127_17649206 1.4000
k127_1080564 3.8376
k127_5402820 5.3981
k127_9364888 1.8202
k127_7203760 0.5843
k127_10805640 3.5836
k127_1 13.0039
k127_21611263 1.9070
k127_15127896 3.6311

Vini2 commented 2 years ago

Hello @jamesabbott,

May I ask whether you used the original .fastg file output by MEGAHIT as the assembly graph file?

Currently, MetaCoAG supports only .gfa files for the assembly graph file. I'm working on adding support for .fastg files as well. I have updated the documentation; if this is the case, I'm sorry for the confusion.

Also, can you share with me the first few lines of the assembly graph file as well?

Thank you! Vijini

jamesabbott commented 2 years ago

Hi Vijini,

No problem – the top of the assembly graph is attached. I converted the megahit fastg file to gfa using Heng Li’s gfa1: https://github.com/lh3/gfa1.

I realise now that there are two different versions of gfa. Should this be gfa2 format?

Many thanks James


Vini2 commented 2 years ago

Hello @jamesabbott,

MetaCoAG currently supports assembly graphs in GFA1 format. I have also tested Heng Li’s fastg2gfa script (https://github.com/lh3/gfa1/blob/master/misc/fastg2gfa.c), and the assembly graph it produces works fine with MetaCoAG.

Please let me know how your run goes with the new GFA assembly graph. Let me know if you come across further issues.

Thank you very much for using MetaCoAG and pointing out the issues!

Best regards, Vijini

jamesabbott commented 2 years ago

Hi Vijini,

The previously attached gfa file was created using fastg2gfa, so in theory it should be ok. I’ve just validated it with gfapy and it seems to be ok, and is identified as gfa1. The process I’ve gone through to generate the coverage file and gfa file from the megahit output is as follows:

bbwrap.sh ref=megahit/final.contigs.fa in=1.fq.gz in2=2.fq.gz out=aln.sam.gz \
    kfilter=22 subfilter=15 maxindel=80
pileup.sh in=aln.sam.gz out=cov.txt

# extract contig id and coverage from pileup output
cat cov.txt|grep -v '^#'|awk -F"\t" '{print $1"\t"$2}' | awk -F" " '{print $1"\t"$5}' > coverage.txt

megahit_toolkit contig2fastg 127 megahit/final_contigs/.fa > final_contigs.fastg
fastg2gfa final_contigs.fastg > final_contigs.gfa
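
For readers who find the awk pipeline hard to follow, the same extraction step can be sketched in Python. This assumes what the awk command implies: header lines start with '#', the contig ID is in the first tab-separated column and the average coverage in the second. The function name pileup_to_abundance and the example row are illustrative, not taken from a real pileup.sh run.

```python
def pileup_to_abundance(pileup_lines):
    """Yield 'contig_id<TAB>average_coverage' strings from pileup.sh-style lines.

    Mirrors the awk pipeline: skip header lines starting with '#', keep the
    first two tab-separated columns, and cut the contig ID at the first
    whitespace so that the extra MEGAHIT header fields are dropped.
    """
    for line in pileup_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\r\n").split("\t")
        contig_id = columns[0].split()[0]
        coverage = columns[1]
        yield f"{contig_id}\t{coverage}"


# Illustrative input only: the column names and extra header fields are assumed.
example = [
    "#ID\tAvg_fold\tLength\n",
    "k127_1 flag=1 multi=3.0000 len=303\t13.0039\t303\n",
]
for row in pileup_to_abundance(example):
    print(row)
```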

The only other thing I wasn’t sure about was the choice of kmer size to use when creating the fastg file from the megahit final_contigs.fa, since a range of kmers is used to generate the final contig set. I tried using an intermediate set of contigs produced with a single kmer, but that clearly didn’t work: errors occurred due to missing dict keys for particular contig ids.

Any other suggestions you can make would be most welcome!

Many thanks James


chassenr commented 2 years ago

Hi,

I am having the same issue... I generated the coverage file with coverM (mean coverage). This coverage output worked without problems when using MetaCoAG on a spades assembly. Now I also wanted to test MetaCoAG on a megahit assembly. I used the following approach to generate the required gfa file:

megahit_toolkit contig2fastg 141 k141.contigs.fa > k141.fastg
/sw/bio/gfa1/20160914/gfa1/misc/fastg2gfa k141.fastg > k141.gfa

Here are the first lines of the input files (for contigs only headers are shown):

Any ideas?

Thanks!

PS: Actually, checking again, the line references in the error message are different:

Traceback (most recent call last):
  File "/sw/bio/MetaCoAG/1.0/MetaCoAG/src/metacoag_main.py", line 269, in <module>
    abundance_file=abundance_file)
  File "/sw/bio/MetaCoAG/1.0/MetaCoAG/src/metacoag_utils/feature_utils.py", line 232, in get_cov_len_megahit
    n_samples = len(coverages[0])
KeyError: 0

Vini2 commented 2 years ago

Hello @jamesabbott and @chassenr,

I'm extremely sorry for getting back to you so late.

I have fixed the KeyError in the MEGAHIT version of MetaCoAG (Commit 4540c66a4b5af5108c9da1aca4bc42af56003a45).

Please pull the latest version from the repo and give it a try. Let me know how things go.

Thank you very much for pointing out this error!

Vini2 commented 2 years ago

Hello @jamesabbott,

About your question on what kmer size to use when creating the .fastg file from the megahit final_contigs.fa: I have seen that the connectivity information changes when you change the kmer size. It depends on how good your assembly is. k=141 can be a good choice for assemblies obtained from reads with a read length of 150-300 bp, and k=77 can be a good choice for assemblies obtained from reads with a read length of 100 bp.

Hope this helps. Let me know if you come across any issues. I truly appreciate the input.

Best regards, Vijini

jamesabbott commented 2 years ago

Hi Vijini,

Many thanks – I’ve updated my installation and have started rerunning the assembly. I’ll try a few different kmer sizes to see how the results compare.

Best Regards James


jamesabbott commented 2 years ago

Just to let you know that this seems to have solved the problem: the run has now continued past the point where it previously failed.

I did have a problem with hmmsearch, since the distributed version is not compatible with our RHEL7-based cluster due to a glibc incompatibility. I worked round this by installing hmmer via conda and updating metacoag_utils/marker_gene_utils.py line 15 to hmmExeURL = 'hmmsearch'. It would probably be more robust for you to include hmmer in the environment.yml file rather than distributing a precompiled version of hmmer with your package.
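
As a small illustration of relying on PATH rather than a bundled binary, the sketch below looks the executable up with the standard library. The helper name resolve_tool is hypothetical and this is not code from MetaCoAG.

```python
import shutil


def resolve_tool(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


hmmsearch_path = resolve_tool("hmmsearch")
if hmmsearch_path is None:
    print("hmmsearch not found on PATH; install hmmer, for example via conda")
else:
    print(f"using hmmsearch at {hmmsearch_path}")
```

Checking up front like this gives a clear message instead of a failure later in the run when the tool is first invoked.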

Vini2 commented 2 years ago

Hello @jamesabbott,

I have added fraggenescan and hmmer to the conda environment.yml file, as you suggested. Commit ID: 938790cd747729365c7551e16e1ce9d94184d4c5

Thank you very much for your input. It has been very useful to improve MetaCoAG.

AmaliT commented 1 week ago

Hi @Vini2

I am getting a similar error, but this is when using the SPAdes assembler. Any suggestions would be appreciated.

(metacoag)[hraaxt@DS]$ metacoag --assembler spades --graph Data/Indv_assembly/SPAdes/SPAdes-CG3_YS_graph.gfa --contigs Data/Indv_assembly/SPAdes/SPAdes-CG3_YS_contigs.fasta  --paths Data/Indv_assembly/SPAdes/SPAdes-CG3_YS.paths --abundance BinningMAGs_spadesIndv/13.metacoag/coverm/SPAdes-CG3_YS.sp.1K.abud.mean.txt --output BinningMAGs_spadesIndv/13.metacoag/
2024-08-20 11:34:08,148 - INFO - Welcome to MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs.
2024-08-20 11:34:08,154 - INFO - Input arguments: 
2024-08-20 11:34:08,155 - INFO - Assembler used: spades
2024-08-20 11:34:08,155 - INFO - Contigs file: Data/Indv_assembly/SPAdes/SPAdes-CG3_YS_contigs.fasta
2024-08-20 11:34:08,155 - INFO - Assembly graph file: Data/Indv_assembly/SPAdes/SPAdes-CG3_YS_graph.gfa
2024-08-20 11:34:08,156 - INFO - Contig paths file: Data/Indv_assembly/SPAdes/SPAdes-CG3_YS.paths
2024-08-20 11:34:08,156 - INFO - Abundance file: BinningMAGs_spadesIndv/13.metacoag/coverm/SPAdes-CG3_YS.sp.1K.abud.mean.txt
2024-08-20 11:34:08,156 - INFO - Final binning output file: BinningMAGs_spadesIndv/13.metacoag/
2024-08-20 11:34:08,156 - INFO - Marker gene file hmm: auxiliary/marker.hmm
2024-08-20 11:34:08,157 - INFO - Minimum length of contigs to consider: 1000
2024-08-20 11:34:08,157 - INFO - Depth to consider for label propagation: 10
2024-08-20 11:34:08,157 - INFO - p_intra: 0.1
2024-08-20 11:34:08,158 - INFO - p_inter: 0.01
2024-08-20 11:34:08,158 - INFO - Do not use --cut_tc: False
2024-08-20 11:34:08,158 - INFO - mg_threshold: 0.5
2024-08-20 11:34:08,159 - INFO - bin_mg_threshold: 0.33333
2024-08-20 11:34:08,159 - INFO - min_bin_size: 200000 base pairs
2024-08-20 11:34:08,159 - INFO - d_limit: 20
2024-08-20 11:34:08,159 - INFO - Number of threads: 8
2024-08-20 11:34:08,160 - INFO - MetaCoAG started
2024-08-20 11:34:11,603 - INFO - Total number of contigs available: 462521
2024-08-20 11:34:31,118 - INFO - Total number of edges in the assembly graph: 76578
2024-08-20 11:34:32,824 - INFO - Total isolated contigs in the assembly graph: 416025
2024-08-20 11:34:32,826 - INFO - Obtaining lengths and coverage values of contigs
Traceback (most recent call last):
  File "/home/hraaxt/.conda/envs/metacoag/bin/metacoag", line 10, in <module>
    sys.exit(main())
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/metacoag/cli.py", line 266, in main
    metacoag_runner.main(args)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/metacoag/metacoag_runner.py", line 1116, in main
    run(args)
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/metacoag/metacoag_runner.py", line 356, in run
    sequences, coverages, contig_lengths, n_samples = feature_utils.get_cov_len(
  File "/home/hraaxt/.conda/envs/metacoag/lib/pypy3.9/site-packages/metacoag/metacoag_utils/feature_utils.py", line 132, in get_cov_len
    contig_num = contig_names_rev[record.id]
KeyError: 'NODE_1_length_101315_cov_637.129844'

I have checked my abundance file and it appears to be properly tab-delimited:

NODE_1_length_101315_cov_637.129844^I1606.21$
NODE_2_length_85797_cov_13.260176^I16.4542$
NODE_3_length_70682_cov_14.683903^I18.0536$
NODE_4_length_69444_cov_11.925694^I15.0263$
NODE_5_length_58127_cov_612.813438^I1565$
NODE_6_length_55890_cov_592.726014^I1506.27$
NODE_7_length_53484_cov_29.774130^I80.5852$
NODE_8_length_51875_cov_12.604863^I15.5829$
NODE_9_length_50947_cov_605.768490^I1529.91$
NODE_10_length_49721_cov_28.157875^I77.5952$

Vini2 commented 1 week ago

Hi @AmaliT,

Thanks for your interest in MetaCoAG.

Can you please attach your coverage file to a comment on this issue?

Thanks!

AmaliT commented 1 week ago

Please see the attached SPAdes-CG3_YS.sp.1K.abud.mean.txt

Vini2 commented 1 week ago

Hi @AmaliT,

Which version of MetaCoAG are you using? You can see the version using the command metacoag --version

AmaliT commented 1 week ago

(metacoag)[hraaxt@wbn004 DS]$ metacoag --version
metacoag, version 1.2.1

Vini2 commented 1 week ago

Can you please check if NODE_1_length_101315_cov_637.129844 is present in the files Data/Indv_assembly/SPAdes/SPAdes-CG3_YS_contigs.fasta and Data/Indv_assembly/SPAdes/SPAdes-CG3_YS.paths? I'm trying to figure out what can go wrong as all my tests pass.
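
A quick way to run this kind of cross-check over every contig, rather than one name at a time with grep, is sketched below. The function names are hypothetical, and the sequence "ACGT" in the example is a placeholder rather than real data.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def abundance_ids(lines):
    """Collect contig IDs (first tab-separated column) from abundance lines."""
    return {line.rstrip("\r\n").split("\t")[0] for line in lines if line.strip()}


def missing_ids(wanted, available):
    """Return the sorted IDs in `wanted` that are absent from `available`."""
    return sorted(set(wanted) - set(available))


fasta = [">NODE_1_length_101315_cov_637.129844\n", "ACGT\n"]
abundance = [
    "NODE_1_length_101315_cov_637.129844\t1606.21\n",
    "NODE_2_length_85797_cov_13.260176\t16.4542\n",
]

# IDs that have a coverage value but no sequence in the contigs file.
print(missing_ids(abundance_ids(abundance), fasta_ids(fasta)))
```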

AmaliT commented 1 week ago

Hi @Vini2

This is what I get - could it be the 1 causing the issue on paths? I noticed all of them seem to have *. I had to re-create the paths file from the gfa file, as I didn't have it.


[hraaxt@wbn005 SPAdes]$ grep "NODE_1_length_101315_cov_637.129844" SPAdes-CG3_YS_contigs.fasta
>NODE_1_length_101315_cov_637.129844
[hraaxt@wbn005 SPAdes]$ grep "NODE_1_length_101315_cov_637.129844" SPAdes-CG3_YS.paths
P       NODE_1_length_101315_cov_637.129844_1   452371-,46936328-,349357-,1056222+,568751-,1037928-,45055539+,1059784-,612488-,2436762+,9186418+,4775114+,45016622-,785775+,29079144-,4754406-,702324+,4754504+,30991006+,38374849-,20560135+,42108096+,466331-,524174-,1232573+,448391-,1196672+,1009340+,245135-,625862+,996920+,812668-,568751-,1024684+,42421850+,1056534-,955330+,1051270-,704267+,4754442-,708379+,1053198-,588717+,4752842-     

Vini2 commented 1 week ago

Hi @AmaliT,

Looks like the format of the contigs.paths file is wrong. Lines starting with P should only appear in the assembly graph (.gfa) file.

The contigs.paths file should look something like this (contig names may differ).

NODE_1_length_36105_cov_23.323162
1565+,2359-,4531-,596892-,605901+,519937+,1266517-,487879-,605901+,933161+,1266517-,571160-
NODE_1_length_36105_cov_23.323162'
571160+,1266517+,933161-,605901-,487879+,1266517+,519937-,605901-,596892+,4531+,2359+,1565-
NODE_2_length_35344_cov_36.120916
568636+,596234+,279397-,279390-,495473+,2175+,468435+
NODE_2_length_35344_cov_36.120916'
468435-,2175-,495473-,279390+,279397+,596234-,568636-
...

Can you please double-check the input files? The assembly graph file should begin with lines starting with S.
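
The layout described above can be checked with a short script. This is a rough sketch that assumes segment IDs are numeric with a trailing + or -, as in the example. The function name check_paths_lines is hypothetical, and the P record in the usage example is shortened from the one posted earlier in this thread.

```python
import re

# A comma-separated list of oriented segment IDs, optionally ending in ';'.
SEGMENT_LIST = re.compile(r"^\d+[+-](,\d+[+-])*;?$")


def check_paths_lines(lines):
    """Return (line_number, message) pairs for lines that break the layout."""
    problems = []
    previous = None   # kind of the last meaningful line: None, "name" or "segments"
    name_line = None  # line number of the most recent contig name
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        if text.split()[0] == "P":
            problems.append((number, "GFA 'P' record; expected only in the .gfa file"))
            continue
        if SEGMENT_LIST.match(text):
            if previous is None:
                problems.append((number, "segment list before any contig name"))
            previous = "segments"
        else:
            if previous == "name":
                problems.append((name_line, "contig name without a segment list"))
            previous, name_line = "name", number
    if previous == "name":
        problems.append((name_line, "contig name without a segment list"))
    return problems


good = [
    "NODE_2_length_35344_cov_36.120916\n",
    "568636+,596234+,279397-,279390-,495473+,2175+,468435+\n",
    "NODE_2_length_35344_cov_36.120916'\n",
    "468435-,2175-,495473-,279390+,279397+,596234-,568636-\n",
]
bad = ["P\tNODE_1_length_101315_cov_637.129844_1\t452371-,46936328-\n"]

print(check_paths_lines(good))
print(check_paths_lines(bad))
```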

Thanks!

AmaliT commented 1 week ago

Hi @Vini2

Thanks for that. It's the paths file causing the issue. I was using grep to try to create the paths files, as I didn't have them. Any suggestions on how one might get a paths file when it's missing (it got deleted as a result of cleanup; I have the fasta and gfa)?

I have managed to get past this point with a set for which I had the paths file from SPAdes (version 3.15.3).

Thanks for the help with debugging :)

Much appreciated

Vini2 commented 1 week ago

Hi @AmaliT,

No problem! Currently, I don't have any way to generate the paths file. I will see if I can come up with a script.
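
Until such a script exists, one possible starting point is sketched below. It is untested against real SPAdes output and rests on several assumptions: that each GFA P record is named <contig>_<n>, that the pieces of one contig can be joined with ';' on one line, and that the reverse entry is the reversed path with flipped orientations (which is what the example paths file above shows). All function names are hypothetical, and the P record in the usage example is made up by appending _1 to a contig name from that example.

```python
import re


def flip(segment):
    """Flip the orientation sign of a segment such as '1565+'."""
    return segment[:-1] + ("-" if segment.endswith("+") else "+")


def reverse_path(path):
    """Reverse a comma-separated path and flip every orientation."""
    return ",".join(flip(segment) for segment in reversed(path.split(",")))


def gfa_paths_to_contig_paths(gfa_lines):
    """Build contigs.paths-style lines from GFA 'P' records.

    Assumes path names look like '<contig>_<n>', where <n> numbers the
    pieces of a contig path, and joins the pieces with ';' (an assumption;
    SPAdes itself may lay out gapped paths differently).
    """
    pieces = {}
    for line in gfa_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3 or fields[0] != "P":
            continue
        match = re.match(r"^(.*)_(\d+)$", fields[1])
        if not match:
            continue
        contig, index = match.group(1), int(match.group(2))
        pieces.setdefault(contig, []).append((index, fields[2]))

    output = []
    for contig, parts in pieces.items():
        ordered = [path for _, path in sorted(parts)]
        output.append(contig)
        output.append(";".join(ordered))
        output.append(contig + "'")
        output.append(";".join(reverse_path(path) for path in reversed(ordered)))
    return output


forward = "568636+,596234+,279397-,279390-,495473+,2175+,468435+"
record = "P\tNODE_2_length_35344_cov_36.120916_1\t" + forward + "\t*\n"
for out_line in gfa_paths_to_contig_paths([record]):
    print(out_line)
```

The output should be compared against a genuine contigs.paths file from the same SPAdes version before being used for binning.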

Y4nkey commented 2 days ago

Hi @Vini2,

I am quite new to metacoag and have run it on other megahit/metaspades assemblies with no issues except for one megahit assembly.

Here is my error message:

2024-08-29 16:11:25,632 - INFO - Welcome to MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs.
2024-08-29 16:11:25,632 - INFO - Input arguments:
2024-08-29 16:11:25,632 - INFO - Assembler used: megahit
2024-08-29 16:11:25,632 - INFO - Contigs file: /srv/scratch/z5363929/single_assembly/kris_data_bp_deconstruction/289/trimmed/megahit_output/final.contigs.fa
2024-08-29 16:11:25,632 - INFO - Assembly graph file: 289_megahit.gfa
2024-08-29 16:11:25,632 - INFO - Contig paths file: None
2024-08-29 16:11:25,632 - INFO - Abundance file: 289_megahit_abundance.tsv
2024-08-29 16:11:25,632 - INFO - Final binning output file: /srv/scratch/z5363929/single_assembly/kris_data_bp_deconstruction/289/trimmed/megahit_binning/289_metacoag_megahit_output
2024-08-29 16:11:25,632 - INFO - Marker gene file hmm: auxiliary/marker.hmm
2024-08-29 16:11:25,632 - INFO - Minimum length of contigs to consider: 1000
2024-08-29 16:11:25,632 - INFO - Depth to consider for label propagation: 10
2024-08-29 16:11:25,632 - INFO - p_intra: 0.1
2024-08-29 16:11:25,632 - INFO - p_inter: 0.01
2024-08-29 16:11:25,633 - INFO - Do not use --cut_tc: False
2024-08-29 16:11:25,633 - INFO - mg_threshold: 0.5
2024-08-29 16:11:25,633 - INFO - bin_mg_threshold: 0.33333
2024-08-29 16:11:25,633 - INFO - min_bin_size: 200000 base pairs
2024-08-29 16:11:25,633 - INFO - d_limit: 20
2024-08-29 16:11:25,633 - INFO - Number of threads: 8
2024-08-29 16:11:25,633 - INFO - MetaCoAG started
2024-08-29 16:11:37,332 - INFO - Total number of contigs available: 1014430
2024-08-29 16:11:39,239 - INFO - Total number of edges in the assembly graph: 162046
2024-08-29 16:11:39,966 - INFO - Total isolated contigs in the assembly graph: 843129
2024-08-29 16:11:39,966 - INFO - Obtaining lengths and coverage values of contigs
Traceback (most recent call last):
  File "/home/z5363929/miniconda3/envs/metacoag/bin/metacoag", line 10, in <module>
    sys.exit(main())
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/metacoag/cli.py", line 266, in main
    metacoag_runner.main(args)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/metacoag/metacoag_runner.py", line 1116, in main
    run(args)
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/metacoag/metacoag_runner.py", line 347, in run
    ) = feature_utils.get_cov_len_megahit(
  File "/home/z5363929/miniconda3/envs/metacoag/lib/python3.9/site-packages/metacoag/metacoag_utils/feature_utils.py", line 183, in get_cov_len_megahit
    contig_num = contig_names_rev[graph_to_contig_map_rev[record.id]]
KeyError: 'k141_634024'

here is the start of my abundance file:

k141_718556 3.5555556
k141_803092 27.656822
k141_549484 17.732317
k141_803093 4.9058523
k141_549485 7.678186
k141_549486 6.2780085
k141_718557 21.584778
k141_549487 3.3239436
k141_549488 3.25
k141_634020 8.736566
k141_718558 4.7214932
k141_803094 85.840775
k141_634021 24.259947

the start of my contigs file:

`>k141_718556 flag=1 multi=3.0000 len=303 AAAGACTGCCACCCTTGAAGTTCAAACTCTGCTGCACCCTGCCACCGGCATGGGCGGCGCACGCCCGAAGACAACGATCGAAGACGACGATGGCTTGTGGCTGGCGAAGTTTCCGATGGACAGCGACGCGCTGCCGATCACGCGGATCGAGCACGCCATGCTGGACCTGGCGAAACGCTGTGGCATCGGCACCATCGACCACAAGCTGCTGGACGTCGAAGGCGTCAAGGAGCCGGTGTTCATGATTCGCCGCTTCGACCGAACCCCCAGCGAGAAGGTCGCCGGGCACTATGAGCGCAGGGG

k141_803092 flag=0 multi=13.7902 len=2114 TCCAAGTCCTCGGGGGACAAATCCTCTGCATCAATCATTTCATTGCGAGCATCTTTTATTGAATGCAGCAGCTCGTCCAATTTCACATGCACGGCTTGGCTATCTTTGTTTTGGCTACGCTGTATAAGAAACACCATAAGAAAAGTCACGATGGTGGTGCCTGTGTTAATGACGAGTTGCCAAGTTTCAGAATAGTGAAAGAGGGGGCCACTGAGTACCCAAATCAGTACCAGGGCTAAGGCCAGCGAAAATGCCGCTGGCGAGCCAGCCCATTTTGTGACGTGTGTAGCGAACCGATCGAACCATCGTAAAATACGTTGAGATAGCGATGGGATTTCGCTTTTTTCGGGTGGATTGTCGTACTGATCGCTCATCAACGCCTCCTTTGCCCCATGTTTATGCGTGGCTGTTGTGCTTCCCTAAACGTTTGCCGCTAACCTCTATCGCCAGGT`

the start of my gfa file:

S NODE_1_length_303_cov_3.0000_ID_1 AAAGACTGCCACCCTTGAAGTTCAAACTCTGCTGCACCCTGCCACCGGCATGGGCGGCGCACGCCCGAAGACAACGATCGAAGACGACGATGGCTTGTGGCTGGCGAAGTTTCCGATGGACAGCGACGCGCTGCCGATCACGCGGATCGAGCACGCCATGCTGGACCTGGCGAAACGCTGTGGCATCGGCACCATCGACCACAAGCTGCTGGACGTCGAAGGCGTCAAGGAGCCGGTGTTCATGATTCGCCGCTTCGACCGAACCCCCAGCGAGAAGGTCGCCGGGCACTATGAGCGCAGGGG LN:i:303 S NODE_2_length_2114_cov_13.7902_ID_3 TCCAAGTCCTCGGGGGACAAATCCTCTGCATCAATCATTTCATTGCGAGCATCTTTTATTGAATGCAGCAGCTCGTCCAATTTCACATGCACGGCTTGGCTATCTTTGTTTTGGCTACGCTGTATAAGAAACACCATAAGAAAAGTCACGATGGTGGTGCCTGTGTTAATGACGAGTTGCCAAGTTTCAGAATAGTGAAAGAGGGGGCCACTGAGTACCCAAATCAGTACCAGGGCTAAGGCCAGCGAAAATGCCGCTGGCGAGCCAGCCCATTTTGTGACGTGTGTAGCGAACCGATCGAACCATCGTAAAATACGTTGAGATAGCGATGGGATTTCGCTTTTTTCGGGTGGATTGTCGTACTGATCGCTCATCAACGCCTCCTTTGCCCCATGTTTATGCGTGGCTGTTGTGCTTCCCTAAACGTTTGCCGCTAACCTCTATCGCCAGGTAAGCGACAAGGGCAAGACATAACGGCACGATGGCATACAGCACCCGATACAACAGTATGGCAGCCAGGACTTCGCTATTGGGCATGCGCGGAGCCAGTGCGGCCACAAATATCGCCTCGGTAACGCCTAGGCCACCGGGAATGTGCGCTATCACGGCAGCGATGCTGCTAAATAGCAGTATGGCAAGTACTTCCAGATACGGTGCGCCTTGCTCCAGCAGTATGTAAATGATCAATGCCATGGTCATCCACGACAGCATCGCAAGTACGCTTTGCAGCAATGCGGTTGCTAGCGTAGGTAATTCGATGTGGTGGCCGCGCACCATCCATGTACGGCGATTCGAGAACGCGCAGAAAGCCAGGTAACCGACTGCCACGGCAAGCATTCCAGCCCCCAAGGCAATAAGTAATCCATCGCTCACCCCCCATTCCTTTGGAACAGGCATGGTCCTGGTAACGAGCAATGCCCCGGCCAGCCAGCAGTAGCCCAGCCAATTTGTCACTGTGCTGAAGAGCACGACGCGAACAGCCGATGGGCGGCGCACACCCAGCTTCAGATATAGTCGCAAGCGTGCGGCGACACCGCCTATGAGTACGCCAAGGCTTTGATTCAGTCCGTAGCTGATTGCGGCCATACCCATCACTTTCGCCGCCGGAACCTTGTGATGTGCGTACCGGCGGGCCACAAGATCAAAGCTGGCATAGGCGGTATACCCTAGCAATACCAAACCTGTCGCCATGGCGATGGTTTTGTTTCCTATATTTTGCGCTGCTTGCAGAACGTCGCGCCAGTCAATGGTCTTTGCAAGGTTAAATAGCAGGAAGATGACTACGACGACTACGGCGGCGATTAAAATCCGCTTTATTGTCGGCCACGCGTTGCTTGCCCAGTTCAGGAGGGTGCCGGCCTGGGTGCTCATGCGACCTCCTGTTTCTTGATGTCTGCTGCATTTTCAATACCAATGGACTGCAGTCTGGGCTTGTGCGCCGGTAAGCTTCCCGCCCACGCGGGAAAATGGCGTAAAAAATGAAACACCACCACACTAAGCCACAACCGGCGCAGCCGCCGACCGGGAAGCACTGGCGTCGGCGCAAGCTTGCAATGCTCGGCAATCAGGCCTTCCAGATTTTTGCGCAATGTTGCATTGAAGATAGCGTCCTG
GATAACGACATTGGCTTCCAGGTTCAGTGACAGACTTAGCGGCTCGAGATTGCTCGAACCTATCGTGGACCAGGTGGCATCCACGCAGGCGACCTTGCCATGTAATGGGCGGTCGCAATACTCATAAATGCACACTCCTGCCTGCGTCAAGTATTCATACAGCATGCGCGCGGCGAACCGGGCAATTGCCATGTCGGGCTGACCCTGAAGGATTAGCCGCACGCGCACCCCCCGTTTGCGAGCACGCACGAGCGCACGCAAAAATCGATAGCCTGGAAAAAAATAAGCATTGGCGAGAAGAATGTCATGGCGTGCGGCCTGAATGGCCCTACGGTAAGCTTTTTCTATGTCGTTGCGATGCGTCCCGTTATCGCGGACAACCAGGCGGCCACAGGCGTCGCCCAATGGGGGAACTTCCTTGGGGCCGGCGCGCCACCATCTTCTTCCTGCCTTTGCCTTTATTAAGGCTTCACGTGCGTATTGGCAGATATCGGCTGCCAA LN:i:2114 L NODE_2_length_2114_cov_13.7902_ID_3 - NODE_468602_length_19322_cov_10.9656_ID_937203 + 0M L NODE_2_length_2114_cov_13.7902_ID_3 - NODE_825126_length_28175_cov_23.1830_ID_1650251 - 0M S NODE_3_length_1069_cov_3.6347_ID_5 TCGCCGTGATCATCCAGGTCGGCGCGCCGCGGCGGGAAGGCCGACAGGTCCGGGGCTTGGAACAGCGGCGACATCGAAACCGGGCGCTGACGCGGAGCCGGCGGAGCCGAGTCGGTGTCGGCGGTCGTGGCCGCGGCGTCAGCGGGCGCTGCGACATCCGCTTTCTTCGTCCGCGAGGACGGGGCACGCTTCGCGCGGGTCTTCGGTGCCTCGGACGTGGCGGCTGCGGAGCCCGAATCATCCCCGGTCGCGCTCTCGCCGGACTCCATCGTTTCCGCCGGCGCCGGGGGCGCTGGCGGTTCGGGGACGGATGCCGCTTTCTCGATCACCGGCTGCGCCGCGACCTTGGGCTGCCGGTGTGATCCAAACAGTCGACTGCGCTTTTTCACGCCGTCTTGTGTGCTGTTGTCTTTGTTCTCCACCATTGCTGGCGTACTCCTAGCCCACCACGGGTGGCCGCGCCACCAGTTCGGGTACTCTCTCACGCGCAAGGCCCACTTCGAGGTCTCACGCTAATCCTCACGTTCCGCCATCCGCTTGTCGCACCCGGTGCTCTGCACGCACTGCCCCTGTGGCACTGTTGCGCTGGCACTACGGCACTGCCTGCGGTGACGCTCGTCTTTGATCTTTCCCTTCGGAACTGGCTATCCGACACGCTTTCAATGGCGGCACGGTGTGCCGATTCGTCGTGGCCCGGCTCGGTGCCCGGGAAGCCTTCTTTGGCGCCATTCCGAGCGCGGCCCGGATGACTACTTCAAATTATCGCACGCATTGCCTGCATGCCGGCATTTGGGCCGTGCAATAATCCGAGAGTGCCCCCGATCCAACGCAAACGCCCCGTCGGTCTCGCTGTTTTTCTCGTCGTCGCGGGGATTGTCGGATTCATCGCCGCGTGGGCTCTGACGCTCGACAAGTTCCTCGTGCTCGCCAACCCGTCCGCGACCCTCGGCTGCAACATCAACCCGACCGTGCAGTGCGGAAAGAACCTGGCGGCCCCTCAGGGGTCCGTGTTCGGGTTCCCGAATCCGATCCTCGGGATTGCCGGATTCGTCGCCCCCCTGGTCGTGGG LN:i:1069

Would really appreciate any help at all. Sorry if I've put too much into this comment, I'm quite new to metagenomics.
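
Judging only from the excerpts above, the first GFA segment (NODE_1, length 303) lines up with the first contig (k141_718556, len=303), and the second segment with the second contig. One thing that may be worth ruling out is a mismatch between the two files further down. The sketch below pairs records by position and compares the declared lengths. Pairing by position is an assumption drawn from these excerpts, not confirmed behaviour of megahit_toolkit or MetaCoAG; the function names are hypothetical and the sequences in the example are placeholders.

```python
import re


def fasta_records(lines):
    """Return (contig_id, declared_length) pairs from MEGAHIT-style headers."""
    records = []
    for line in lines:
        header = line[1:].strip() if line.startswith(">") else ""
        if not header:
            continue
        length = re.search(r"\blen=(\d+)", header)
        records.append((header.split()[0], int(length.group(1)) if length else None))
    return records


def gfa_segments(lines):
    """Return (segment_name, declared_length) pairs from GFA 'S' records."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "S":
            continue
        length = re.search(r"_length_(\d+)_", fields[1])
        segments.append((fields[1], int(length.group(1)) if length else None))
    return segments


def pair_by_position(contigs, segments):
    """Pair contigs and segments by order; report count or length mismatches."""
    problems = []
    if len(contigs) != len(segments):
        problems.append(f"{len(contigs)} contigs but {len(segments)} segments")
    mapping = {}
    for (contig_id, contig_len), (segment, segment_len) in zip(contigs, segments):
        if contig_len != segment_len:
            problems.append(f"{contig_id} ({contig_len}) vs {segment} ({segment_len})")
        mapping[segment] = contig_id
    return mapping, problems


fasta = [
    ">k141_718556 flag=1 multi=3.0000 len=303\n",
    ">k141_803092 flag=0 multi=13.7902 len=2114\n",
]
gfa = [
    "S\tNODE_1_length_303_cov_3.0000_ID_1\tACGT\tLN:i:303\n",
    "S\tNODE_2_length_2114_cov_13.7902_ID_3\tACGT\tLN:i:2114\n",
]

mapping, problems = pair_by_position(fasta_records(fasta), gfa_segments(gfa))
print(mapping)
print(problems)
```

If the record counts differ, or lengths stop matching partway through, the graph was probably built from a different contigs file than the one passed to MetaCoAG.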