metagentools / MetaCoAG

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
https://metacoag.readthedocs.io/en/stable/
GNU General Public License v3.0
57 stars 5 forks source link

[Feature Request] Custom marker sets as an argument #1

Closed jolespin closed 1 year ago

jolespin commented 2 years ago

Hi there! I was hoping that you would eventually make an actual binning tool after I read your GraphBin papers! Looking forward to using this.

One aspect that many binners are lacking is their ability to handle fungus, protists, and sometimes CPR. Though, there's actually decent marker sets for these.

Would it be possible to include a custom HMM set as an argument for binning tool so we could pull these "hard to reach" organisms out easier?

I've been in slight contact w/ the DAS_Tool developer and they are working on integrating this into their binning refinement package (https://github.com/cmks/DAS_Tool/issues/69). It would be nice to use your tool and DAS_Tool synergistically to pull out protists that are largely overlooked.

IMO, this is the next big hurdle for metagenomics and also some low hanging fruit for developers.

Vini2 commented 2 years ago

Hi @jolespin!

Thanks for your comments on our work!

This is a wonderful suggestion. I believe I can add the functionality to provide a custom HMM set in MetaCoAG.

If you happen to know, could you share with me some resources on single-copy marker genes of microorganisms other than the ones available at https://github.com/merenlab/anvio/tree/master/anvio/data/hmm?

Thank you very much!

Best regards, Vijini

jolespin commented 2 years ago

Here's the CPR marker set:

https://github.com/Ecogenomics/CheckM/blob/master/custom_marker_sets/cpr_43_markers.hmm

Here's (what I believe) are fungal markers:

https://github.com/stajichlab/FGMP/blob/master/data/593_cleanMarkers.hmm.gz

Tricky thing about eukaryotes is that prokaryotic gene callers won't work well so it might be a good idea to have an option to include custom gene calls as well.

Vini2 commented 2 years ago

Thank you very much for these resources. I will take a look and try to integrate them as well.

Vini2 commented 2 years ago

Hi @jolespin,

I have added the functionality to provide a custom marker set to MetaCoAG. Please have a try when you are available and let me know how it goes. I am currently trying to integrate multiple marker sets at the same time.

Thank you!

jolespin commented 2 years ago

Running it for an assembly with known fungal bins right now.

(metacoag_env) -bash-4.2$ ./MetaCoAG --assembler spades --graph assembly_graph_with_scaffolds.gfa --contigs scaffolds.fasta --output metacoag_output --hmm ../../db/markersets/Fungi_593.hmm.gz --nthreads 4 --abundance coverage_noheader.tsv --paths scaffolds.paths
2021-10-16 13:04:59,537 - INFO - Welcome to MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs.
2021-10-16 13:04:59,538 - INFO - This version of MetaCoAG makes use of the assembly graph produced by SPAdes which is based on the de Bruijn graph approach.
2021-10-16 13:04:59,538 - INFO - Input arguments:
2021-10-16 13:04:59,538 - INFO - Assembler used: spades
2021-10-16 13:04:59,538 - INFO - Contigs file: scaffolds.fasta
2021-10-16 13:04:59,538 - INFO - Assembly graph file: assembly_graph_with_scaffolds.gfa
2021-10-16 13:04:59,538 - INFO - Contig paths file: scaffolds.paths
2021-10-16 13:04:59,538 - INFO - Abundance file: coverage_noheader.tsv
2021-10-16 13:04:59,538 - INFO - Final binning output file: metacoag_output/
2021-10-16 13:04:59,538 - INFO - Marker file: /local/ifs3_scratch/CORE/jespinoz/Testing/MetaCoAG/auxiliary/marker.hmm
2021-10-16 13:04:59,538 - INFO - Minimum length of contigs to consider: 1000
2021-10-16 13:04:59,538 - INFO - Depth to consider for label propagation: 10
2021-10-16 13:04:59,538 - INFO - p_intra: 0.1
2021-10-16 13:04:59,538 - INFO - p_inter: 0.01
2021-10-16 13:04:59,538 - INFO - mg_threshold: 0.5
2021-10-16 13:04:59,538 - INFO - bin_mg_threshold: 0.33333
2021-10-16 13:04:59,538 - INFO - min_bin_size: 200000 base pairs
2021-10-16 13:04:59,538 - INFO - d_limit: 20
2021-10-16 13:04:59,538 - INFO - Number of threads: 4
2021-10-16 13:04:59,538 - INFO - MetaCoAG started
2021-10-16 13:05:01,565 - INFO - Total number of contigs available: 86622
2021-10-16 13:05:02,420 - INFO - Total number of edges in the assembly graph: 27025
2021-10-16 13:05:02,420 - INFO - Obtaining lengths and coverage values of contigs
2021-10-16 13:05:03,959 - INFO - Obtaining tetranucleotide frequencies of contigs
2021-10-16 13:06:51,972 - INFO - Scanning for single-copy marker genes
2021-10-16 13:06:51,974 - INFO - Obtaining hmmout file
2021-10-16 13:08:13,619 - INFO - Obtaining contigs with single-copy marker genes
2021-10-16 13:08:13,648 - INFO - Number of contigs containing single-copy marker genes: 913
2021-10-16 13:08:13,648 - INFO - Determining contig counts for each single-copy marker gene
2021-10-16 13:08:13,649 - INFO - Initialising bins
2021-10-16 13:08:13,650 - INFO - Matching and assigning contigs with single-copy marker genes to bins
2021-10-16 13:08:24,020 - INFO - Further assigning contigs with single-copy marker genes
2021-10-16 13:09:02,892 - INFO - Propagating labels to connected vertices of unlabelled long contigs
2021-10-16 13:09:39,206 - INFO - Further propagating labels to vertices of unlabelled long contigs
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11557/11557 [00:17<00:00, 661.87it/s]
2021-10-16 13:09:57,572 - INFO - Further propagating labels to connected vertices of unlabelled long contigs
2021-10-16 13:11:25,639 - INFO - Elapsed time: 386.10053539276123 seconds
2021-10-16 13:11:25,675 - INFO - Writing the Final Binning result to file
2021-10-16 13:13:13,889 - INFO - Producing 21 bins...
2021-10-16 13:13:13,889 - INFO - Final binning results can be found in metacoag_output/bins/
2021-10-16 13:13:13,890 - INFO - Thank you for using MetaCoAG!

Update: It worked great and ran to completion (I believe)! Very fast too!

From the log file, can you tell if the correct marker set was used? I think it may have used the default set.

I need to compare the results against my previous bins detected via CONCOCT and METABAT2 but will update you soon with the N50s and potential contamination as well.

One more feature suggestion that would be really useful w/ limited work. A scaffolds_to_bin.tsv file where the first column is a scaffold and second column is the bin assignment. This could be the most concise output and could feed directly into tools like DAS_Tool or used w/ seqkit for grepping

jolespin commented 2 years ago

I think it's these lines:

https://github.com/Vini2/MetaCoAG/blob/2123c830a0b7d2612e6cd89d4cf3aa7d504b9219/MetaCoAG#L277-L295

They don't include the --hmm argument

Vini2 commented 2 years ago

Hi @jolespin,

My apologies. I have added the --hmm argument and fixed the issue. Please give it a try with your custom marker set and let me know how it goes. The log file should show the path of the marker file used as INFO - Marker file: <file_name>.

I have also added the functionality to output the contig_to_bin.tsv file. The default delimiter is set to the comma. You can change this to a tab by specifying --delimiter $'\t' when running MetaCoAG.

Thank you!

jolespin commented 2 years ago

Command:

(metacoag_env) -bash-4.2$ cat cmd.sh
./MetaCoAG/MetaCoAG --assembler spades --graph assembly_graph_with_scaffolds.gfa --contigs scaffolds.fasta --output metacoag_output --hmm ../db/markersets/Fungi_593.hmm.gz --nthreads 4 --abundance coverage_noheader.tsv --paths scaffolds.paths --delimiter $'\t'

Log:

(metacoag_env) -bash-4.2$ bash cmd.sh
2021-10-18 13:22:18,266 - INFO - Welcome to MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs.
2021-10-18 13:22:18,266 - INFO - This version of MetaCoAG makes use of the assembly graph produced by SPAdes which is based on the de Bruijn graph approach.
2021-10-18 13:22:18,267 - INFO - Input arguments:
2021-10-18 13:22:18,267 - INFO - Assembler used: spades
2021-10-18 13:22:18,267 - INFO - Contigs file: scaffolds.fasta
2021-10-18 13:22:18,267 - INFO - Assembly graph file: assembly_graph_with_scaffolds.gfa
2021-10-18 13:22:18,267 - INFO - Contig paths file: scaffolds.paths
2021-10-18 13:22:18,267 - INFO - Abundance file: coverage_noheader.tsv
2021-10-18 13:22:18,267 - INFO - Final binning output file: metacoag_output/
2021-10-18 13:22:18,267 - INFO - Marker file: ../db/markersets/Fungi_593.hmm.gz
2021-10-18 13:22:18,267 - INFO - Minimum length of contigs to consider: 1000
2021-10-18 13:22:18,267 - INFO - Depth to consider for label propagation: 10
2021-10-18 13:22:18,267 - INFO - p_intra: 0.1
2021-10-18 13:22:18,267 - INFO - p_inter: 0.01
2021-10-18 13:22:18,267 - INFO - mg_threshold: 0.5
2021-10-18 13:22:18,267 - INFO - bin_mg_threshold: 0.33333
2021-10-18 13:22:18,267 - INFO - min_bin_size: 200000 base pairs
2021-10-18 13:22:18,267 - INFO - d_limit: 20
2021-10-18 13:22:18,267 - INFO - Number of threads: 4
2021-10-18 13:22:18,267 - INFO - MetaCoAG started
2021-10-18 13:22:20,345 - INFO - Total number of contigs available: 86622
2021-10-18 13:22:21,368 - INFO - Total number of edges in the assembly graph: 27025
2021-10-18 13:22:21,369 - INFO - Obtaining lengths and coverage values of contigs
2021-10-18 13:22:22,901 - INFO - Obtaining tetranucleotide frequencies of contigs
2021-10-18 13:24:02,811 - INFO - Scanning for single-copy marker genes
2021-10-18 13:24:02,812 - INFO - .hmmout file already exists
2021-10-18 13:24:02,813 - INFO - Obtaining contigs with single-copy marker genes
2021-10-18 13:24:02,837 - INFO - Number of contigs containing single-copy marker genes: 913
2021-10-18 13:24:02,837 - INFO - Determining contig counts for each single-copy marker gene
2021-10-18 13:24:02,838 - INFO - Initialising bins
2021-10-18 13:24:02,838 - INFO - Matching and assigning contigs with single-copy marker genes to bins
2021-10-18 13:24:13,350 - INFO - Further assigning contigs with single-copy marker genes
2021-10-18 13:24:50,097 - INFO - Propagating labels to connected vertices of unlabelled long contigs
2021-10-18 13:25:30,344 - INFO - Further propagating labels to vertices of unlabelled long contigs
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11557/11557 [00:18<00:00, 613.40it/s]
2021-10-18 13:25:49,999 - INFO - Further propagating labels to connected vertices of unlabelled long contigs
2021-10-18 13:27:17,540 - INFO - Elapsed time: 299.2726888656616 seconds
2021-10-18 13:27:17,578 - INFO - Writing the Final Binning result to file
2021-10-18 13:29:05,399 - INFO - Producing 21 bins...
2021-10-18 13:29:05,399 - INFO - Final binning results can be found in metacoag_output/bins/
2021-10-18 13:29:05,400 - INFO - Thank you for using MetaCoAG!

This one looks great: metacoag_output/contig_to_bin.tsv

I need to do a deeper dive into it but it looks like of my 2 fungal bins, it only pulled out one:

(quast_env) -bash-4.2$ cat fungus_1/report.tsv
Assembly    S005_R2_POST_PE_N728_S516_1_S23__METABAT2__P.1__bin.25  bin_24_seqs
# contigs (>= 0 bp) 275 513
# contigs (>= 1000 bp)  275 513
# contigs (>= 5000 bp)  252 305
# contigs (>= 10000 bp) 202 226
# contigs (>= 25000 bp) 113 125
# contigs (>= 50000 bp) 48  50
Total length (>= 0 bp)  8408957 9646160
Total length (>= 1000 bp)   8408957 9646160
Total length (>= 5000 bp)   8322356 9136763
Total length (>= 10000 bp)  7953034 8575973
Total length (>= 25000 bp)  6438700 6903210
Total length (>= 50000 bp)  4110190 4231481
# contigs   275 513
Largest contig  219195  219195
Total length    8408957 9646160
GC (%)  52.43   52.36
N50 47207   43533
N75 26911   21585
L50 50  64
L75 108 140
# N's per 100 kbp   338.21  306.43

Here's the comparison stats just using quast. I'll follow up with this when I get some more time. Lookign forward to exploring the tool a bit more.

Vini2 commented 1 year ago

Closing the issue as the feature was implemented. Please re-open the issue if required.