metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Refining MAGs #324

Closed botellaflotante closed 3 years ago

botellaflotante commented 3 years ago

Is it possible to run RefineM (https://github.com/dparks1134/RefineM) directly on ATLAS output? Which files should I use as the scaffolds, bins, and BAM files? Do they have matching headers so that this tool can be used?

Thanks

SilasK commented 3 years ago

Cool that you point this out.

To get started, the command

refinem scaffold_stats -c 16 <scaffold_file> <bin_dir> <stats_output_dir> <bam_files>

would translate to

refinem scaffold_stats -c 16 {sample}/{sample}_contigs.fasta {sample}/{final_binner}/bins {your choice} {sample}/sequence_alignment/*.bam

I suggest using metabat as the final binner, so you don't need to pass the binning through DAS Tool and then RefineM.

I'm eager to know if it improves your bins. Maybe I can add it to atlas if it is convincing.
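For several samples, the command above can be assembled programmatically. The following is only a minimal sketch: the function names, the default output directory `refinem_stats`, and the `workdir` parameter are made up for illustration, and the paths simply follow the layout written in this comment, so adjust them if your ATLAS working directory looks different. The `samtools index` helper is included because a later comment in this thread reports that the BAM files had to be indexed first.

```python
from pathlib import Path


def find_bams(sample, workdir="."):
    """Return the sorted BAM files under {sample}/sequence_alignment/."""
    return sorted((Path(workdir) / sample / "sequence_alignment").glob("*.bam"))


def samtools_index_cmds(bams):
    """One 'samtools index <bam>' command per BAM file."""
    return [["samtools", "index", str(bam)] for bam in bams]


def refinem_scaffold_stats_cmd(sample, final_binner="metabat",
                               stats_dir="refinem_stats", threads=16,
                               workdir="."):
    """Build the 'refinem scaffold_stats' command for one sample.

    Paths follow the layout given in the thread (an assumption about
    your directory structure, not a guaranteed ATLAS convention).
    """
    sample_dir = Path(workdir) / sample
    cmd = [
        "refinem", "scaffold_stats", "-c", str(threads),
        str(sample_dir / f"{sample}_contigs.fasta"),   # <scaffold_file>
        str(sample_dir / final_binner / "bins"),       # <bin_dir>
        str(stats_dir),                                # <stats_output_dir>
    ]
    # <bam_files>: every BAM found for this sample
    cmd += [str(bam) for bam in find_bams(sample, workdir)]
    return cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once you have checked that the paths exist.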

botellaflotante commented 3 years ago

OK, great! It worked after first running samtools index on the .bam files. But then I would exclude the maxbin output, right? Can I ask why it is better to use metabat instead of the DAS Tool bins? Is it just to save a step, or is there another reason? And if it improves the bins, is it possible to easily use them again in atlas to get the final MAGs as usual?

thanks a lot

SilasK commented 3 years ago

But then I would exclude the maxbin output, right? Can I ask why it is better to use metabat instead of the DAS Tool bins? Is it just to save a step, or is there another reason?

No, that is more or less the reason. I like to know which tool is doing what.

I think the best way is to create a folder {sample}/binning/refinem/ and then create a file in it called cluster_attribution.tsv that maps contigs to bins.

Here is the code to create such a file if you have only the fasta files of the bins.
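In case the linked code is not at hand, here is a minimal sketch of the idea using only the Python standard library. It is not the author's script. It assumes that cluster_attribution.tsv is a headerless, tab-separated file with the contig name in the first column and the bin name in the second, and that each bin is one FASTA file whose file name (minus the extension) is the bin name. The function name and the `extension` parameter are made up for illustration, so check both assumptions against your ATLAS version before using it.

```python
from pathlib import Path


def write_cluster_attribution(bin_dir, out_file, extension=".fasta"):
    """Map every contig in every bin FASTA to its bin.

    Writes one 'contig<TAB>bin' line per contig (no header; the file
    format is an assumption) and returns the (contig, bin) pairs.
    """
    rows = []
    for fasta in sorted(Path(bin_dir).glob(f"*{extension}")):
        bin_id = fasta.name[: -len(extension)]
        with open(fasta) as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    if fields:
                        # contig name = header up to the first whitespace
                        rows.append((fields[0], bin_id))

    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with open(out_file, "w") as out:
        for contig, bin_id in rows:
            out.write(f"{contig}\t{bin_id}\n")
    return rows
```

For example, `write_cluster_attribution("refined_bins", "sample1/binning/refinem/cluster_attribution.tsv", extension=".fna")` would scan a (hypothetical) directory `refined_bins` for `.fna` files.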

Then, if you set final_binner: refinem, atlas should take this file, run CheckM, and continue the pipeline.

If you want, and if you have many samples, I can help you implement this in the atlas Snakemake workflow, e.g. so that RefineM is run as part of Atlas. But maybe it is worth testing first whether there is an improvement.

SilasK commented 3 years ago

Were you able to solve the cyclic dependency problem? I encountered it too; I should fix it in an update of Atlas.

botellaflotante commented 3 years ago

Yes, I had erased an important directory (genomes) needed for the binning step; I think that was the problem. Now I am just changing the config file to use metabat as the final binner and repeating the run without removing anything. I want to run some other samples to be sure whether there is a real improvement with refinem. I just got this for one sample (before refinem: red, after refinem: blue)... [image: refinem]

SilasK commented 3 years ago

Let's say bins with contamination >10-20% or completeness < 50% are uninteresting.

Then you have two interesting bins before and three after RefineM, right? However, your two best bins lose 10% completeness.
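The cutoff described here can be written as a small filter for comparing bin sets before and after refinement. This is only an illustrative sketch: the function name is made up, the stricter end of the 10-20% contamination range is used as the default, and the completeness/contamination values below are hypothetical numbers chosen to mirror the situation described, not the values from the plot.

```python
def interesting_bins(bins, min_completeness=50.0, max_contamination=10.0):
    """Keep bins with completeness >= min and contamination <= max.

    `bins` maps a bin name to (completeness %, contamination %).
    """
    return {
        name: (completeness, contamination)
        for name, (completeness, contamination) in bins.items()
        if completeness >= min_completeness
        and contamination <= max_contamination
    }


# Hypothetical values, for illustration only (not the real results)
before = {"bin1": (95.0, 3.0), "bin2": (88.0, 6.0), "bin3": (70.0, 25.0)}
after = {"bin1": (85.0, 2.0), "bin2": (78.0, 4.0), "bin3": (65.0, 8.0)}

print(len(interesting_bins(before)), len(interesting_bins(after)))
```

Counting alone hides the trade-off mentioned above, so it is worth comparing the completeness of each bin before and after as well.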

botellaflotante commented 3 years ago

Yep, I don't like it either. I will check some other samples and show you. I will also try the phylogeny option, because this run used the tetranucleotide and coverage option...

SilasK commented 3 years ago

I had similar experiences with magpurify. Maybe you also want to try this tool: https://github.com/dib-lab/charcoal/tree/latest/doc

botellaflotante commented 3 years ago

I was able to run refineM in phylogeny mode after correcting a Python bug it had. So, in order to use refinem as the final binner, do I only need to change this in the config file and then run "atlas run all", or should I remove something first? Let's see if it improves or not. I tried charcoal but could not get it running...

SilasK commented 3 years ago

Cool. Above I explained how to integrate a new binner into atlas. Then you can run 'atlas run all' or 'atlas run binning'; both produce a binning report in the reports directory. I'm looking forward to seeing whether it helps.

SilasK commented 3 years ago

I'm just pinging @ctb to say that you tried charcoal without success.

botellaflotante commented 3 years ago

I'm sending you some ugly plots for 5 samples, comparing completeness and contamination before (red) and after (blue/green) refinem in taxonomy mode. It improves a little in general, but I would say that contamination is mostly from very similar strains with similar TNF and coverages... I guess this must be THE huge problem in genome reconstruction from metagenomes... right?

link to plots:

https://drive.google.com/file/d/1pGmEKG2Q01ayoT_TOU_QCUUwTd2xMl_p/view?usp=sharing

SilasK commented 3 years ago

Thank you very much for your results. Would you also say the results don't seem convincing?

I don't know if one can say in general that there is strain contamination. But yes, the more similar the TNF and abundance are, the more complicated it is to bin genomes correctly.

There is a field in the checkM results that states if contamination is expected from a similar strain or not.

ctb commented 3 years ago

first, yes, sorry, charcoal is in a broken state at the moment :(

I would say that contamination is mostly from very similar strains with similar TNF and coverages... I guess this must be THE huge problem in genome reconstruction from metagenomes... right?

It's definitely one of them :). You would certainly expect this to be a problem in every MAG workflow I have seen, based on the way assembly and binning work. I think it's unresolvable in that sense, without doing something quite different in the graph (see e.g. https://github.com/chrisquince/STRONG for a promising approach that could be applied to contamination).

But we also see a surprising amount of cross-everything contamination in large MAG data sets, e.g. see some charcoal output here. I think that's a big practical problem because the databases are getting contaminated with wildly divergent taxonomic classifications...

SilasK commented 3 years ago

Thank you for your comment. Do you think coassembly can really help to disentangle the genomes?

ctb commented 3 years ago

this is territory where I only have the vaguest of data, so mainly based on intuition, but - the information is in the reads, and we should be able to disentangle it! I don't think we can rely on co-assembly the way it's currently done tho.

(for some data on this with our tools, see this comment, where we can clearly see multiple peaks for two different strain variants in the read abundance data from a single sample; with multiple samples, colored De Bruijn graphs should be able to disentangle the strains with some reasonable precision, and I think STRONG is a promising step towards actually showing that can work, albeit in a ~reference-based way.)

(I'm in no way claiming that our tools are special here, it's just a figure I had ready to link :)

(Also, I am not an author on STRONG!)