rhysnewell / rosella

Metagenomic Binning Algorithm
BSD 3-Clause "New" or "Revised" License

New Outputs! #51

Open aljazdzy opened 8 months ago

aljazdzy commented 8 months ago

This isn't really an issue so much as a question about outputs: I just updated to the newest version of rosella (I hadn't done so in a while) and was excited to see a bunch of new outputs I didn't have previously! These include:

- rosella_refined_0_"number"
- rosella_refined_0_single_contig_refined_0_"number"
- rosella_refined_0_unbinned
- rosella_bin_small_unbinned
- rosella_bin_unbinned

Most of them are pretty obvious as to what they are. I would think rosella_refined_0_unbinned holds unbinned contigs above a certain threshold, and "small_unbinned" contains contigs below that threshold. I would also hypothesize that "single_contig_refined" contains very large contigs that didn't cluster with much else, but that have been grouped together? (I realize it says "single contig", but when I open the files they seem to contain at least 3 very large contigs.) I'm not entirely sure what the rosella_refined_0_"number" bins are, though. Are these refined versions of the original output bins? I ran recover, but is the program also running refine?

Any clarification would be greatly appreciated, I'm excited for the extra bit of data!

rhysnewell commented 6 months ago

Hello @aljazdzy, apologies for the delayed response here; I've been on break.

I'll add some documentation in future to clarify these outputs, but yes, your deductions are all correct. The refined bins are normal output from rosella; they were produced from putative bins (which aren't included in the final output) during the initial round of refinement that rosella performs. They carry the "refined" tag just to show that they were produced in the second step and not the first.
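For anyone wanting to sort a results directory by these categories, here is a minimal Python sketch. The patterns are assumptions inferred only from the file names quoted in this thread (for example `rosella_refined_0_unbinned` and `rosella_refined_0_single_contig_refined_0_1`); they are not taken from rosella's source or documentation, and `classify_output` is a hypothetical helper name.

```python
import re

# Hypothetical patterns, inferred from file names quoted in this thread only.
# Order matters: more specific patterns are checked first.
PATTERNS = [
    ("small_unbinned", re.compile(r"^rosella_bin_small_unbinned$")),
    ("unbinned", re.compile(r"^rosella_(bin|refined_\d+)_unbinned$")),
    ("single_contig_refined",
     re.compile(r"^rosella_refined_\d+_single_contig_refined_\d+_\d+$")),
    ("refined", re.compile(r"^rosella_refined_\d+_\d+$")),
]

# FASTA extensions assumed for illustration.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def classify_output(name):
    """Return a coarse category label for one rosella output file name."""
    # Drop a FASTA extension if present, so "x.fna" and "x" classify alike.
    stem = name.rsplit(".", 1)[0] if name.endswith(FASTA_SUFFIXES) else name
    for label, pattern in PATTERNS:
        if pattern.match(stem):
            return label
    return "other"


if __name__ == "__main__":
    for example in (
        "rosella_refined_0_12.fna",
        "rosella_refined_0_single_contig_refined_0_1.fna",
        "rosella_refined_0_unbinned.fna",
        "rosella_bin_small_unbinned.fna",
        "rosella_bin_unbinned.fna",
    ):
        print(example, "->", classify_output(example))
```

Anything that does not match one of the assumed patterns falls into "other", so unexpected names are surfaced rather than silently mislabelled.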

I am a bit confused about that bin called "single_contig" that has 3 contigs in it; from reviewing the code, that shouldn't necessarily happen. Would you be able to provide some additional information about it? For example, does it look like a legitimate bin, or is it highly contaminated?

Additionally, were the bins in general of similar quality to those from previous rosella runs? This update is a fairly large refactor to address some speed issues, but I'm working on this in my spare time, so I just want to make sure I haven't missed any bugs.

Cheers, Rhys

aljazdzy commented 5 months ago

No issues, and I apologize for my delay as well! Yes, when I run those contigs through checkm2 I get output that looks like this:

```
rosella_refined_0_single_contig_refined_0_1  93.4   43.98  Gradient Boost (General Model)   11  0.89   562115   289.4317697228145   3648733  0.47  3752  None
rosella_refined_0_single_contig_refined_0_2  68.71  8.97   Gradient Boost (General Model)   11  0.903  712182   308.4401823015572   2688388  0.52  2633  None
rosella_refined_0_single_contig_refined_0_3  44.11  3.47   Gradient Boost (General Model)   11  0.868  298476   305.5108924806746   1498363  0.45  1423  None
rosella_refined_0_single_contig_refined_0_4  28.29  1.04   Neural Network (Specific Model)  11  0.916  281267   336.65242165242165  772218   0.44  702   None
rosella_refined_0_single_contig_refined_0_5  99.99  7.88   Neural Network (Specific Model)  11  0.886  2946550  303.839142948513    3206005  0.47  3127  None
rosella_refined_0_single_contig_refined_0_6  17.88  0.17   Neural Network (Specific Model)  11  0.89   254714   325.07439824945294  499452   0.43  457   None
rosella_refined_0_single_contig_refined_0_7  11.84  0.03   Neural Network (Specific Model)  11  0.844  298451   308.6630036630037   298451   0.48  273   None
```

So some of them are quite decent, but others have some significant issues. I would like to say the bins were in general of higher quality than in previous runs, but I'll admit I don't have a quantitative analysis of that quite yet. My bins in general haven't been of great quality, but that has more to do with my data. I'm hoping to change that soon though.
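As a rough way to put numbers on "quite decent" versus "significant issues", the sketch below sorts the seven bins from the CheckM2 output above into tiers. It assumes the second and third columns are completeness and contamination (in percent), and it uses the commonly cited MIMAG-style cutoffs (over 90% complete with under 5% contamination for high quality, at least 50% complete with under 10% contamination for medium quality). Those cutoffs are an assumption brought in for illustration, not something rosella or this thread specifies, and the rRNA/tRNA criteria of the full standard are ignored. `quality_tier` is a hypothetical helper name.

```python
# (bin name, completeness %, contamination %) copied from the CheckM2 rows above.
BINS = [
    ("rosella_refined_0_single_contig_refined_0_1", 93.4, 43.98),
    ("rosella_refined_0_single_contig_refined_0_2", 68.71, 8.97),
    ("rosella_refined_0_single_contig_refined_0_3", 44.11, 3.47),
    ("rosella_refined_0_single_contig_refined_0_4", 28.29, 1.04),
    ("rosella_refined_0_single_contig_refined_0_5", 99.99, 7.88),
    ("rosella_refined_0_single_contig_refined_0_6", 17.88, 0.17),
    ("rosella_refined_0_single_contig_refined_0_7", 11.84, 0.03),
]


def quality_tier(completeness, contamination):
    """Assign a coarse tier using assumed MIMAG-style cutoffs."""
    if contamination >= 10:
        return "contaminated"  # fails the contamination cutoff outright
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    for name, completeness, contamination in BINS:
        print(f"{name}\t{quality_tier(completeness, contamination)}")
```

Under these assumed cutoffs, bin 1 is flagged as contaminated despite its high completeness, bins 2 and 5 come out as medium quality, and the remaining four are low quality.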