Closed: didillysquat closed this issue 3 years ago
Debugging through the code:
First, the root node is never decomposed, because the following check evaluates to True immediately (i.e. before any iteration of decomposition):
if node.competing_unique_sequences_ratio < 0.0005 or node.density > 0.85:
    # Finalize this node.
    self.logger.info('finalize node (CUSR/ND): %s' % node_id)
    continue
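For what it's worth, assuming node.density is the frequency of the node's most abundant unique sequence divided by the node size (my reading of the code, which I haven't verified), the numbers in my dataset would trip the second condition straight away:

# Back-of-the-envelope check, assuming
# density = frequency of most abundant unique sequence / node size.
most_abundant = 2641   # copies of the dominant unique sequence in my fasta
node_size = 3045       # total reads in the root node
density = most_abundant / node_size
print(round(density, 3), density > 0.85)   # 0.867 True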
Then, during topology refinement, three zombie nodes are formed because they differ from the root node (these end up being the only three final nodes).
Then, while 'updating' the topology, the root node is lost, but the three zombie nodes are kept as the final nodes due to the following comprehension:
final_nodes_tpls = [(self.nodes[n].size, n) for n in self.alive_nodes if not self.nodes[n].children]
I can see why it would normally be fine to discard nodes with children: those nodes should have been decomposed, so only their children need to be kept. In this case, however, the root node was never decomposed (for whatever reason; lack of entropy) and so should surely also be considered a final node. As such, could the statement:
final_nodes_tpls = [(self.nodes[n].size, n) for n in self.alive_nodes if not self.nodes[n].children]
be modified to:
# min_substantive_abundance has been passed into this function from the Decomposer class.
# I'm assuming here that the 'reads' list of a node is sorted by abundance.
final_nodes_tpls = [(self.nodes[n].size, n)
                    for n in self.alive_nodes
                    if not self.nodes[n].children
                    or self.nodes[n].reads[0].frequency > min_substantive_abundance]
Here, we would keep a node with children if it still meets the min_substantive_abundance criterion.
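And in case that sorted-by-abundance assumption doesn't hold, the same check could be written with max() instead (still assuming each read object exposes a frequency attribute):

# Variant that does not rely on node.reads being sorted by abundance.
final_nodes_tpls = [(self.nodes[n].size, n)
                    for n in self.alive_nodes
                    if not self.nodes[n].children
                    or max(r.frequency for r in self.nodes[n].reads) > min_substantive_abundance]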
Does this work?
Thanks.
Hi @didillysquat,
Apologies for the late response. As you can see from the codebase, it has been a long time since I last changed the code or made any improvements to MED.
I apologize for this bug, which seems to be relevant only when the root node is not decomposed at all, a rare case I clearly didn't take into consideration. Thank you for looking into it and even finding a solution!
I think your solution should only apply to the root node, in cases where decomposition never reached level 1. If it is working for you, and if it is not changing the results from your previous analyses, I'd be happy to include it in the repo if you send a PR, as you are very likely more on top of MED than I am at the moment :)
Best wishes,
Hi @meren,
Thanks for your reply!
Sorry to be dragging up the past, but I still use MED regularly.
I'll implement a PR later this week based on your suggestion (i.e. apply the fix only to the root node, explicitly checking the decomposition level).
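Roughly, I'm thinking of something along these lines. This is a standalone sketch with dummy stand-ins; the attribute names (level, reads, children) are my reading of the Topology class and may differ from the real ones:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Read:
    frequency: int

@dataclass
class Node:
    size: int
    level: int                                  # decomposition level; 0 for the root
    children: List[str] = field(default_factory=list)
    reads: List[Read] = field(default_factory=list)

def select_final_nodes(nodes: Dict[str, Node],
                       alive_nodes: List[str],
                       min_substantive_abundance: int) -> List[Tuple[int, str]]:
    """Keep childless nodes as before, but also keep the root when it was
    never decomposed (decomposition never reached level 1) and its most
    abundant read still clears min_substantive_abundance."""
    final_nodes_tpls = []
    for n in alive_nodes:
        node = nodes[n]
        undecomposed_root = (
            node.level == 0
            and node.reads
            and max(r.frequency for r in node.reads) > min_substantive_abundance
        )
        if not node.children or undecomposed_root:
            final_nodes_tpls.append((node.size, n))
    return final_nodes_tpls

# Toy reproduction of the bug: the root was finalized before any decomposition,
# but a zombie child created during refinement would otherwise hide it.
nodes = {
    'root': Node(size=3045, level=0, children=['zombie'],
                 reads=[Read(2641), Read(12)]),
    'zombie': Node(size=70, level=1, reads=[Read(70)]),
}
print(select_final_nodes(nodes, ['root', 'zombie'], min_substantive_abundance=4))
# -> [(3045, 'root'), (70, 'zombie')]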
Cheers,
Ben
Hi Meren et al.
Firstly, thanks for your work with MED. I've been using it for a couple of years now with SymPortal (symportal.org). In particular, I find that it keeps sequences that I know to be biologically genuine but that are thrown out by DADA2.
However, I've recently noticed an issue where I'm losing a large proportion of sequences during decomposition. E.g. I start with 3045 sequences (54 unique sequences, one of which is found 2641 times), but end up with about 70 (6 unique sequences) after decomposition. Looking at the distribution of the sequences in the input fasta, I can't explain why this would be happening. I would at least expect that abundant sequence to be returned at a high abundance.
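For reference, this is roughly how I'm checking that distribution ('input.fasta' stands in for the attached file below):

from collections import Counter

# Tally unique-sequence abundances in the padded input fasta.
seqs = []
with open('input.fasta') as fh:
    current = []
    for line in fh:
        if line.startswith('>'):
            if current:
                seqs.append(''.join(current))
                current = []
        else:
            current.append(line.strip())
    if current:
        seqs.append(''.join(current))

counts = Counter(seqs)
print(len(seqs), 'reads;', len(counts), 'unique sequences')
print('most abundant unique sequence found', counts.most_common(1)[0][1], 'times')
# For the file below, this reports 3045 reads, 54 unique sequences,
# and a most abundant count of 2641.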
I run a version of MED in the SymPortal framework that I ported to Python 3 back when you were still running with Python 2, but I have since tried the decomposition with your latest version from GitHub and get exactly the same result. I am running Python 3.8, so I have to turn off multi-threading, as the MED code is not compatible with higher Python versions (I had the same pickle issues myself when I upgraded SymPortal to run on 3.8).
I would greatly appreciate it if you could explain to me, according to the MED logic, why I'm losing these sequences, and whether there's a parameter that would prevent this from occurring.
The fasta I am decomposing (after padding).
Command run:
STDOUT: