merenlab / oligotyping

Exploring microbial patterns through subtle nucleotide variation within 16S rRNA gene tag sequences of closely related taxa
GNU General Public License v2.0
40 stars 22 forks source link

Unexpected loss of sequences during decomposition #34

Closed didillysquat closed 3 years ago

didillysquat commented 3 years ago

Hi Meren et al.

Firstly, thanks for your work with MED. I've been using it for a couple of years now with SymPortal (symportal.org). In particular I find that it keeps sequeneces that I know to be biologically genuine but that are thrown out by DADA2.

However, I've recently noticed an issue where I'm losing a large proportion of sequences during decomposition. E.g. I start with 3045 sequences (54 unique sequences; one of which is found 2641 times), but end up with about 70 (6 unique sequences) after decomposition. Looking at the distribution of the sequences inside the input fasta I can't explain why this would be happening. I would at least expect that abundant sequence to be returned at a high abundance.

I run a version of MED in the SymPortal framework that I ported to python3 back when you were still running with python2, but have since tried the decomposition with your latest version from github and get exactly the same result. I am running with python 3.8 so have to turn off the multi-threading as the med code is not compatible with higher python versions (I had the same pickle issues myself when I upgraded SymPortal to run on 3.8).

I would be greatly appreciative if you could explain to me, according to the MED logic, why I'm losing the sequences? And, if there's a parameter that would prevent this from occurring.

The fasta I am decomposing (after padding).

Command run:

./decompose -M 12 -T --skip-gen-html --skip-gen-figures --skip-check-input-file -o ../../med_test/results/  ../../med_test/seqs_for_med_MesoF_ID31_clade_C.redundant.padded.fasta

STDOUT:

Project ..........................................................: seqs_for_med_MesoF_ID31_clade_C
Run date .........................................................: 06 Mar 21 06:58:18
Library version ..................................................: 3-master
Command line .....................................................: ./decompose -m 12 -T --skip-gen-html --skip-gen-figures --skip-check-input-file -o ../../med_test/results/ ../../med_test/seqs_for_med_MesoF_ID31_clade_C.redundant.padded.fasta
Multi-threaded ...................................................: False
Extraction info output file ......................................: ../../med_test/results/RUNINFO
Log file path ....................................................: ../../med_test/results/RUNINFO.log
Input file .......................................................: ../../med_test/seqs_for_med_MesoF_ID31_clade_C.redundant.padded.fasta
Mapping file .....................................................: None
Quick (and dirty) analysis requested .............................: False
Merge homopolymer splits .........................................: False
Skip removing outliers ...........................................: False
Try to relocate outliers .........................................: False
Store topology dict ..............................................: False
Skip generating figures post analysis ............................: True
Min entropy for a component to be picked for decomposition .......: 12.0
Perform entropy normalization heuristics .........................: True
Max number of discriminants to use for decomposition .............: 4
Min total abundance of oligotype in all samples ..................: 0
Min substantive abundance of an oligotype (-M) ...................: 4                                                                                                                               
Maximum variation allowed in each node (-V) ......................: 3
Number of sequences analyzed .....................................: 3,045
Average read length (without gaps) ...............................: 263
Number of characters in each alignment ...........................: 267
Output directory .................................................: ../../med_test/results/
Directory to store final nodes ...................................: ../../med_test/results/NODES
Temporary files directory ........................................: ../../med_test/results/TMP
Figures directory ................................................: ../../med_test/results/FIGURES
Number of raw nodes (before the refinement) ......................: 1                                                                                                                               
Outliers removed due to -V .......................................: 30                                                                                                                              
Total number of outliers removed during the refinement ...........: 30
Number of samples found ..........................................: 1                                                                                                                               
Number of final nodes (after the refinement) .....................: 6
Number of sequences represented after quality filtering ..........: 73
Final number of outliers due to -V ...............................: 30
Final total number of outliers ...................................: 30
Environment file .................................................: ../../med_test/results/ENVIRONMENT.txt                                                                                          
Sample/oligotype abundance data matrix (counts) ..................: ../../med_test/results/MATRIX-COUNT.txt                                                                                         
Sample/oligotype abundance data matrix (percents) ................: ../../med_test/results/MATRIX-PERCENT.txt
Basic topology of MED nodes (txt) ................................: ../../med_test/results/TOPOLOGY.txt                                                                                             
Basic topology of MED nodes ......................................: ../../med_test/results/TOPOLOGY.gexf
Representative sequences per node ................................: ../../med_test/results/NODE-REPRESENTATIVES.fasta                                                                               
Read distribution among samples table ............................: ../../med_test/results/READ-DISTRIBUTION.txt                                                                                    
End of run .......................................................: 06 Mar 21 06:58:20
GEXF file for network analysis ...................................: ../../med_test/results/NETWORK.gexf
didillysquat commented 3 years ago

Debugging through the code:

The root node is firstly not decomposed because the following evaluates to True immediately (i.e. before any iterations of decomposition):

if node.competing_unique_sequences_ratio < 0.0005 or node.density > 0.85:
    # Finalize this node.
    self.logger.info('finalize node (CUSR/ND): %s' % node_id)
    continue

Then in topology refinement, three zombie nodes are formed due to being different from the root node (these end up as the only three final nodes).

Then while 'updating' the topology, the root node is lost, but the three zombie nodes are kept as the final node due to the following comprehension:

final_nodes_tpls = [(self.nodes[n].size, n) for n in self.alive_nodes if not self.nodes[n].children]

I can see why it would normally be OK to discard the nodes with children as these should have been decomposed and you would only need to keep the children. However, in this case, the root node was never decomposed (for whatever reason; lack of entropy) and so should surely also be considered as a final node. As such could the statement:

final_nodes_tpls = [(self.nodes[n].size, n) for n in self.alive_nodes if not self.nodes[n].children]

be modified to:

# min_substantive_abundance has been passed in to the function from the Decomposer class
# I'm assuming here that the 'reads' list of a node is sorted by abundance
final_nodes_tpls = [(self.nodes[n].size, n) for n in self.alive_nodes if not self.nodes[n].children or self.nodes[n].reads[0].frequency > min_substantive_abundance]

Here, we would keep a node with children if it still meets the min_substantive_abundance abundance criteria.

Does this work?

Thanks.

meren commented 3 years ago

Hi @didillysquat,

Apologies for the late response. As you can see from the codebase, it has been a long time since the last time I changed the code and/or made any improvements to MED.

I apologize for this bug, which seems to be relevant only if the root node is not decomposed at all, a rare case I clearly didn't take into consideration, and thank you for looking into it and finding a solution even!

I think your solution should only apply to the root node, in cases where decomposition never reached to level 1. If it is working for you, and if it is not changing the results from your previous analyses, I'd be happy to include it in the repo if you send a PR as you are very likely more on top of MED than I am at the moment :)

Best wishes,

didillysquat commented 3 years ago

Hi @meren,

Thanks for your reply!

Sorry to be dragging up the past but I still use MED regularly.

I'll implement a PR later this week based on your suggestion (i.e. only root node, and explicitly checking for the decomposition level).

Cheers,

Ben