Closed leoisl closed 2 years ago
Sorry, I don't think I can provide a meaningful review of this PR. There's so much new stuff here. I trust your thorough testing on the paper data etc.
MultiClusterNode represents a node in the recursion tree such that its respective MSA has to be clustered in two or more clusters, e.g. a non-match interval with at least two groups of different enough sequences. MultiClusterNodes are never leaves. SingleClusterNode represents nodes such that the sequences of the respective MSA constitute a single cluster (e.g. when the sequences are too close to each other). It is also used when we still don't know how we will cluster the MSA (as concrete examples, the root is always a SingleClusterNode, as we don't know how it will be clustered yet, and children of MultiClusterNodes are always SingleClusterNodes). Depending on the MSA that the SingleClusterNode represents, it can have multiple MultiClusterNodes and SingleClusterNodes children.
It's not clear to me what the difference really is. A SingleClusterNode
may get clustered. Why isn't it?
This is an example of the recursion tree of a PRG with two MultiClusterNodes (in red) and several SingleClusterNodes (in blue).
What are the numbers inside the nodes? What's the difference between node 3 and node 2, other than their class membership?
The initial design of from_msa
was a contrast to something like from_vcf
. Now source files like prg_builder.py are under make_prg
, not from_msa
. I think the file hierarchy needs revisiting. Either we say make_prg
will only ever support building from MSA, and call the subcommands build
(instead of from_msa
) and update
, or make build and update dirs under from_msa
?
As Michael, I don't feel qualified to review the completely new stuff- DeNovo, MLPaths, and trust your tests, but have left comments on the parts I think I understand/want to clarify. I think the codebase has become much larger and more intricate, but that doesn't mean it could be any simpler than what you've done :')
The initial design of
from_msa
was a contrast to something likefrom_vcf
. Now source files like prg_builder.py are undermake_prg
, notfrom_msa
. I think the file hierarchy needs revisiting. Either we saymake_prg
will only ever support building from MSA, and call the subcommandsbuild
(instead offrom_msa
) andupdate
, or make build and update dirs underfrom_msa
?
I agree with this too. It seems like changing from_msa
to simply build
is much more future-proof. If we add the ability to build from a VCF, we don't need to add another subcommand. We can just infer the file type internally.
It's not clear to me what the difference really is. A
SingleClusterNode
may get clustered. Why isn't it?
Yeah, some definitions between these two classes overlap. Will try to itemise the main differences between SingleClusterNode
and MultiClusterNode
here:
SingleClusterNode
are usually not clustered. When SingleClusterNode
s create children, we ensure they are never going to be reclustered. However, SingleClusterNode
may get clustered if it is the root or if it is child of a MultiClusterNode
;SingleClusterNode
s, never MultiClusterNode
s;MultiClusterNode
s are always SingleClusterNode
s, while SingleClusterNode
s' children can be both;SingleClusterNode
must be clustered, it creates MultiClusterNode
s children with the associated intervals, which will then be clustered.We could indeed apply a check when creating MultiClusterNode
s children so that we create a SingleClusterNode
if the associated subalignment will not get clustered, or a MultiClusterNode
if it will. Then we can also create a RootClusterNode
class, which is the only one that might be clustered or not. This will indeed separate well the three classes:
SingleClusterNode
never get clustered, they are a single cluster;MultiClusterNode
always get clustered;RootClusterNode
may be both;I am willing to implement these changes, as these definitions currently exist in the code, but their implementation does not totally satisfy them. They overlap a little bit, and this causes confusion. I will however wait for your answer to check if you think this is a reasonable solution or if you propose an alternative one.
What are the numbers inside the nodes?
When a node is a leaf, we print their associated PRG string. When it is not, we just print its ID, so we can locate it in a debugging session, for example. In an ideal world, instead of the ID, I would like to print their associated MSA, but that is not doable due to space constraints (only way to do this is to actually output a HTML view or sth like that, where when you click on a non-leaf node, it opens a small dialog with the associated MSA);
What's the difference between node 3 and node 2, other than their class membership?
They represent different MSAs, node 2
is parent of node 3
, and its MSA include node 3
's MSAs, as well as node 7
's.
The initial design of
from_msa
was a contrast to something likefrom_vcf
. Now source files like prg_builder.py are undermake_prg
, notfrom_msa
. I think the file hierarchy needs revisiting. Either we saymake_prg
will only ever support building from MSA, and call the subcommandsbuild
(instead offrom_msa
) andupdate
, or make build and update dirs underfrom_msa
?I agree with this too. It seems like changing
from_msa
to simplybuild
is much more future-proof. If we add the ability to build from a VCF, we don't need to add another subcommand. We can just infer the file type internally.
I agree with changing from_msa
to simply build
and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this, as it would cause a huge number of changes for just changing a bit the directory/module structure (unsure if git will understand we just reorganised the structure, or will put the changes as large deletions/insertions)?
I agree with changing from_msa to simply build and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this, as it would cause a huge number of changes for just changing a bit the directory/module structure (unsure if git will understand we just reorganised the structure, or will put the changes as large deletions/insertions)?
Fine by me
I agree with changing from_msa to simply build and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this
Yes ofc.
SingleClusterNode never get clustered, they are a single cluster; MultiClusterNode always get clustered; RootClusterNode may be both;
Yes that's already better for me.
But...i don't think you've said this, but a SingleClusterNode
is the only node that partitions an MSA into intervals as well. This was a crucial difference that I did not understand at first.
In that vein it's misleading to say SingleClusterNode
(in your proposed new definition) are a single cluster- they represent a list of intervals of single clusters, no?
Having said all this, can you explain why MultiClusterNode
needs to exist as a class? In recursion_tree.py
, the token exists in three places- I'm guessing it needs to be separate somehow for the purpose of de novo var updating?
Also, purely just for naming:
Nodes 3 and 7 are really MultiIntervalNode
s, no? MultiIntervalNode
s can point to two types of nodes: MultiClusterNode
s (red) and SingleClusterNode
s, which in my new naming would be leaves. In turn, MultiClusterNode
s can point to MultiIntervalNode
s or SingleClusterNode
s (e.g., a cluster has too few seqs, seqs are too short, max nesting level reached, etc...). For me this is cleaner naming, however perhaps it is not useful for implementing update
.
Sorry for the long delay. https://github.com/iqbal-lab-org/make_prg/pull/36/commits/5bcf9f847ed795c72e848b49021013780221e46d describes a heavy refactoring of the code we use to represent the recursion tree explicitly. This is based on a meeting between @bricoletc and me, that we realised that the code was complicated to read and understand, and thus hard to maintain and error prone. This refactoring touches mostly the classes in make_prg/recursion_tree.py
. Now we have 5 classes in it:
RecursiveTreeNode
: this is an abstract base class for the nodes of the recursion tree, abstracting its common attributes and methods;MultiIntervalNode
: Represent nodes that are clustered vertically, i.e. in intervals;MultiClusterNode
: Represent nodes that are clustered horizontally, i.e. in sequence clusters;LeafNode
: Represent nodes that are never clustered in either direction. These nodes are the only ones that can get updated and indexed;NodeFactory
: Class responsible to build the different RecursiveTreeNode
s depending on its alignment;Basically, the difference between MultiIntervalNode
and MultiClusterNode
is just how we build the PRG from their children. Building the PRG of a LeafNode
is basically describing the sequences themselves it represents, as it has no children. Besides, LeafNode
has update capabilities. All the knowledge on which node to build (i.e. how to cluster the alignment - vertically, horizontally or no cluster) is concentrated on the NodeFactory
class.
@bricoletc I couldn't re-request a review from you, so I removed and readded you as a reviewer. I hope you got a notification or sth like that, but I am pinging you here just in case...
Sorry for the huge delay @bricoletc - I got busy with roundhound as priority. I have finished this some weeks ago, but I am just pushing it now: https://github.com/iqbal-lab-org/make_prg/pull/36/commits/4298300cd713562fd6003102ad2671a6bfa0028b . If you are busy or if I took too long to fix, feel free to approve and we move on
Hello, Proceeding with our PR series, this is the most crucial PR, which is the one that adds the
make_prg update
code itself. In the following, I will describe a general view of the code to help reviewing:Source file:
make_prg/update/denovo_variants.py
Class
DenovoVariantsDB
Class used to parse a denovo variants file and provide an index to easily find the updates to be done to a given locus (through member variable
self.locus_name_to_update_data
). It is mostly composed of methods to parse the file and build objects based on the parsed data.Class
DenovoLocusInfo
Represents all the information of a denovo locus in a denovo variants file, e.g. sample, locus name, the ML path through the locus in that sample, and all denovo variants found for that locus for that sample. It is mainly used internally by
DenovoVariantsDB
to get update data for a given locus.Class
UpdateData
Trivial class used to hold updates to be done in a given node of a locus.
Class
DenovoVariant
Represents a denovo variant with a ref and alt, and the chromosome would be a ML path through a locus. The ref can go through a single or multiple nodes of the ML path. If it goes through a single node, we can apply the variant to it and get a new sequence (
get_mutated_sequence()
). If it goes through multiple nodes, we can split this variant into a list of sub-variants WRT how it is distributed along the ML path (split_variant()
).Source file:
make_prg/update/MLPath.py
Class
MLPath
Represents a ML path that is indexed in two spaces (path and PRG spaces) for quick queries by other objects.
Class
MLPathNode
Represents nodes of the ML path.
Source file:
make_prg/recursion_tree.py
This source file contains the classes required to represent the recursion tree of the PRG explicitly. We have two types of nodes in our recursion tree:
MultiClusterNode
andSingleClusterNode
.MultiClusterNode
represents a node in the recursion tree such that its respective MSA has to be clustered in two or more clusters, e.g. a non-match interval with at least two groups of different enough sequences.MultiClusterNode
s are never leaves.SingleClusterNode
represents nodes such that the sequences of the respective MSA constitute a single cluster (e.g. when the sequences are too close to each other). It is also used when we still don't know how we will cluster the MSA (as concrete examples, the root is always aSingleClusterNode
, as we don't know how it will be clustered yet, and children ofMultiClusterNode
s are alwaysSingleClusterNode
s). Depending on the MSA that theSingleClusterNode
represents, it can have multipleMultiClusterNode
s andSingleClusterNode
s children.In general, a node in the recursion tree stores a bunch of information that would match the recursive call at that point, e.g. nesting level, the MSA itself, parent node, its children, etc. Many of this information and methods are common between
MultiClusterNode
s andSingleClusterNode
s, so the commonalities were refactored into a single abstract classRecursiveTreeNode
. Finally, in order to build the PRG from a node, it is enough to traverse it in a preorder, which is the same order of the calls in a recursive algorithm (seepreorder_traversal_to_build_prg()
).MultiClusterNode
has some unique behaviours, e.g. it is the only type of node that can cluster, whileSingleClusterNode
are the only type of nodes that can be updated.This is an example of the recursion tree of a PRG with two
MultiClusterNode
s (in red) and severalSingleClusterNode
s (in blue). Internal MSAs are not represented. Site 15 is not broken into aMultiClusterNode
because it is actually composed of a single cluster (this is the case when sequences are all too different). Sites 17 and 19 are not broken intoMultiClusterNode
s because it has too few unique sequences (this is a recent optimisation done by @bricoletc that was ported to this code).Source file:
make_prg/prg_builder.py
Class
PrgBuilder
PRG builder based from a multiple sequence alignment. Similar to the existing
PrgBuilder
class, with a few differences, most notably:self.aligner
), so that it can rebuild some alignments during the update command;self.prg_index
);SingleClusterNode
, which is the root of the recursion tree of this locus' PRG;Class
PrgBuilderZipDatabase
Represents a collection of
PrgBuilder
s, to be saved to and loaded from a zip file.