iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file
Other
21 stars 7 forks source link

make_prg update PR series: 4. make_prg update #36

Closed leoisl closed 2 years ago

leoisl commented 2 years ago

Hello, Proceeding with our PR series, this is the most crucial PR, which is the one that adds the make_prg update code itself. In the following, I will describe a general view of the code to help reviewing:

Source file: make_prg/update/denovo_variants.py

Class DenovoVariantsDB

Class used to parse a denovo variants file and provide an index to easily find the updates to be done to a given locus (through member variable self.locus_name_to_update_data). It is mostly composed of methods to parse the file and build objects based on the parsed data.

Class DenovoLocusInfo

Represents all the information of a denovo locus in a denovo variants file, e.g. sample, locus name, the ML path through the locus in that sample, and all denovo variants found for that locus for that sample. It is mainly used internally by DenovoVariantsDB to get update data for a given locus.

Class UpdateData

Trivial class used to hold updates to be done in a given node of a locus.

Class DenovoVariant

Represents a denovo variant with a ref and alt, and the chromosome would be a ML path through a locus. The ref can go through a single or multiple nodes of the ML path. If it goes through a single node, we can apply the variant to it and get a new sequence (get_mutated_sequence()). If it goes through multiple nodes, we can split this variant into a list of sub-variants WRT how it is distributed along the ML path (split_variant()).

Source file: make_prg/update/MLPath.py

Class MLPath

Represents a ML path that is indexed in two spaces (path and PRG spaces) for quick queries by other objects.

Class MLPathNode

Represents nodes of the ML path.

Source file: make_prg/recursion_tree.py

This source file contains the classes required to represent the recursion tree of the PRG explicitly. We have two types of nodes in our recursion tree: MultiClusterNode and SingleClusterNode. MultiClusterNode represents a node in the recursion tree such that its respective MSA has to be clustered in two or more clusters, e.g. a non-match interval with at least two groups of different enough sequences. MultiClusterNodes are never leaves. SingleClusterNode represents nodes such that the sequences of the respective MSA constitute a single cluster (e.g. when the sequences are too close to each other). It is also used when we still don't know how we will cluster the MSA (as concrete examples, the root is always a SingleClusterNode, as we don't know how it will be clustered yet, and children of MultiClusterNodes are always SingleClusterNodes). Depending on the MSA that the SingleClusterNode represents, it can have multiple MultiClusterNodes and SingleClusterNodes children.

In general, a node in the recursion tree stores a bunch of information that would match the recursive call at that point, e.g. nesting level, the MSA itself, parent node, its children, etc. Many of this information and methods are common between MultiClusterNodes and SingleClusterNodes, so the commonalities were refactored into a single abstract class RecursiveTreeNode. Finally, in order to build the PRG from a node, it is enough to traverse it in a preorder, which is the same order of the calls in a recursive algorithm (see preorder_traversal_to_build_prg()). MultiClusterNode has some unique behaviours, e.g. it is the only type of node that can cluster, while SingleClusterNode are the only type of nodes that can be updated.

This is an example of the recursion tree of a PRG with two MultiClusterNodes (in red) and several SingleClusterNodes (in blue). Internal MSAs are not represented. Site 15 is not broken into a MultiClusterNode because it is actually composed of a single cluster (this is the case when sequences are all too different). Sites 17 and 19 are not broken into MultiClusterNodes because it has too few unique sequences (this is a recent optimisation done by @bricoletc that was ported to this code).

image

Source file: make_prg/prg_builder.py

Class PrgBuilder

PRG builder based from a multiple sequence alignment. Similar to the existing PrgBuilder class, with a few differences, most notably:

  1. It now also owns a MSA aligner (self.aligner), so that it can rebuild some alignments during the update command;
  2. It has an index on the PRG space (self.prg_index);
  3. It points to a SingleClusterNode, which is the root of the recursion tree of this locus' PRG;
  4. It can be deserialised and serialised to disk;

Class PrgBuilderZipDatabase

Represents a collection of PrgBuilders, to be saved to and loaded from a zip file.

mbhall88 commented 2 years ago

Sorry, I don't think I can provide a meaningful review of this PR. There's so much new stuff here. I trust your thorough testing on the paper data etc.

bricoletc commented 2 years ago

MultiClusterNode represents a node in the recursion tree such that its respective MSA has to be clustered in two or more clusters, e.g. a non-match interval with at least two groups of different enough sequences. MultiClusterNodes are never leaves. SingleClusterNode represents nodes such that the sequences of the respective MSA constitute a single cluster (e.g. when the sequences are too close to each other). It is also used when we still don't know how we will cluster the MSA (as concrete examples, the root is always a SingleClusterNode, as we don't know how it will be clustered yet, and children of MultiClusterNodes are always SingleClusterNodes). Depending on the MSA that the SingleClusterNode represents, it can have multiple MultiClusterNodes and SingleClusterNodes children.

It's not clear to me what the difference really is. A SingleClusterNode may get clustered. Why isn't it?

This is an example of the recursion tree of a PRG with two MultiClusterNodes (in red) and several SingleClusterNodes (in blue).

What are the numbers inside the nodes? What's the difference between node 3 and node 2, other than their class membership?

bricoletc commented 2 years ago

The initial design of from_msa was a contrast to something like from_vcf. Now source files like prg_builder.py are under make_prg, not from_msa. I think the file hierarchy needs revisiting. Either we say make_prg will only ever support building from MSA, and call the subcommands build (instead of from_msa) and update, or make build and update dirs under from_msa?

bricoletc commented 2 years ago

As Michael, I don't feel qualified to review the completely new stuff- DeNovo, MLPaths, and trust your tests, but have left comments on the parts I think I understand/want to clarify. I think the codebase has become much larger and more intricate, but that doesn't mean it could be any simpler than what you've done :')

mbhall88 commented 2 years ago

The initial design of from_msa was a contrast to something like from_vcf. Now source files like prg_builder.py are under make_prg, not from_msa. I think the file hierarchy needs revisiting. Either we say make_prg will only ever support building from MSA, and call the subcommands build (instead of from_msa) and update, or make build and update dirs under from_msa?

I agree with this too. It seems like changing from_msa to simply build is much more future-proof. If we add the ability to build from a VCF, we don't need to add another subcommand. We can just infer the file type internally.

leoisl commented 2 years ago

It's not clear to me what the difference really is. A SingleClusterNode may get clustered. Why isn't it?

Yeah, some definitions between these two classes overlap. Will try to itemise the main differences between SingleClusterNode and MultiClusterNode here:

  1. SingleClusterNode are usually not clustered. When SingleClusterNodes create children, we ensure they are never going to be reclustered. However, SingleClusterNode may get clustered if it is the root or if it is child of a MultiClusterNode;
  2. Leaves are always SingleClusterNodes, never MultiClusterNodes;
  3. Children of MultiClusterNodes are always SingleClusterNodes, while SingleClusterNodes' children can be both;
  4. If some intervals of a SingleClusterNode must be clustered, it creates MultiClusterNodes children with the associated intervals, which will then be clustered.

We could indeed apply a check when creating MultiClusterNodes children so that we create a SingleClusterNode if the associated subalignment will not get clustered, or a MultiClusterNode if it will. Then we can also create a RootClusterNode class, which is the only one that might be clustered or not. This will indeed separate well the three classes:

  1. SingleClusterNode never get clustered, they are a single cluster;
  2. MultiClusterNode always get clustered;
  3. RootClusterNode may be both;

I am willing to implement these changes, as these definitions currently exist in the code, but their implementation does not totally satisfy them. They overlap a little bit, and this causes confusion. I will however wait for your answer to check if you think this is a reasonable solution or if you propose an alternative one.

What are the numbers inside the nodes?

When a node is a leaf, we print their associated PRG string. When it is not, we just print its ID, so we can locate it in a debugging session, for example. In an ideal world, instead of the ID, I would like to print their associated MSA, but that is not doable due to space constraints (only way to do this is to actually output a HTML view or sth like that, where when you click on a non-leaf node, it opens a small dialog with the associated MSA);

What's the difference between node 3 and node 2, other than their class membership?

They represent different MSAs, node 2 is parent of node 3, and its MSA include node 3's MSAs, as well as node 7's.

leoisl commented 2 years ago

The initial design of from_msa was a contrast to something like from_vcf. Now source files like prg_builder.py are under make_prg, not from_msa. I think the file hierarchy needs revisiting. Either we say make_prg will only ever support building from MSA, and call the subcommands build (instead of from_msa) and update, or make build and update dirs under from_msa?

I agree with this too. It seems like changing from_msa to simply build is much more future-proof. If we add the ability to build from a VCF, we don't need to add another subcommand. We can just infer the file type internally.

I agree with changing from_msa to simply build and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this, as it would cause a huge number of changes for just changing a bit the directory/module structure (unsure if git will understand we just reorganised the structure, or will put the changes as large deletions/insertions)?

mbhall88 commented 2 years ago

I agree with changing from_msa to simply build and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this, as it would cause a huge number of changes for just changing a bit the directory/module structure (unsure if git will understand we just reorganised the structure, or will put the changes as large deletions/insertions)?

Fine by me

bricoletc commented 2 years ago

I agree with changing from_msa to simply build and inferring the file type. Would this be fine? Also, could I leave this to an immediate PR after merging this

Yes ofc.

SingleClusterNode never get clustered, they are a single cluster; MultiClusterNode always get clustered; RootClusterNode may be both;

Yes that's already better for me. But...i don't think you've said this, but a SingleClusterNode is the only node that partitions an MSA into intervals as well. This was a crucial difference that I did not understand at first. In that vein it's misleading to say SingleClusterNode (in your proposed new definition) are a single cluster- they represent a list of intervals of single clusters, no?

Having said all this, can you explain why MultiClusterNode needs to exist as a class? In recursion_tree.py, the token exists in three places- I'm guessing it needs to be separate somehow for the purpose of de novo var updating?

Also, purely just for naming: image Nodes 3 and 7 are really MultiIntervalNodes, no? MultiIntervalNodes can point to two types of nodes: MultiClusterNodes (red) and SingleClusterNodes, which in my new naming would be leaves. In turn, MultiClusterNodes can point to MultiIntervalNodes or SingleClusterNodes (e.g., a cluster has too few seqs, seqs are too short, max nesting level reached, etc...). For me this is cleaner naming, however perhaps it is not useful for implementing update.

leoisl commented 2 years ago

Sorry for the long delay. https://github.com/iqbal-lab-org/make_prg/pull/36/commits/5bcf9f847ed795c72e848b49021013780221e46d describes a heavy refactoring of the code we use to represent the recursion tree explicitly. This is based on a meeting between @bricoletc and me, that we realised that the code was complicated to read and understand, and thus hard to maintain and error prone. This refactoring touches mostly the classes in make_prg/recursion_tree.py. Now we have 5 classes in it:

Basically, the difference between MultiIntervalNode and MultiClusterNode is just how we build the PRG from their children. Building the PRG of a LeafNode is basically describing the sequences themselves it represents, as it has no children. Besides, LeafNode has update capabilities. All the knowledge on which node to build (i.e. how to cluster the alignment - vertically, horizontally or no cluster) is concentrated on the NodeFactory class.

leoisl commented 2 years ago

@bricoletc I couldn't re-request a review from you, so I removed and readded you as a reviewer. I hope you got a notification or sth like that, but I am pinging you here just in case...

leoisl commented 2 years ago

Sorry for the huge delay @bricoletc - I got busy with roundhound as priority. I have finished this some weeks ago, but I am just pushing it now: https://github.com/iqbal-lab-org/make_prg/pull/36/commits/4298300cd713562fd6003102ad2671a6bfa0028b . If you are busy or if I took too long to fix, feel free to approve and we move on