After investigating some more, I don't think this is due to the dereplication process itself. It seems that the cluster representative (which I got from the family column of the projection files) changes from execution to execution, even when I'm using --no_defrag. Maybe the way MMseqs2 picks a cluster representative is not deterministic?
Hi,
The "defragmentation" algorithm is deterministic: given an identical clustering, it should provide the same results, and if it doesn't, that's a bug. The cluster representative, though, is indeed not deterministic; we've observed the same thing over time when using MMseqs2.
Normally, the representative ID can change between runs, but the genes clustered together should mostly remain clustered together.
Adelme
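One way to check the claim above (representative IDs change, cluster membership mostly does not) is to compare two runs by membership only. The following is a minimal sketch, assuming each run can be reduced to (representative, gene) pairs; the function names and the input shape are illustrative and are not part of PPanGGOLiN or MMseqs2.

```python
from collections import defaultdict


def load_clusters(rows):
    """Group gene IDs by representative and return the set of member sets.

    `rows` is an iterable of (representative, gene) pairs (assumed shape).
    The representative IDs are discarded so only membership is compared.
    """
    clusters = defaultdict(set)
    for representative, gene in rows:
        clusters[representative].add(gene)
    return {frozenset(members) for members in clusters.values()}


def compare_runs(rows_a, rows_b):
    """Count clusters with identical membership and clusters unique to each run."""
    a, b = load_clusters(rows_a), load_clusters(rows_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
```

If `only_a` and `only_b` are both zero, the two runs differ at most in which member was picked as the representative.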
In the example above, when NC_011056_CDS_0040 was picked as the representative it caused a big change in the cluster composition, since several fragments aligned to it. I'm not really sure how common this effect is, though.
Hopefully it should be marginal. When the cluster representative is longer or shorter, or has a slightly different AA composition and the fragment is very close to the identity threshold, it might change the final results, but otherwise it should be globally similar.
At some point we tried to tackle the topic of "standardized" cluster representatives, to make sure the clustering remains as close as possible between multiple runs, for example by always picking the longest sequence or by building an "average" representative sequence based on an MSA of the family, but that looked like a never-ending source of trouble, so we gave up on it.
I think you're right, cases like this are probably rare. That said, I managed to reduce the problem by adding the --single-step-clustering flag to the mmseqs cluster execution. This makes the clustering process slower, as it skips the linclust step, but helps prevent cases like that. Maybe this could be an option in ppanggolin cluster?
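For illustration, the flag could be exposed as an optional switch when the mmseqs cluster command is assembled. This is only a sketch: build_cluster_command and its single_step parameter are hypothetical names, and the way PPanGGOLiN actually builds its MMseqs2 command may differ.

```python
def build_cluster_command(seq_db, clu_db, tmp_dir, identity=0.8, coverage=0.8,
                          cpu=1, single_step=False):
    """Return the argument list for an `mmseqs cluster` call (hypothetical helper).

    When `single_step` is True, --single-step-clustering is appended, which
    skips the linclust pre-clustering step at the cost of a slower run.
    """
    cmd = ["mmseqs", "cluster", seq_db, clu_db, tmp_dir,
           "--min-seq-id", identity, "-c", coverage, "--threads", cpu]
    if single_step:
        cmd.append("--single-step-clustering")
    # subprocess expects strings, so numeric options are converted here.
    return list(map(str, cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True), in the same style as the other MMseqs2 calls quoted in this thread.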
I suspect that using --mode 3 might alleviate the issue too, as it will pick the longest sequence as the representative.
I just found a fix for the inconsistent cluster representatives. MMseqs2 will provide deterministic cluster representatives as long as the input order doesn't change. In PPanGGOLiN, the order of the sequences in the gene FASTA file varies across different runs (I assume that's because of how Prodigal is parallelized). If we sort the FASTA file before mmseqs translatenucs, the clustering representative will always be the same.
My patchy solution was to add a seqkit sort command in the first_clustering function:
def first_clustering(sequences: io.TextIO, tmpdir: tempfile.TemporaryDirectory, cpu: int = 1, code: int = 11,
                     coverage: float = 0.8, identity: float = 0.8, mode: int = 1) -> (str, str):
    """
    …
    """
    sorted_sequences = tmpdir.name + '/sorted_gene_sequences.fna'
    cmd = list(map(str, ["seqkit", "sort", "-n", sequences.name, "-o", sorted_sequences]))
    logging.getLogger().debug(" ".join(cmd))
    logging.getLogger().info("Sorting gene sequences...")
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    seq_nucdb = tmpdir.name + '/nucleotid_sequences_db'
    cmd = list(map(str, ["mmseqs", "createdb", sorted_sequences, seq_nucdb]))
    …
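The same sorting step could also be done without an external tool. Below is a minimal pure-Python sketch, assuming the gene FASTA fits in memory; sort_fasta_by_name and record_id are hypothetical helpers, and unlike seqkit sort -n (which sorts by the full header) this sketch sorts by the first whitespace-delimited token of each header.

```python
def record_id(header):
    """Return the first whitespace-delimited token after '>' (empty if none)."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def sort_fasta_by_name(lines):
    """Return FASTA lines with records ordered by sequence ID.

    `lines` is any iterable of text lines; multi-line sequences are kept
    together with their header. Lines before the first header are ignored.
    """
    records = []
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, seq))
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, seq))

    # A fixed sort key makes the output independent of the input order.
    records.sort(key=lambda rec: record_id(rec[0]))

    sorted_lines = []
    for header, seq in records:
        sorted_lines.append(header)
        sorted_lines.extend(seq)
    return sorted_lines
```

Writing the returned lines to a file before the createdb call would give MMseqs2 the same input order on every run, provided the sequence IDs are unique.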
A more elegant solution could be to turn Pangenome._geneGetter into a sorted dictionary. The sorting key could be the gene accession/name.
Sounds pretty fixable to me, I'll look into adding this. I'm not sure about making Pangenome._geneGetter a sorted dictionary, but I think I can make the order in which genes are written or read constant relatively easily, if that is all it needs.
Seems like this is a simple solution. I was thinking of something like sortedcontainers, but I don't think this is worth adding another dependency. I think you could just sort the CDS list in write_gene_sequences_from_annotations using the CDS name as the key.
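A rough sketch of that idea follows. The Gene class and write_sorted_gene_sequences are hypothetical stand-ins used only for illustration; the real gene objects and the real write_gene_sequences_from_annotations signature in PPanGGOLiN are not shown in this thread and may look different.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical stand-in for a CDS record: an identifier and its DNA sequence."""
    ID: str
    dna: str


def write_sorted_gene_sequences(genes, handle):
    """Write genes as FASTA, ordered by ID, so the output order is stable."""
    for gene in sorted(genes, key=lambda g: g.ID):
        handle.write(f">{gene.ID}\n{gene.dna}\n")


# Two runs that yield the same genes in a different order produce the same file.
genes = [Gene("NC_2_CDS_0002", "ATGC"), Gene("NC_1_CDS_0001", "ATGA")]
first, second = io.StringIO(), io.StringIO()
write_sorted_gene_sequences(genes, first)
write_sorted_gene_sequences(list(reversed(genes)), second)
```

Sorting at write time needs no extra dependency and leaves the in-memory data structures unchanged.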
Thanks for working on this, @axbazin and @jpjarnoux
Do you guys have any plans to release a new version soon? I saw that you are adding a ton of stuff to PPanGGOLiN and just wanted to know if a release containing fixes is scheduled.
Hi, this was indeed fixed in the dev branch, but not yet released.
Unless there is a change of plan, the fix will be released with v2. There is no strict schedule yet, but it's nearly complete, so likely in 2023.
Thanks for the info! Good to hear that the next release is almost ready :)
I'm evaluating PPanGGOLiN on MGE genomes and I noticed that some genomes contained multiple genes within the same cluster, which you wouldn't expect for very compact genomes. Upon further investigation I noticed that this is due to frameshifts that split some genes into two parts, which are then clustered together by PPanGGOLiN's defragmentation process (which is great!).
However, upon executing the pipeline on the same set of genomes multiple times, I noticed that the number of occurrences of these "fake paralogs" varies a lot across executions. For example:
Execution 1:
Execution 2:
Could this be because of a non-deterministic behavior in the defragmentation algorithm? See below the data and code to reproduce the issue. Just run it a couple of times and you should see changes in the output.
votu_sequences.tar.gz