Non-deterministic clustering (possibly due to defragmentation)

apcamargo commented 1 year ago

I'm evaluating PPanGGOLiN on MGE genomes and I noticed that some genomes contained multiple genes within the same cluster, which you wouldn't expect for very compact genomes. Upon further investigation I noticed that this is due to some frameshifts that split some genes in two parts that are then clustered together due to PPanGGOLiN's defragmentation process (which is great!)

However, upon executing the pipeline on the same set of genomes multiple times, I noticed that the number of occurrences of these "fake paralogs" vary a lot across executions. For example:

Execution 1:

$ awk 'FNR>1 && $7>1 {print $0}' votu_sequences_ppanggolin/projection/*

JF937096_CDS_0027   JF937096    27523   27918   +   KX817173_CDS_0028   2   shell   000
JF937096_CDS_0028   JF937096    27878   28468   +   KX817173_CDS_0028   2   shell   000
MK016491_CDS_0085   MK016491    54635   55321   +   NC_022065_CDS_0083  2   persistent0 0   0
MK016491_CDS_0086   MK016491    55790   56464   +   NC_022065_CDS_0083  2   persistent0 0   0

Execution 2:

$ awk 'FNR>1 && $7>1 {print $0}' votu_sequences_ppanggolin/projection/*

JF937096_CDS_0027   JF937096    27523   27918   +   MG872843_CDS_0027   2   shell   000
JF937096_CDS_0028   JF937096    27878   28468   +   MG872843_CDS_0027   2   shell   000
JN006062_CDS_0040   JN006062    34311   34775   -   NC_011056_CDS_0040  2   shell   000
JN006062_CDS_0043   JN006062    35202   35414   -   NC_011056_CDS_0040  2   shell   000
JN391441_CDS_0039   JN391441    33658   34122   -   NC_011056_CDS_0040  2   shell   000
JN391441_CDS_0040   JN391441    34119   34382   -   NC_011056_CDS_0040  2   shell   000
KT020852_CDS_0039   KT020852    33927   34391   -   NC_011056_CDS_0040  2   shell   000
KT020852_CDS_0042   KT020852    34818   35030   -   NC_011056_CDS_0040  2   shell   000
KX834009_CDS_0039   KX834009    33697   34161   -   NC_011056_CDS_0040  2   shell   000
KX834009_CDS_0042   KX834009    34588   34800   -   NC_011056_CDS_0040  2   shell   000
KY549152_CDS_0040   KY549152    33655   34119   -   NC_011056_CDS_0040  2   shell   000
KY549152_CDS_0043   KY549152    34546   34758   -   NC_011056_CDS_0040  2   shell   000
MF919506_CDS_0038   MF919506    34216   34680   -   NC_011056_CDS_0040  2   shell   000
MF919506_CDS_0041   MF919506    35107   35319   -   NC_011056_CDS_0040  2   shell   000
MF919525_CDS_0037   MF919525    34172   34636   -   NC_011056_CDS_0040  2   shell   000
MF919525_CDS_0040   MF919525    35063   35275   -   NC_011056_CDS_0040  2   shell   000
MG757160_CDS_0041   MG757160    34198   34662   -   NC_011056_CDS_0040  2   shell   000
MG757160_CDS_0044   MG757160    35089   35301   -   NC_011056_CDS_0040  2   shell   000
MG872831_CDS_0038   MG872831    33049   33513   -   NC_011056_CDS_0040  2   shell   000
MG872831_CDS_0041   MG872831    33940   34152   -   NC_011056_CDS_0040  2   shell   000
MG872832_CDS_0038   MG872832    33049   33513   -   NC_011056_CDS_0040  2   shell   000
MG872832_CDS_0041   MG872832    33940   34152   -   NC_011056_CDS_0040  2   shell   000
MG872837_CDS_0041   MG872837    34195   34659   -   NC_011056_CDS_0040  2   shell   000
MG872837_CDS_0044   MG872837    35086   35298   -   NC_011056_CDS_0040  2   shell   000
MG872843_CDS_0039   MG872843    34159   34623   -   NC_011056_CDS_0040  2   shell   000
MG872843_CDS_0042   MG872843    35050   35262   -   NC_011056_CDS_0040  2   shell   000
MH000607_CDS_0039   MH000607    33943   34407   -   NC_011056_CDS_0040  2   shell   000
MH000607_CDS_0042   MH000607    34834   35046   -   NC_011056_CDS_0040  2   shell   000
MH371112_CDS_0041   MH371112    34712   35176   -   NC_011056_CDS_0040  2   shell   000
MH371112_CDS_0044   MH371112    35603   35815   -   NC_011056_CDS_0040  2   shell   000
MH399778_CDS_0038   MH399778    33863   34327   -   NC_011056_CDS_0040  2   shell   000
MH399778_CDS_0041   MH399778    34754   34966   -   NC_011056_CDS_0040  2   shell   000
MH536829_CDS_0041   MH536829    34195   34659   -   NC_011056_CDS_0040  2   shell   000
MH536829_CDS_0044   MH536829    35086   35298   -   NC_011056_CDS_0040  2   shell   000
MH590587_CDS_0039   MH590587    34357   34821   -   NC_011056_CDS_0040  2   shell   000
MH590587_CDS_0042   MH590587    35248   35460   -   NC_011056_CDS_0040  2   shell   000
MH779503_CDS_0041   MH779503    34712   35176   -   NC_011056_CDS_0040  2   shell   000
MH779503_CDS_0044   MH779503    35603   35815   -   NC_011056_CDS_0040  2   shell   000
MK016491_CDS_0085   MK016491    54635   55321   +   MT114165_CDS_0086   2   persistent0 0   0
MK016491_CDS_0086   MK016491    55790   56464   +   MT114165_CDS_0086   2   persistent0 0   0
MK433262_CDS_0038   MK433262    34430   34894   -   NC_011056_CDS_0040  2   shell   000
MK433262_CDS_0041   MK433262    35321   35533   -   NC_011056_CDS_0040  2   shell   000
MK559429_CDS_0040   MK559429    33947   34414   -   NC_011056_CDS_0040  2   shell   000
MK559429_CDS_0044   MK559429    35203   35415   -   NC_011056_CDS_0040  2   shell   000
MN586019_CDS_0039   MN586019    33946   34410   -   NC_011056_CDS_0040  2   shell   000
MN586019_CDS_0042   MN586019    34837   35049   -   NC_011056_CDS_0040  2   shell   000
MN586032_CDS_0039   MN586032    34455   34919   -   NC_011056_CDS_0040  2   shell   000
MN586032_CDS_0042   MN586032    35346   35558   -   NC_011056_CDS_0040  2   shell   000
MN586034_CDS_0041   MN586034    34195   34659   -   NC_011056_CDS_0040  2   shell   000
MN586034_CDS_0044   MN586034    35086   35298   -   NC_011056_CDS_0040  2   shell   000
MN586037_CDS_0041   MN586037    34195   34659   -   NC_011056_CDS_0040  2   shell   000
MN586037_CDS_0044   MN586037    35086   35298   -   NC_011056_CDS_0040  2   shell   000
MT723943_CDS_0040   MT723943    33416   33880   -   NC_011056_CDS_0040  2   shell   000
MT723943_CDS_0043   MT723943    34307   34519   -   NC_011056_CDS_0040  2   shell   000
MZ622157_CDS_0041   MZ622157    34709   35173   -   NC_011056_CDS_0040  2   shell   000
MZ622157_CDS_0042   MZ622157    35170   35382   -   NC_011056_CDS_0040  2   shell   000
MZ622158_CDS_0040   MZ622158    35181   35645   -   NC_011056_CDS_0040  2   shell   000
MZ622158_CDS_0043   MZ622158    36072   36284   -   NC_011056_CDS_0040  2   shell   000
NC_021305_CDS_0039  NC_021305   34167   34631   -   NC_011056_CDS_0040  2   shell   000
NC_021305_CDS_0042  NC_021305   35058   35270   -   NC_011056_CDS_0040  2   shell   000
NC_021311_CDS_0038  NC_021311   34214   34678   -   NC_011056_CDS_0040  2   shell   000
NC_021311_CDS_0041  NC_021311   35105   35317   -   NC_011056_CDS_0040  2   shell   000
NC_022969_CDS_0038  NC_022969   34216   34680   -   NC_011056_CDS_0040  2   shell   000
NC_022969_CDS_0041  NC_022969   35107   35319   -   NC_011056_CDS_0040  2   shell   000
NC_022981_CDS_0037  NC_022981   34172   34636   -   NC_011056_CDS_0040  2   shell   000
NC_022981_CDS_0040  NC_022981   35063   35275   -   NC_011056_CDS_0040  2   shell   000
NC_023689_CDS_0040  NC_023689   33685   34149   -   NC_011056_CDS_0040  2   shell   000
NC_023689_CDS_0043  NC_023689   34576   34788   -   NC_011056_CDS_0040  2   shell   000
NC_041846_CDS_0039  NC_041846   33383   33847   -   NC_011056_CDS_0040  2   shell   000
NC_041846_CDS_0042  NC_041846   34274   34486   -   NC_011056_CDS_0040  2   shell   000
OL742559_CDS_0041   OL742559    34195   34659   -   NC_011056_CDS_0040  2   shell   000
OL742559_CDS_0044   OL742559    35086   35298   -   NC_011056_CDS_0040  2   shell   000
ON970597_CDS_0041   ON970597    34195   34659   -   NC_011056_CDS_0040  2   shell   000
ON970597_CDS_0044   ON970597    35086   35298   -   NC_011056_CDS_0040  2   shell   000

Could this be because of a non-deterministic behavior in the defragmentation algorithm? See below the data and code to reproduce the issue. Just run it a couple of times and you should see changes in the output.

votu_sequences.tar.gz

awk -v OFS="\t" 'FNR==1 { print substr($1,2),FILENAME }' votu_sequences/* > votu_sequences.tsv

ppanggolin annotate -c 64 -o votu_sequences_ppanggolin --basename votu_sequences -p meta --fasta votu_sequences.tsv
ppanggolin cluster -c 64 -p votu_sequences_ppanggolin/votu_sequences.h5 --identity 0.8 --coverage 0.8
ppanggolin graph -c 64 -p votu_sequences_ppanggolin/votu_sequences.h5
ppanggolin partition -c 64 -K 3 --chunk_size 100000 -p votu_sequences_ppanggolin/votu_sequences.h5
ppanggolin write -o votu_sequences_ppanggolin -p votu_sequences_ppanggolin/votu_sequences.h5 --projection --force

awk 'FNR>1 && $7>1 {print $0}' votu_sequences_ppanggolin/projection/*

apcamargo commented 1 year ago

After investigating some more, I don't think this is due to the dereplication process itself. It seems that the cluster representative (which I got from the family column of the projection files) changes from execution to execution, even when I'm using --no_defrag. Maybe the way MMseqs2 picks a cluster representative is not deterministic?

axbazin commented 1 year ago

Hi,

The "defragmentation" algorithm is deterministic, given the same identical clustering it should provide the same results, if not it's a bug. The cluster representative though is indeed not, we've observed the same thing through time when using MMseqs2.

Normally, the ID should change between runs, but the genes clustered together should remain mostly clustered together.

Adelme

apcamargo commented 1 year ago

In the example above, when NC_011056_CDS_0040 was picked as the representative it caused a big change in the cluster composition, since several fragments aligned to it. I'm not really sure how common this effect is, though.

axbazin commented 1 year ago

Hopefully it should be marginal. When the cluster representative is longer or shorter, or has a slightly different AA composition and the fragment is very close to the identity threshold, it might change the final results, but otherwise it should be globally similar.

At some point we tried to solve the topic of having "standardized" cluster representatives so make sure the clustering remains as close as possible between multiple runs, like picking always the longest sequence, or build an "average" representative sequence based on a MSA of the family, but that was looking like a never-ending souce of troubles so we gave up on it.

apcamargo commented 1 year ago

I think you're right, cases like this are probably rare. That said, I managed to reduce the problem by adding the --single-step-clustering flag to the mmseqs cluster execution. This makes the clustering process slower, as it skips the linclust step, but helps preventing cases like that. Maybe this could be an option in ppanggolin cluster?

I suspect that using --mode 3 might alleviate the issue too, as it will pick the longest sequence as the representative.

apcamargo commented 1 year ago

I just found a fix for the inconsistent cluster representatives. MMseqs2 will provide deterministic cluster representatives as long as the input order doesn't change. In PPanGGOLiN, the order of the sequences in the gene FASTA file varies across different runs (I assume that's because of how Prodigal is parallelized). If we sort the FASTA file before mmseqs translatenucs the clustering representative will always be the same.

My patchy solution was to add a seqkit sort command in the first_clustering function:

def first_clustering(sequences: io.TextIO, tmpdir: tempfile.TemporaryDirectory, cpu: int = 1, code: int = 11,
                     coverage: float = 0.8, identity: float = 0.8, mode: int = 1) -> (str, str):
    """
    …
    """
    sorted_sequences = tmpdir.name + '/sorted_gene_sequences.fna'
    cmd = list(map(str, ["seqkit", "sort", "-n", sequences.name, "-o", sorted_sequences]))
    logging.getLogger().debug(" ".join(cmd))
    logging.getLogger().info("Sorting gene sequences...")
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    seq_nucdb = tmpdir.name + '/nucleotid_sequences_db'
    cmd = list(map(str, ["mmseqs", "createdb", sorted_sequences, seq_nucdb]))
…

A more elegant solution could be to turn Pangenome._geneGetter into a sorted dictionary. The sorting key could be the gene accession/name.

axbazin commented 1 year ago

Sounds pretty fixable to me, I'll take a look to add this. Not sure about making Pangenome._geneGetter a sorted dictionnary but I think I can make the order in which genes are written or read constant relatively easily, if that is all it needs.

apcamargo commented 1 year ago

Seems like this is a simple solution. I was thinking of something like sortedcontainers, but I don't think this is worth adding another dependency. I think you could just sort the CDS list in write_gene_sequences_from_annotations using the CDS name as the key.

apcamargo commented 1 year ago

Thanks for working on this, @axbazin and @jpjarnoux

Do you guys have any plans to release a new version soon? I saw that you are adding a ton of stuff to PPanGGOLiN and just wanted to know if a release containing fixes is scheduled.

axbazin commented 1 year ago

Hi, this was indeed fixed in the dev branch, but not yet released.

Unless there is change of plan the fix will get released with v2, there is no strict schedule yet but it's nearly complete, so likely in 2023.

apcamargo commented 1 year ago

Thanks for the info! Good to hear that the next release is almost ready :)

labgem / PPanGGOLiN

Non-deterministic clustering (possibly due to defragmentation) #116