Non-deterministic clustering (possibly due to multi-threading)

ktmeaton commented 2 months ago

Similar to Issue #116, I've noticed that cluster results vary between runs even when the input order is sorted. It seems to be related to multi-threading in mmseqs, particularly the step that selects the representative sequence. However, if I restrict both ppanggolin and mmseqs to just one thread, the clustering is deterministic and I get 100% reproducibility across runs!

export MMSEQS_NUM_THREADS=1
ppanggolin cluster -p pangenome.h5 --cpu 1 --mode 2 --identity 0.90 --coverage 0.90
ppanggolin write_pangenome -p pangenome.h5 --families_tsv --output . --force

I found the variable MMSEQS_NUM_THREADS in the MMseqs2 user guide: https://mmseqs.com/latest/userguide.pdf
I verified exact cluster reproducibility with sha256sum gene_families.tsv across 10 runs.
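The same check can be scripted when comparing many runs. This is a minimal sketch, assuming each run's gene_families.tsv has been copied to its own path; the helper names `file_sha256` and `runs_are_identical` are hypothetical and not part of PPanGGOLiN.

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_are_identical(paths):
    """True only if every file in `paths` has the same digest.

    An empty list returns False, since there is nothing to compare.
    """
    return len({file_sha256(p) for p in paths}) == 1
```

Calling `runs_are_identical` on the output files collected from the 10 runs gives the same answer as comparing the sha256sum lines by eye.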
I think MMSEQS_NUM_THREADS is currently required because the call to mmseqs result2repseq doesn't pass the cpu argument along, so mmseqs uses all available threads, which causes the inter-run variability: https://github.com/labgem/PPanGGOLiN/blob/d49dd5d5b808f822d54047928097d842423f0b72/ppanggolin/cluster/cluster.py#L98
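For illustration, forwarding the thread count could look roughly like the sketch below. It assumes the `--threads` option that MMseqs2 modules accept; the function name, argument names, and database names are hypothetical and this is not the code at the linked line.

```python
def build_result2repseq_cmd(seq_db, result_db, out_db, threads):
    """Build an mmseqs result2repseq command with an explicit thread count.

    Hypothetical helper: it only assembles the argument list and does not
    run mmseqs. The caller would hand the list to subprocess.run.
    """
    return [
        "mmseqs", "result2repseq",
        str(seq_db), str(result_db), str(out_db),
        # Forward the user's --cpu value instead of letting mmseqs
        # default to every available core.
        "--threads", str(threads),
    ]
```

With `threads=1` this would have the same effect on this step as exporting MMSEQS_NUM_THREADS=1, without relying on the environment.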

This isn't really an 'issue' that needs to be fixed; I just wanted to document a possible solution for anyone else who encounters this. I'm testing how different parameters affect the clustering (e.g. --identity 0.90) and wanted to control as many sources of random variation as I could.

jpjarnoux commented 2 months ago

Hi!

Thanks for pointing that out. I'm baffled by this because we added a check in the test dataset to make sure the number of clusters is as expected. I also remember a problem when checking whether gene families, representative genes, ... were in line with expectations. @jmainguy will confirm, but the clusters calculated locally differed from those produced by the GitHub action.

I'm going to try what you suggest. If the problem persists, we could open an issue on the MMseqs2 repository and work with them to solve it.

Thanks again

jpjarnoux commented 2 months ago

Hi!

I found out that the clustering step was not the problem. The issue was in the writing step: when we write genes to the gene_families.tsv file, they are randomly ordered. I changed the function to sort the gene families by size and the genes in alphabetical order. This way, the clustering output is the same between two runs.
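A minimal sketch of that kind of deterministic ordering is shown below. It is not the code from the PR: the two-column layout, the largest-first direction, and the name-based tie-break for equally sized families are all assumptions made for the example.

```python
def format_gene_families(families):
    """Return TSV lines in a reproducible order.

    `families` maps a family name to a collection of gene IDs.
    Families are ordered by size (largest first, an assumption), with
    the family name as a tie-break; genes are ordered alphabetically.
    """
    ordered = sorted(families.items(), key=lambda item: (-len(item[1]), item[0]))
    lines = []
    for family_name, genes in ordered:
        for gene in sorted(genes):
            lines.append(f"{family_name}\t{gene}")
    return lines
```

Because sets and dictionaries built in different runs may iterate in different orders, sorting at write time is what makes the file checksum stable even when the clusters themselves are identical.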

You can check this PR #265

Also, I added a checksum to the GitHub action based on your suggestion.

Regards

jpjarnoux commented 2 months ago

Hi! The fix has been included in PPanGGOLiN 2.1.1. I hope that helps.

ktmeaton commented 2 months ago

Thank you for troubleshooting! I will test out the new release.

labgem / PPanGGOLiN

Issue #263