labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Non-deterministic clustering (possibly due to multi-threading) #263

Closed ktmeaton closed 2 months ago

ktmeaton commented 2 months ago

Similar to Issue #116, I've noticed that cluster results vary between runs even with input order sorting. It seems to be related to multi-threading in mmseqs, particularly the step that selects the representative sequence. However, if I restrict both ppanggolin and mmseqs to just one thread, I get 100% reproducibility across runs with deterministic clustering!

export MMSEQS_NUM_THREADS=1
ppanggolin cluster -p pangenome.h5 --cpu 1 --mode 2 --identity 0.90 --coverage 0.90
ppanggolin write_pangenome -p pangenome.h5 --families_tsv --output . --force

This isn't really an 'issue' that needs to be fixed; I just wanted to document a possible solution for anyone else who encounters this. I'm testing how different parameters affect the clustering (e.g. --identity 0.90) and wanted to control as many sources of random variation as I could.
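When checking reproducibility across runs, it can help to compare the clusterings themselves rather than the raw files, so that row order does not count as a difference. Below is a minimal sketch, assuming the families TSV is tab-separated with the family ID in the first column and the gene ID in the second; that layout and the helper names `load_partition` and `same_clustering` are illustrative assumptions, not part of PPanGGOLiN's API.

```python
import csv
import io
from collections import defaultdict


def load_partition(tsv_text):
    """Return the clustering as a set of frozensets of gene IDs.

    Assumption: each row is tab-separated, family ID in column 1 and
    gene ID in column 2; any further columns are ignored.
    """
    families = defaultdict(set)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2:
            families[row[0]].add(row[1])
    # Family names are dropped on purpose: only the gene groupings matter.
    return {frozenset(genes) for genes in families.values()}


def same_clustering(tsv_a, tsv_b):
    """True if both TSV texts group the genes into identical families."""
    return load_partition(tsv_a) == load_partition(tsv_b)


# Hypothetical toy data: run2 is run1 with rows shuffled, run3 regroups g2.
run1 = "famA\tg1\nfamA\tg2\nfamB\tg3\n"
run2 = "famB\tg3\nfamA\tg2\nfamA\tg1\n"
run3 = "famA\tg1\nfamB\tg2\nfamB\tg3\n"

print(same_clustering(run1, run2))
print(same_clustering(run1, run3))
```

Because family names are ignored, two runs that pick different representatives for the same group of genes still compare as equal; keep the names in the comparison if representative choice matters for your test.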

jpjarnoux commented 2 months ago

Hi!

Thanks for pointing that out. I'm baffled by this because we added a check on the test dataset to make sure the number of clusters is as expected. I also remember a problem when checking whether gene families, representative genes, ... were in line with expectations. @jmainguy will confirm, but the clusters calculated locally differed from those calculated in the GitHub Action.

I'm going to try what you suggest. If the problem persists, we could open an issue on the MMseqs2 repository and work with them to solve it.

Thanks again

jpjarnoux commented 2 months ago

Hi!

I found out that the clustering step was not the problem. The issue was in the writing step: when we write genes to the gene_families.tsv file, they are randomly ordered. I changed the function to sort the gene families by size and the genes in alphabetical order. This way, the written clustering is the same between two runs.

You can check PR #265.
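The idea of the fix, sorting before writing so that the output no longer depends on iteration order, can be sketched as follows. This is not the code from the PR. The function name `format_families`, the largest-first direction of the size sort, and the tie-break on family ID are all assumptions made for illustration.

```python
def format_families(families):
    """Render {family_id: iterable of gene IDs} as deterministic TSV text.

    Assumptions (not taken from the actual fix): larger families come
    first, and ties in size are broken by family ID.
    """
    ordered = sorted(
        families.items(),
        key=lambda item: (-len(item[1]), item[0]),
    )
    lines = []
    for family_id, genes in ordered:
        # Sets have no stable iteration order, so sort genes explicitly.
        for gene in sorted(genes):
            lines.append(f"{family_id}\t{gene}")
    return "\n".join(lines) + "\n"


# Hypothetical toy data: same content, different insertion order.
a = {"fam2": {"g3"}, "fam1": {"g2", "g1"}}
b = {"fam1": {"g1", "g2"}, "fam2": {"g3"}}

print(format_families(a) == format_families(b))
```

Without the tie-break on family ID, two families of equal size could still swap places between runs, which is why the sort key here has two components.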

Also, I added a checksum to the GitHub action based on your suggestion.
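A checksum test of this kind can be sketched as below. The thread does not say which algorithm or tooling the GitHub Action uses, so SHA-256 via Python's `hashlib` and the helper name `file_sha256` are assumptions for illustration only.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Assumption: SHA-256 is used here only as an example; the thread
    does not name the algorithm used in the actual CI check.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

In a test, the digest of a freshly written gene_families.tsv would be compared against a stored expected value. Such a check only passes reliably once the file is written in a deterministic order.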

Regards

jpjarnoux commented 2 months ago

Hi! The fix has been included in PPanGGOLiN 2.1.1. Hope that helps.

ktmeaton commented 2 months ago

Thank you for troubleshooting! I will test out the new release.