ktmeaton closed this issue 2 months ago
Hi!
Thanks for pointing that out. I'm baffled by this, because we added a check on the test dataset to make sure the number of clusters is as expected. I also remember a problem when checking whether gene families, representative genes, ... were in line with expectations. @jmainguy will confirm, but the clusters computed locally differed from those computed in the GitHub Action.
I'm going to try what you suggest. If the problem persists, we could open an issue on the MMseqs2 repository and work with them to solve it.
Thanks again
Hi!
I found out that the clustering step was not the problem. The issue was in the writing step: when we write genes to the gene_families.tsv file, they are written in random order. I changed the function to sort the gene families by size and the genes within each family alphabetically. This way, the clustering output is the same between two runs.
You can check this PR #265
Also, I added a checksum to the GitHub action based on your suggestion.
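To illustrate the idea behind the ordering fix, here is a minimal sketch. It is not the code from the PR: the function names, the two-column family/gene layout, and the tie-break on family name are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def ordered_family_rows(families: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Return (family, gene) rows in an order that does not depend on set/dict iteration."""
    # Materialise and sort the genes of each family alphabetically.
    sorted_genes = {name: sorted(genes) for name, genes in families.items()}
    # Largest families first; ties are broken by family name (an assumption of this sketch).
    ordered = sorted(sorted_genes.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(family, gene) for family, genes in ordered for gene in genes]


def write_gene_families(families: Dict[str, Iterable[str]], path: str) -> None:
    """Write the rows as tab-separated lines (hypothetical two-column layout)."""
    with open(path, "w") as handle:
        for family, gene in ordered_family_rows(families):
            handle.write(f"{family}\t{gene}\n")
```

With a fixed row order, two runs that produce the same clusters also produce byte-identical files, which is what makes a checksum comparison meaningful.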
Regards
Hi! The fix has been included in PPanGGOLiN 2.1.1. Hope that helps.
Thank you for troubleshooting! I will test out the new release.
Similar to Issue #116, I've noticed that cluster results vary between runs even with input order sorting. It seems to be related to multi-threading in mmseqs, particularly the step that selects the representative sequence. However, if I restrict both ppanggolin and mmseqs to just one thread, I get 100% reproducibility across runs with deterministic clustering!

MMSEQS_NUM_THREADS is described in the MMseqs2 user guide: https://mmseqs.com/latest/userguide.pdf

I checked reproducibility by comparing sha256sum gene_families.tsv across 10 runs.

MMSEQS_NUM_THREADS is currently required, because the call to mmseqs result2repseq doesn't pass the cpu argument along, so mmseqs uses all available threads, which causes the inter-run variability: https://github.com/labgem/PPanGGOLiN/blob/d49dd5d5b808f822d54047928097d842423f0b72/ppanggolin/cluster/cluster.py#L98

This isn't really an 'issue' that needs to be fixed; I just wanted to document a possible solution for anyone else who encounters this. I'm testing how different parameters affect the clustering (e.g. --identity 0.90) and wanted to control as many sources of random variation as I could.
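For anyone who wants to script the same check, here is a rough Python sketch of it. The helper names are made up for this example, and the command to run is left to the caller. The only things taken from above are the MMSEQS_NUM_THREADS variable and the idea of comparing SHA-256 digests of gene_families.tsv across runs.

```python
import hashlib
import os
import subprocess
from typing import Mapping, Optional, Sequence


def single_thread_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment and force mmseqs to use one thread."""
    env = dict(os.environ if base is None else base)
    env["MMSEQS_NUM_THREADS"] = "1"
    return env


def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(command: Sequence[str], output_file: str, runs: int = 10) -> bool:
    """Run `command` several times and report whether `output_file` is always identical.

    `command` is whatever ppanggolin invocation you normally use, with its own
    cpu option already set to one thread; this sketch does not build it for you.
    """
    digests = set()
    for _ in range(runs):
        subprocess.run(list(command), env=single_thread_env(), check=True)
        digests.add(sha256_of(output_file))
    return len(digests) == 1
```

The command should overwrite (or be pointed at) the same output file on every run, since only the digests are kept between iterations.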