rhysnewell / rosella

Metagenomic Binning Algorithm
BSD 3-Clause "New" or "Revised" License
38 stars 3 forks source link

UMAP and HDBSCAN alternatives #25

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Hello Rosella team,

I notice that hdbscan also replies on python but actually there is a beautiful rust implementation (https://github.com/petabi/petal-clustering) and is paralleled when necessary. I am wondering whether it can be used (petal-decomposition also provides PCA).

Thanks,

Jianshu

rhysnewell commented 2 years ago

Hi Jianshu,

This would be great to use within Rosella, however there is no UMAP rust implementation yet. The rosella algorithm heavily relies on UMAP and HDBSCAN working together back and forth, thus it is currently easier to just keep both the UMAP and HDBSCAN within the separate python component flight. If a rust UMAP implementation appears with a fully fleshed out API then it would be worth re-writing that part of Rosella, but until then we'll have to leave it as it is.

Cheers, Rhys

rhysnewell commented 1 year ago

I tried these and they were not good closing now

jianshu93 commented 1 year ago

Hello Rhys,

If you check the paper published recently (On UMAP’s True Loss Function, https://proceedings.neurips.cc/paper_files/paper/2021/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf), you will find that UMAP loss function is kind of strange, which means it strengthens the repulsive force to allow more clustered visualization/clustering but actually loses high dimensional similarity in the embedded space, which is a key for clustering contigs. The improved UMAP section introduces a new loss function that considering nodes distances that are not linked in the k-graph (k-graph is the nearest neighbor search algorithm for finding closest neighbors of each point in the dataset, those not linked are represented by absence/0 similarity). We should evaluate how well UMAP preserves the original similarity before using it for clustering contigs.

Thanks

Jianshu

rhysnewell commented 1 year ago

Hi Jianshu,

Thanks for the paper, interesting read. Do you know if Leland McInnes is aware of this paper or any other maintainers of UMAP? Would be good if they added proper support for custom loss functions but looks like it is still in the works (https://umap-learn.readthedocs.io/en/latest/faq.html#can-i-add-a-custom-loss-function). I think this is fine, the UMAP algorithm is still relatively new and I had envisaged that there would be eventual changes/improvements to it. The first place those improvements would occur would be in the original library, so hopefully rosella improves with age but we shall see.

Cheers, Rhys

jianshu93 commented 1 year ago

Hello Rhys,

I think he knows but there are so many papers using UMAP (almost all singe cell RNA seq papers in the last 5 years) to infer cell types. It would be difficult to persuade them that the visualizations/clustering they saw is somehow an artifact of the high dimensional similarity. CoreSet seems to be even more interesting for clustering, which also breaks the O(n^2) complexity, check Daniel Baker's recently paper (https://dl.acm.org/doi/abs/10.1145/3459930.3469523). I am thinking whether we should apply this clustering idea for contig binning, e.g., in a kmeans++ case mentioned in the paper, or k-meidoid et.al..

Cheers,

Jianshu

rhysnewell commented 1 year ago

Thanks for the additional information, Jianshu.

Honestly, I'd be happy to depart from UMAP and HDBSCAN if strong alternatives exist. I'd be happy to investigate any alternatives you suggest as long they appear easy enough to implement/already have Python bindings.

I still think that UMAP and HDBSCAN perform quite well in the clustering department based on benchmarks but I'd love it if we could create an unsupervised binning algorithm on par with SemiBin2. Currently, Rosella falls just shy of SemiBin2 standards so cool algorithmic alternatives are very welcome. Happy to collab on this if you'd like

Cheers, Rhys

jianshu93 commented 11 months ago

Hello @rhysnewell ,

Sorry for the late response. It has been a busy semester for me. I feel like supervised binning is till not a good time. No matter what (deep) neural network is used the training data determines the final accuracy. However, with many genomes in public database (including GTDB, but this is a well maintained one) being questionable, I feel unsupervised binning is less subjected to database bias for now. I do not worry abundant MAGs in metagenomes, key is the less abundant(small coverage) ones like around 10X to 20X or below. See the attached 2 plots. one is binned through GraphMB (the cutting-edge assembly graph embedding + deep variational autoencoder) and another one is well know binning methods (metabat2, maxbin2, CONCOCT) + Das_tools refinement. I visualized them via t-SNE (umap will be available soon). As you can see, they are already very good, I believe they will be consistent with UMAP+hdbscan clustering for abundant ones (high coverage). The problem is still for low abundant ones. Check MAG in the second plot, to the left, green and pink MAG, they almost share contigs with very similar coverage and kmer composition distance. If UMAP separate them, then it will be very strange, because in the true/original kmer space they cannot be separated (if we use PCA for example), just because UMAP put more weight on the nearest neighbors, those contigs can be pulled apart. But this does not mean they are actually different from each other. We do not have enough information to tell. I read the SpaceMAP paper last month (https://proceedings.mlr.press/v162/zu22a.html), which makes a lot of sense when talking about the manifold structure and why UMAP will not work in some cases (continuous manifold). The underlying distortion of distance metric in UMAP (when compare original space with embedded space) has never been analytically expressed or validated. I feel this is basically similar to the "true loss function paper" I shared before, which suggest consider non-KNN graph edges in loss function, it is to consider the underlying distortion of distance metric.

That is when coverage is not high enough, the contig are short and contains less information to be accurately separated, then their neighbors play a role in UMAP, but all are short, low coverage contigs, we have no idea what their true kmer composition is so that we can separate them and no matter what lower dimension space we map them to. So I feel UMAP in those cases is just a statistical game, but I really love what you have done, I myself ran rosella several times and find is consistent with annoying 3 banner + refinement or graphMB for high coverage MAGS.

Thanks,

Jianshu

min17_gut_graphmb_tsne.pdf min17_mmgenome2_plot_tsne_legend.pdf