qiyunzhu opened this issue 2 years ago
I encountered this RAM problem during dimensionality reduction of k-mer info (a 6.7 GB kmer.tsv), running on a PC with 128 GB RAM.
Hi @kingtom2016 Thanks for your interest! In this case you may consider: 1) filtering down the dataset (e.g., removing contigs < 1000 bp), 2) using a smaller k size, 3) working on a computer with more memory, or 4) downsampling k-mers (i.e., randomly selecting 1,000 of the ~1M columns in kmer.tsv); option 4 is sketched below.
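A minimal sketch of option 4, assuming a tab-separated kmer.tsv whose first column holds contig IDs; the output file name and the count of 1,000 are illustrative, not part of BinaRena itself:

```python
import random
import pandas as pd

# read only the header line to learn the column names
header = pd.read_csv("kmer.tsv", sep="\t", nrows=0).columns.tolist()

# keep the ID column plus 1,000 randomly chosen k-mer columns
kmer_cols = header[1:]
keep = [header[0]] + random.sample(kmer_cols, k=min(1000, len(kmer_cols)))

# load just those columns; peak memory scales with the sampled width
df = pd.read_csv("kmer.tsv", sep="\t", usecols=keep)
df.to_csv("kmer.sub.tsv", sep="\t", index=False)
```

Reading the header first means the full 6.7 GB table is never loaded; only the sampled columns are parsed.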
The question originally discussed in this issue is outdated. We had planned to let BinaRena do dimensionality reduction from the GUI; that is no longer the plan. Instead, we provide Python scripts that do it outside the GUI.
It would be interesting and useful to extract k-mers while reading assembly sequences (FASTA), then perform some simple dimensionality reduction (such as PCA) to build a meaningful scatter plot. Some infrastructure, such as pairwise distance calculation, is already implemented, but a few more algorithms, such as the covariance matrix (np.cov) and eigenvector/eigenvalue calculation (np.linalg.eig), are needed; a rough sketch follows.
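A bare-bones sketch of that pipeline, not BinaRena's actual implementation: count k-mers per sequence, then run PCA via the covariance matrix and eigendecomposition. The FASTA parsing is omitted and k = 5 is an illustrative choice; np.linalg.eigh is used in place of np.linalg.eig since a covariance matrix is symmetric.

```python
from itertools import product
import numpy as np

K = 5
KMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_freqs(seq):
    """Return a normalized 4^k k-mer frequency vector for one sequence."""
    vec = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K].upper())
        if idx is not None:  # skip k-mers containing ambiguous bases
            vec[idx] += 1
    total = vec.sum()
    return vec / total if total else vec

def pca(X, n_components=2):
    """Project rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)            # center each feature
    cov = np.cov(Xc, rowvar=False)     # 4^k x 4^k covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return Xc @ top                    # coordinates for the scatter plot

# e.g., X = np.array([kmer_freqs(s) for s in sequences]); xy = pca(X)
```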
This is potentially computationally intensive, and memory consumption is also an issue. If k = 5, there will be 4^5 = 1024 k-mers (features), and one probably can't store all of them for every contig. Sketching techniques such as minimizers (the idea behind Minimap) or MinHash may be useful; a sketch follows. @nujinuji @AbhinavChede
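A hedged illustration of the MinHash idea, under assumed parameters (s = 64 hashes per contig, Python's built-in hash, which is stable only within one process): each contig keeps only its s smallest k-mer hashes, so memory per contig is O(s) rather than O(4^k), and pairwise Jaccard similarity can still be estimated from the sketches.

```python
import heapq

def minhash_sketch(seq, k=5, s=64):
    """Return the s smallest hashes of a sequence's k-mers (bottom-s MinHash)."""
    hashes = {hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(heapq.nsmallest(s, hashes))

def jaccard_estimate(a, b):
    """Estimate Jaccard similarity of two contigs from their sketches."""
    s = min(len(a), len(b))
    merged = sorted(set(a) | set(b))[:s]   # s smallest hashes of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / s
```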
Not sure how useful this function is, since there are more dedicated dimensionality reduction tools for metagenomic binning. Needs @pavia27's input on the market status.