sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

pairwise jaccard - distance - GPU #3016

Open kullrich opened 6 months ago

kullrich commented 6 months ago

Dear sourmash-bio team, this is a feature request. Unlike nt-genomes I am comparing protein based minhashes. Here, every single protein of one species and its corresponding minhash is compared to all proteins from a second species. The jaccard distance will than be used to do network-based clustering. Anyhow, the CPU based comparison is limited in parallelsation, so my feature request is related to transfer the pairwise jaccard distance calculation via CUDA on a GPU, to speed up species comparison. Could you provide this feature in sourmash?

Thank you in anticipation

Best regards

Kristian

Simple CPU serial code would like like this:

import screed
from sourmash import MinHash

species_a = 'species_a.fasta'
species_b = 'species_b.fasta'
K = [6, 7, 8, 9, 10, 11]
scale = [1, 10, 20, 100, 200]

def get_mh_array(fasta, K, scale):
  k_scale_dict = {}
  k_scale_dict['names'] = []
  for k in K:
    k_scale_dict[k] = {}
    for s in scale:
      k_scale_dict[k][s] = []
  for record in screed.open(fasta):
    k_scale_dict['names'] += [record.name.strip().split(' ')[0]]
    for k in K:
      for s in scale:
        mh = MinHash(n=0, ksize=k, is_protein=True, scaled=s)
        mh.add_protein(record.sequence)
        k_scale_dict[k][s] += [mh]
  return k_scale_dict

def pairwise_jaccard_from_array(a_names, a_mh_array, b_names, b_mh_array, outfile, min_jaccard=0.01, k='', s=''):
  with open(outfile, 'w') as outhandle:
    for i, i_mh in enumerate(a_mh_array):
      for j, j_mh in enumerate(b_mh_array):
        iname = a_names[i]
        jname = b_names[j]
        ij_jaccard = i_mh.jaccard(j_mh)
        if ij_jaccard > min_jaccard:
          outhandle.write(str(k)+'\t'+str(s)+'\t'+iname+'\t'+jname+'\t'+str(ij_jaccard)+'\n')

def pairwise_output(K, scale, a, b, ab_out):
  for k in K:
    for s in scale:
      pairwise_jaccard_from_array(a['names'],
        a[k][s],
        b['names'],
        b[k][s],
        ab_out+str(k)+'-'+str(s)+'.out',
        min_jaccard=0.01,
        k=str(k),
        s=str(s))

species_a_mh_array = get_mh_array(species_a, K, scale)
species_b_mh_array = get_mh_array(species_b, K, scale)

pairwise_output(K, scale, species_a_mh_array, species_b_mh_array, 'a-vs-b')
mr-eyes commented 6 months ago

This is not a detailed answer.

First, the serial code you provided could be easily parallelized with the Python concurrent or multiprocessing libraries. GPU is really not the solution here, and I would recommend using the sourmash CLI instead of the API, as the CLI provides the required functionalities.

However, in your case of pairwise comparisons, you might be interested in other solutions provided by sourmash.

  1. You can use sourmash sketch to sketch your data separately from processing. Then, use sourmash compare to compare your signatures, which supports parallelization. This will efficiently do the job for you.

  2. Use the sourmash plugin branchwater, where you can find the pairwise command that is more efficient than sourmash compare.

kullrich commented 6 months ago

Hi, thanks for the answer. I know about the multiprocessing library and I will give the branchwater plugin a try. Anyhow, since on a GPU one would have more than 1000 cuda cores to calculate the jaccard distances, with CPU one would only have up to 64/128 threads to use in parallel. Since the comparison is not only for two species but in best would be done on e.g. the ensembl vertebrate species (300 x 300 x 20.000 genes), in the end one would need to be as quick as diamond or last to be useful for the community.

ctb commented 6 months ago

hi @kullrich, a few more thoughts for you!

diamond and last do alignments, which might be more appropriate for distant homology/similarity analysis. It's not a use case we've explored specifically, since (at least for the moment) we're mostly focused on comparisons between longer sequences and/or higher similarity.

however, using k-mers has some distinct advantages over n**2 brute force comparisons in certain cases; in particular, you can use inverted indexes in situations with sparse similarities. For example, the branchwater plugin provides an inverted index that we have used to index TBs of metagenome data for realtime search - see branchwater.sourmash.bio for our demo case. @mr-eyes is in fact thinking about introducing that for pairwise, see https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/219.

extending sourmash to use CUDA is not something we've really thought about in any serious way, I'm afraid.

kullrich commented 6 months ago

Hi, to use pairwise kmer for species comparison is implemeted in pantools (https://pantools.readthedocs.io/en/stable/) , see publication here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2362-4

ctb commented 6 months ago

super cool, thanks for the reference!