sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

useful python API functions #2059

Open bluegenes opened 2 years ago

bluegenes commented 2 years ago

1. A function that reads a sourmash signature file/database and raises exceptions if it runs into issues

db = sourmash.load_file_as_index(filename)

returns None if the file cannot be loaded.

db objects are Index classes and have the following two key methods:

2. A function that creates a signature, where we can provide some parameters such as scaling-factor, k-size, ...

Initialize a MinHash:

mh = sourmash.MinHash(0, ksize=..., scaled=...)

Add sequence to the minhash:

mh.add_sequence(seq) (DNA) or mh.add_protein(seq) (protein)

Build the minhash into a sourmash signature:

ss = sourmash.SourmashSignature(mh, name=...)

3. A function that compares signatures and returns the results + 4. A function to approximate Jaccard results to ANI

Using the recently added FracMinHashComparison dataclass:

from sourmash.sketchcomparison import FracMinHashComparison

cmp = FracMinHashComparison(minhashA, minhashB, cmp_scaled=comparison_scaled_value)

jaccard = cmp.jaccard
average_containment = cmp.avg_containment
max_containment = cmp.max_containment

# estimate ani values
cmp.estimate_all_containment_ani()

# then get the values
avg_containment_ani = cmp.avg_containment_ani
max_containment_ani = cmp.max_containment_ani

Notes:

  • In some cases, two sketches are likely to share no hashes by chance alone. If this likelihood exceeds a threshold, we consider the comparison a likely false negative. This is checked during ANI estimation and stored as cmp.potential_false_negative (boolean).
  • The ANI value will be null when the genome size estimated from the sketch is inaccurate (e.g. for very small genomes or too large a scaled value).
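
To make the quantities above concrete, here is a minimal pure-Python sketch of how Jaccard and containment follow from two sets of hashes, and of the usual point estimates that map them to ANI for k-mers of size `ksize`. The function names are hypothetical, and the formulas are the simple point estimates under an independent-mutation model. This thread does not state whether sourmash uses exactly these internally, and the false-negative and size-accuracy checks described in the notes are not reproduced here.

```python
def compare_hashes(hashes_a, hashes_b):
    """Jaccard and containment from two non-empty collections of hashes."""
    a, b = set(hashes_a), set(hashes_b)
    shared = len(a & b)
    containment_a_in_b = shared / len(a)
    containment_b_in_a = shared / len(b)
    return {
        "jaccard": shared / len(a | b),
        "avg_containment": (containment_a_in_b + containment_b_in_a) / 2,
        "max_containment": max(containment_a_in_b, containment_b_in_a),
    }


def containment_to_ani(containment, ksize):
    """Point estimate: a k-mer survives with probability ANI ** ksize."""
    if containment <= 0:
        return 0.0
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Point estimate that first rescales Jaccard to 2J / (1 + J)."""
    if jaccard <= 0:
        return 0.0
    return (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / ksize)
```

For example, `compare_hashes({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})` shares 2 hashes out of a union of 8, so Jaccard is 0.25 and the maximum containment is 0.5.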

Full API Reference: https://sourmash.readthedocs.io/en/latest/api.html

nmb85 commented 1 year ago

Has function 2 been added to the Python API yet? I have a bare list of minhashes in a CSV file that I'd like to convert into a signature object for easy db searches via the CLI. I would supply parameters for k-size, abund, etc.

ctb commented 1 year ago

Yes, should work!

ctb commented 1 year ago

(all of these functions are current)