sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Add human reference genome to prepared databases #2717

Open dportik opened 1 year ago

dportik commented 1 year ago

Hi Titus et al., given the recent fiasco related to mapping reads to microbial databases without human references (links at bottom), it might be a good time to create a small human genome database for use with sourmash. A standalone database on the database page would be ideal, so that researchers can include it alongside the other databases of interest.

Thanks for considering!

Social media discussion: https://twitter.com/StevenSalzberg1/status/1686350449069244416

Preprint: https://doi.org/10.1101/2023.07.28.550993

luizirber commented 1 year ago

On the "raw" side [^1], there are both GRCh38.p14 and T2T-CHM13v2.0 signatures in wort. Would that work?

[^1]: just downloaded the data and calculated a signature, no other pre-processing like repeat masking

dportik commented 1 year ago

Yep! Those should be plenty.

ctb commented 3 months ago

Repo to sketch hg38, including all unmapped chromosomes: https://github.com/ctb/2024-human-sketch

ctb commented 3 months ago

Note: for decontaminating human WGS samples, see https://github.com/sourmash-bio/sourmash/issues/3151

ctb commented 3 months ago

download at: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/hg38/hg38-entire.sig.zip