sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

make a `sourmash sketch fromfile` to support large scale sketching. #1671

Closed ctb closed 2 years ago

ctb commented 3 years ago

ref https://github.com/sourmash-bio/sourmash/issues/1652, you want to be able to sketch certain ksizes/moltypes for certain identifiers only, but on a large scale. how can we best do this?

ctb commented 3 years ago

this may actually be a real world use case for having the full genomes be available as "signatures" per https://github.com/sourmash-bio/sourmash/issues/1647

ctb commented 3 years ago

a much simpler version is to just have the 'nomatch' file output by mom-extract-sigs be a full manifest...

bluegenes commented 3 years ago

a much simpler version is to just have the 'nomatch' file output by mom-extract-sigs be a full manifest...

One way this could work if we enabled it in sourmash -

  1. New utility:sourmash sig check (with picklist functionality) = similar to extract, but don't actually extract/load any signatures, just check for the right signatures and output a missing manifest/picklist if they don't exist.

  2. add --picklist to sketch, to enable building just missing sigs (and ultimately/ideally adding them to an existing zip file :)

With picklist in sketch, sketch could also internally check for the sig existence before sketching. Seems handy, at least as an option, but not sure this should be default. But enabling check as a separate step allows us to get around any issues where the genome needs to be downloaded prior to sketching.

ctb commented 3 years ago

I like sourmash sig check or something like it!

Adding --picklist to sketch is much harder, because at least some of the picklist values (md5sum, for example) don't exist until after the signature is calculated. So we'd have to special case picklists to deal with that. Not impossible, just doesn't feel quite right.

Backing up a bit, the functionality we need looks something like this:

(1) given some input specification, one or more of -

(2) then inspect a manifest and figure out what remains to be calculated,

(3) and then go do it.


for (1), I think we can probably hack it together fairly easily.

for (2), there's some interesting interaction to work out, but it should be easy to say "you want X? X isn't in the manifest."

for (3), we need to have some way to connect identifiers to files, and to make things worse, we need to connect the same identifier to at least two different files - DNA and protein.


so for now it seems like there are several steps to work out 😄

bluegenes commented 2 years ago

Returning to this because I really would like a way to name sigs when building many at once, e.g. from a directory or a file list. I guess we could add something like --from-csv to sketch or sourmash sig rename, but enabling naming while sketching via picklist would be far cleaner.

Perhaps what we want is a different standard picklist format, e.g. sketchlist with standard column names for sketching. Columns I would envision allowing:

We could unite this with --from-file handling for sketching, if desired -- e.g. if the most basic type of sketchlist is a file with a single column of filenames, then just check for the right column names first, if none, fall back to current --from-file behavior). But maybe this overcomplicates.

While at it, I would imagine we would also want to add a --manifest output file option to sketch, so we could output all sketch info in a format that can be used as a picklist downstream. As an added bonus, this also would create a nice output file a snakemake sketching --sketchlist rule.

I guess something I'm implicitly stating here is that I care far less about being able to select things from a sketchlist - I would be happy to have individual lines for each sketch I'd like, even if that makes the file slightly less human-friendly (though param_string column would make that easier). What I think we really need is a way to specify sketches properly via this sort of file.

use case for ref: Building large databases from many signatures makes for slow and/or intractable snakemake DAGs. I can use file lists for sketching and zipping, but I can't think of a way to (batch) name the sigs properly via sourmash cli. Am I forgetting something obvious!?

edit: https://github.com/sourmash-bio/sourmash/issues/1315 could be a way to make this work

ctb commented 2 years ago

I haven't thought about this for a while, so apologies if this is obviously wrong - but, over in https://github.com/sourmash-bio/sourmash/issues/1647, I made the comment that

A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.

I wonder if we could some thing where we unify this idea with selection/sketching and do something where when FASTA sequences are stored in signatures or directories or zipfiles, we can calculate the MinHash sketches and output them somewhere.

At first glance it seems outside the current sourmash architecture but ...


separately, I worry about integrating large-batch sketch calculation into sourmash's Python implementation because it would not be parallelized (due to Python limitations).

bluegenes commented 2 years ago

We could make a base class SourmashSignatureInfo, or the like, which contains properties such as name, source/inputfile, destination/outputfile etc (all non-sketching components of SourmashSignature). This could be a great time/place to add dnafile and proteinfile as separate inputfile options, as we suggested wanting above. This would store everything for sketching except for the param string(s), which we could either also store (seems handy), or keep track of separately.

To support the use case you're thinking of, we could then have two classes that inherit from the base class -- the fasta sketchtype, and the FracMinHash sketchtype (our current SourmashSignature). Ofc this leaves the door open for additional sketchtypes.

For the use case I've been thinking of, the base class could be used to store info from --sketchlist csv input. We could build signature info for all desired sigs (from cli + sketchlist input), then build signatures. This would also enable the sig check --> building signatures utility discussed above.

separately, I worry about integrating large-batch sketch calculation into sourmash's Python implementation because it would not be parallelized (due to Python limitations).

two thoughts:

  1. What are the barriers to future parallelization in rust?
  2. Parallelization via snakemake batching makes this worthwhile, even if sourmash is working single threaded -- massive numbers of snakemake jobs are still not tractable.
ctb commented 2 years ago

This is not a response to the above yet, but it reminds me of an idle thought I had yesterday - what about a new command sourmash sketch fromfile that would support a custom format containing the necessary information - dna, protein, ksize, etc. - and this could be matched against a bunch of manifests before deciding which new signatures are created?

bluegenes commented 2 years ago

This is not a response to the above yet, but it reminds me of an idle thought I had yesterday - what about a new command sourmash sketch fromfile that would support a custom format containing the necessary information - dna, protein, ksize, etc. - and this could be matched against a bunch of manifests before deciding which new signatures are created?

this is the exact behavior I'd want, and I would happily use it in this format. It would be an added bonus to avoid needing to completely separate dna vs protein sketching.

ctb commented 2 years ago

this is the exact behavior I'd want, and I would happily use it in this format. It would be an added bonus to avoid needing to completely separate dna vs protein sketching.

excellent ;).

ctb commented 2 years ago

so, I poked around a little bit with this today, and came up with the following Python syntax:

sketch_spec1 = MinHashSpec(ksizes=[31], output_type="DNA")
sketch_spec2 = MinHashSpec(ksizes=[10], output_type="protein")

source1 = SketchFromFile(name="sketch name goes here",
                  protein_filename="data/some_prot_file.faa",
                  dna_filename="data/GCF_000005845.2_ASM584v2_genomic.fna.gz")

source2 = SketchFromRecords(map_record_to_name=fn_or_mapping,
                  protein_filename="some_big_file_of_proteins.faa")

missing = check_specs_against_manifests([sketch_spec1, sketch_spec2],
                                        [source1, source2], manifests)

The idea is that you build spec objects that detail the various sketch types you're interested in building, and then cross-product them with the data sources, to create a list of the actual sketches you want built. This can then be checked against manifests of existing signatures.

Questions:

  1. what does the missing object allow you to do here, in terms of actually generating the sketches? Some ideas (not mutually exclusive):
    • it could produce a set of command lines that could then be executed (with e.g. GNU parallel)
    • it could produce a set of output files that should be created (for snakemake compatibility)
    • it could support 'check only' mode where it fails/succeeds based on whether anything is missing (again, snakemake compatibility)
    • it could provide functions to actually, y'know, build the missing sketches
  2. if we support generating the sketches, what is the output save format for the sketches? we could just support all of the formats supported by sourmash CLI (that is, the .sig, .sig.gz, directory/, and .zip formats). Or just limit to .zip, which comes with built-in manifests.

Lots of additional thoughts too -

bluegenes commented 2 years ago

The idea is that you build spec objects that detail the various sketch types you're interested in building, and then cross-product them with the data sources, to create a list of the actual sketches you want built. This can then be checked against manifests of existing signatures.

I like it. This is definitely the separation we need!

what does the missing object allow you to do here, in terms of actually generating the sketches? Some ideas (not mutually exclusive): it could produce a set of command lines that could then be executed (with e.g. GNU parallel) it could produce a set of output files that should be created (for snakemake compatibility) it could support 'check only' mode where it fails/succeeds based on whether anything is missing (again, snakemake compatibility) it could provide functions to actually, y'know, build the missing sketches

if we support generating the sketches, what is the output save format for the sketches? we could just support all of the formats supported by sourmash CLI (that is, the .sig, .sig.gz, directory/, and .zip formats). Or just limit to .zip, which comes with built-in manifests.

I think it would be ok to only enable .zip. This functionality is most useful for large numbers of files, and at that level, zips are far more useful than individual sigfiles. Should we want to use sigs individually downstream, picklists enable easy extraction for outputting sigfiles or selecting on the fly (e.g. prior to search/gather).

Do we want to enable something separate like sig append that could append sigfiles (any format) to an existing zipfile without rewriting all existing sigs via sig cat? Seems like it might be useful in a few cases. I think this is a separate issue though :)

do we want to support declarative syntaxes like YAML for this, or is the Python code an OK place to start? (I chose to start with the Python code version of this because it was easy for me to write and then validate syntactically via execution (the above code parses just fine :). )

starting with python code seems good! Not terribly hard to add file consumption after the structure is in place, right?

Ultimately, for source (SketchFromFile, SketchFromRecords), csv input seems useful, since each name or ident might have dnafilename and proteinfilename. yaml could work too, just seems harder for many files if we link the dna and protein inputfiles.

For spec, yaml could be a good fit, since we're usually using the same spec for all sigs in a collection.

Here's an idea for format:

MinHashSpec:
  dna:
    - ksize: [21,31,51]
    - scaled: 1000
  protein:
    - input: [dna, protein]
    - ksize: 10
    - scaled: [100, 200]
  dayhoff:
    - ksize: 16

...with dna internally only allowing dna input and anything missing falling back to defaults. It's pretty easy to allow single input or lists within the yaml for ksize, input, and scaled. I think it's useful to let each alphabet be separate, since different ksize/scaled make sense. With this spec, we could execute only the alphabets that are specified.

The only thing I don't like is that if we use different formats (csv, yaml) for specification, then we can't output a single file for consumption with sketchfromfile. BUT, this minor annoyance is likely overshadowed by the fact that it is VERY handy to be able to easily edit a yaml to build a new alphabet/ksize sketch for many files.

luizirber commented 2 years ago

may I suggest toml instead of YAML? It would look something like this:

[[MinHashSpec.dna]]
ksize = [ 21, 31, 51 ]
scaled = 1_000

[[MinHashSpec.protein]]
input = [ "dna", "protein" ]
ksize = 10
scaled = [ 100, 200 ]

[[MinHashSpec.dayhoff]]
ksize = 16

Many places (including Python with pyproject.toml) are moving away from YAML because it has some confusing parsing issues (due to being underspecified).

ctb commented 2 years ago

this issue has come back to the forefront of my brain because of https://github.com/dib-lab/genome-grist/pull/130, where the construction of a private database is ...annoying because it's hard to properly name signatures in bulk.

One thing that I did in https://github.com/dib-lab/genome-grist/pull/130 that I kinda liked was to have the copy_private_genomes utility scan a pile of FASTA files and then output a CSV in a way that made it easy to edit and consume downstream. I wonder if we could take that approach here, in terms of (one way of) generating a list of FASTA sources for signatures?

ctb commented 2 years ago

A specific proposal -

I'm leaning towards implementing something like sourmash sketch fromfile that takes in a CSV file that is a list of sequence sources. Then we can do the standard param string thing (-p k=31 etc) with it. Crucially, this CSV file would have signature output names in it. It would be designed to have it be easy to construct with only a few lines of Python code, and we can provide some simple examples for us and others.

SEPARATELY, we would also add a --param-file or -P option (name is provisional :) to sourmash sketch commands that would take in the kind of TOML file that @luizirber roughs out above.

So that way we'd have params cross-product sources.

Finally, a separate piece of functionality would then be to enable sourmash sketch commands to take a list of databases (? probably better than manifests or picklists... but of course would use manifests/picklists underneath) that it would check against the names proposed in the fromfile above.

ctb commented 2 years ago

ugh, running into this YET AGAIN in some work on sketching PFAM. Need/could use a standard way to check which files have not been sketched yet.

🎶 motivation 🎶

ctb commented 2 years ago

getting started with the construction side of the input CSV file here: https://github.com/ctb/2022-sourmash-sketchfrom

ctb commented 2 years ago

This works and is not terribly hacky:

% ./fasta-to-source-list.py ncbi-assemblies/* -o xyz.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing genome 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
processing genome 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
---
wrote 1 entries to 'xyz.csv'

and it produces xyz.csv which looks like this:

ident,name,genome_filename,protein_filename
GCF_000018865,s__Chloroflexus aurantiacus,ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz,ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz

and (crucially) has correctly associated the protein faa.gz with the nucleotide fna.gz, based on the accession in the filename.

ctb commented 2 years ago

Do we want to enable something separate like sig append that could append sigfiles (any format) to an existing zipfile without rewriting all existing sigs via sig cat? Seems like it might be useful in a few cases. I think this is a separate issue though :)

as a side note - you can already do this with sourmash sig cat <new sigs> -o <existing zip file>.zip. No need for a sig append. This is not obvious or consistent behavior for -o, unfortunately - -o .sig will overwrite the existing file, and -o dirname/ will append...

ctb commented 2 years ago

Side thought: if we have a list of FASTA DNA and protein sequences together with names, we might need to have two such files if we build different names for GTDB and NCBI taxonomies.

I think a better solution will be to provide a separate mass-renaming function, perhaps via sig rename, which could also solve some other problems. Will think on it more.

ctb commented 2 years ago

renaming idea in: https://github.com/sourmash-bio/sourmash/issues/1883

ctb commented 2 years ago

progress!!

NOTE: uses https://github.com/ctb/2022-sourmash-sketchfrom, requires code in https://github.com/sourmash-bio/sourmash/pull/1884

round 0: build a CSV file with source genome/protein information

Starting from a directory ncbi-assemblies/, which contains:

% ls ncbi-assemblies/
GCF_000018865.1_ASM1886v1_genomic.fna.gz
GCF_000018865.1_ASM1886v1_protein.faa.gz

we construct a CSV file that automagically builds GTDB names:

% ./fasta-to-source-list.py ncbi-assemblies/* -o xyz.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
(new record for name 'GCF_000018865 s__Chloroflexus aurantiacus')
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz.csv'

The resulting file contains ident, name, and source files for sourmash sketch to use:

ident,full_ident,name,genome_filename,protein_filename
GCF_000018865,GCF_000018865.1,GCF_000018865 s__Chloroflexus aurantiacus,ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz,ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz

round 1: sketch some stuff

Now build the sketches:

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -o round1.zip
...
** Of 3 total requested in cross-product, skipped 0, built 3)

results :tada:

% sourmash sig summarize round1.zip 

** loading from 'round1.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/2022-sourmash-sketchfrom/round1.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 18468
summary of sketches:
   1 sketches with protein, k=10, scaled=200          6774 total hashes
   1 sketches with protein, k=9, scaled=200           6719 total hashes
   1 sketches with DNA, k=21, scaled=1000             4975 total hashes

round 2: try sketching the same stuff

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -o round2.zip --already-done round1.zip
Loaded 1 pre-existing names from manifest(s)
** Of 3 total requested in cross-product, skipped 3, built 0)

no soup for you!

round 3: try sketching the same stuff and some new stuff

Add -p k=31,k=51,dna:

% ./sketch_from.py xyz.csv -p k=10,protein -p k=9,protein -p k=21,dna -p k=31,k=51,dna -o round3.zip --already-done round1.zip
...
** Of 5 total requested in cross-product, skipped 3, built 2)

and voila, only the new stuff is there:

% sourmash sig summarize round3.zip

...

** loading from 'round3.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/2022-sourmash-sketchfrom/round3.zip
is database? yes
has manifest? yes
num signatures: 2
** examining manifest...
total hashes: 10210
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000             5050 total hashes
   1 sketches with DNA, k=51, scaled=1000             5160 total hashes
ctb commented 2 years ago

round 4 -

construct names from GTDB taxonomy:

% ./fasta-to-source-list.py ncbi-assemblies2/* -o xyz2.csv --names-from gtdb-rs202.taxonomy.v2.db --ident-from-filename
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz'
(new record for name 'GCA_903797575 s__Salmonella enterica')
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz2.csv'
% csvtk cut -f name xyz2.csv
name
GCA_903797575 s__Salmonella enterica

construct names from GenBank taxonomy:

% ./fasta-to-source-list.py ncbi-assemblies2/* -o xyz3.csv --names-from all_genbank_lineages.20200727.db --ident-from-filename
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz'
(new record for name 'GCA_903797575 Salmonella enterica')
processing file 'ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz3.csv'
% csvtk cut -f name xyz3.csv
name
GCA_903797575 Salmonella enterica

and these files can be used to build signatures with different names:

GTDB:

% ./sketch_from.py xyz2.csv -p k=31,dna -p k=10,protein -o xyz2.zip
Loaded 0 pre-existing names from manifest(s)
** Building 2 sketches for 2 files
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
calculated 1 signatures for 71 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
calculated 1 signatures for 4382 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
saved 2 signature(s) to 'xyz2.zip'. Note: signature license is CC0.
** Of 2 total requested in cross-product, skipped 0, built 2)
% sourmash sig describe xyz2.zip

signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz2.zip
signature: GCA_903797575 s__Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
md5: 2956485d23a2b6ba353b13e7ca5efeec
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4797
sum hashes: 4797
signature license: CC0

---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz2.zip
signature: GCA_903797575 s__Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
md5: 5589e961dd83929032ed1edbf96dc1d3
k=10 molecule=protein num=0 scaled=200 seed=42 track_abundance=0
size: 6460
sum hashes: 6460
signature license: CC0

loaded 2 signatures total, from 1 files

GenBank:

./sketch_from.py xyz3.csv -p k=31,dna -p k=10,protein -o xyz3.zip
Loaded 0 pre-existing names from manifest(s)
** Building 2 sketches for 2 files
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
calculated 1 signatures for 71 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
... reading sequences from ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
calculated 1 signatures for 4382 sequences in ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
saved 2 signature(s) to 'xyz3.zip'. Note: signature license is CC0.
** Of 2 total requested in cross-product, skipped 0, built 2)
% sourmash sig describe xyz3.zip
...
---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz3.zip
signature: GCA_903797575 Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz
md5: 2956485d23a2b6ba353b13e7ca5efeec
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4797
sum hashes: 4797
signature license: CC0

---
signature filename: /Users/t/dev/2022-sourmash-sketchfrom/xyz3.zip
signature: GCA_903797575 Salmonella enterica
source file: ncbi-assemblies2/GCA_903797575.1_PARATYPHIC668_protein.faa.gz
md5: 5589e961dd83929032ed1edbf96dc1d3
k=10 molecule=protein num=0 scaled=200 seed=42 track_abundance=0
size: 6460
sum hashes: 6460
signature license: CC0

loaded 2 signatures total, from 1 files
ctb commented 2 years ago

notes to self, collection/summary of the above comments and other thoughts

ctb commented 2 years ago

Implemented a draft sourmash sketch fromfile in https://github.com/sourmash-bio/sourmash/pull/1885

ctb commented 2 years ago

hmm, do we have an output format for signatures that we can write to from multiple different processes? definitely not .sig or .zip. maybe directory output? is there any possible contention there? maybe sqldb.

ctb commented 2 years ago

further note to self in re mass renaming: I suspect we could also match on SourmashSignature.filename in the sketchfrom codebase, i.e. infer that sketches that have the matching filename are probably the right thing, and then do the appropriate renaming.

ctb commented 2 years ago

hmm, do we have an output format for signatures that we can write to from multiple different processes? definitely not .sig or .zip. maybe directory output? is there any possible contention there? maybe sqldb.

we don't have anything. An effective if slightly ugly short-term solution here would be to use something like tempfile.NamedTemporaryFile to create a differently named output file for each thing.

OR, duh, something that we can totally do in this case because we control everything, is just construct our own set of uniquely named files.

Regardless of which solution we use, we will still be stuck reading from them to aggregate signatures for downstream purposes, rather than just building the ultimate end database in one glorious command. O well.

ctb commented 2 years ago

ok, so this is all lovely, but I think that I'm missing one of the key use cases we had in mind in the beginning: what about situations where you have a list of accessions (e.g. a listing of GenBank, or a GTDB release), and you want to build a comprehensive database, or check that you have the right set of genomes to build signatures for those accessions?

ctb commented 2 years ago

do we want to do anything about translate? not currently supported by plans.

ctb commented 2 years ago

latest update - bulk bulk bulk!

(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258407 entries from 'ntpierce-genomic.txt'
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processing file 'GCA_900543455.1_protein.faa.gz' (10978/516813)
processed 516813 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
all entries had matched genome and protein files!
36642 files had no content (zero size).
1 filenames yielded duplicate identifiers.
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.

and indeed there is one duplicate identifer, tsk tsk 😆

ntpierce-genomic.txt:/home/ntpierce/2021-rank-compare/genbank/genomes/GCF_001060105.1_ASM106010v1_genomic.fna.gz
ntpierce-genomic.txt:/home/ntpierce/2021-rank-compare/genbank/genomes/GCF_001060105.1_genomic.fna.gz
bluegenes commented 2 years ago

and indeed there is one duplicate identifer, tsk tsk 😆

Funny story! The original download of this file was empty - likely some download issue. I think the sketch may actually be empty in my gtdb databases, because we currently ignore empty files. @taylorreiter noticed it while running gtdb charcoal, so I manually re-downloaded the file and transferred, but initially forgot to change its name, as I was doing in the automated workflow. When she said the file was still empty, I retransferred and renamed, but forgot to check for the old one.

tl;dr - this file is here because we ignore empty files when sketching, and we should definitely do something differently ;) ref: https://github.com/sourmash-bio/sourmash/issues/1887

...and I shall go remove it now 💨

bluegenes commented 2 years ago

do we want to do anything about translate? not currently supported by plans.

I would like if we could support translate at some point, but two thoughts:

  1. we won't want to use translate for reference genome databases (aka our immediate use case; though I very much do want to use it on large-scale metagenome samples and/or MAG queries)
  2. Right now it's prohibitively slow on large datasets

Other translate thoughts: We could have a flag, --translate or --allow-translation that allows protein sketching to be done on DNA input files, if the matching protein file does not exist? Would only trigger when protein params are passed...

ctb commented 2 years ago

Better output - I fixed some reporting (but forgot to fix the duplicate entry, will do that now).

(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258407 entries from 'ntpierce-genomic.txt'
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processing file 'GCF_003798365.1_protein.faa.gz' (2046/516813)
processed 516813 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
36631 entries had only protein (and no genome) files.
11 entries had only genome (and no protein) files.
(missing files do not cause error exit without --strict)
36642 files had no content (zero size).
1 filenames yielded duplicate identifiers.
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.
bluegenes commented 2 years ago

36642 files had no content (zero size).

Most of these are entries that had no *faa.gz protein files (so I ran prodigal on the genomes instead). Forgot to give you the location for the prodigal proteomes: /home/ntpierce/2021-rank-compare/genbank/prodigal/.

Probably more useful, I have a file with ident,prodigal_filepath for all relevant proteomes here: /home/ntpierce/2021-rank-compare/gtdb-rs202.prodigal-filenames.csv (n=36,631).

ctb commented 2 years ago

my mighty script should be able to figure it all out, given only the path (and an appropriately named set of files!) I will let it loose! 🦁

bluegenes commented 2 years ago

my mighty script should be able to figure it all out, given only the path (and an appropriately named set of files!) I will let it loose! 🦁

same naming convention, just a different location. Let er rip! 🦖

ctb commented 2 years ago

we're down to the following errors - after eliminating zero size files, we have:

duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_000739575.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_003543635.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_002302865.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_009876525.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_013296485.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_000435395.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_900766585.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCA_002103955.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_000025685.1_protein.faa.gz
duplicate protein filename: /home/ntpierce/2021-rank-compare/genbank/proteomes/GCF_900759255.1_protein.faa.gz
missing genome file: GCF_002124775
missing genome file: GCF_005819405
missing genome file: GCA_009838785
missing genome file: GCF_000699565
missing genome file: GCF_002244155
missing genome file: GCF_900768585
missing genome file: GCF_000568735
missing genome file: GCF_000261045
missing genome file: GCF_001659225
missing genome file: GCF_003572455
missing genome file: GCF_001877985

the script is getting pretty good at this :). I haven't layered on the official list of identifiers yet, I'll try that.

ctb commented 2 years ago
(sourmash-dev) ctbrown@farm:~/scratch/fromfile$ 2022-sourmash-sketchfrom/genbank-to-fromfile.py -F ntpierce-prodigal.txt -F ntpierce-proteomes.txt -F ntpierce-genomic.txt -t gtdb-rs202.taxonomy.v2.db  -o gtdb-rs202-entire.csv --picklist gtdb-rs202.taxonomy.v2.db.ident.csv:ident:identprefix
Loaded 36641 entries from 'ntpierce-prodigal.txt'
Loaded 258406 entries from 'ntpierce-proteomes.txt'
Loaded 258406 entries from 'ntpierce-genomic.txt'
picking column 'ident' of type 'identprefix' from 'gtdb-rs202.taxonomy.v2.db.ident.csv'
loaded 258406 distinct values into picklist.
Any survivable errors will be reported to 'gtdb-rs202-entire.csv.error-report.txt'
processed 553453 files.
---
wrote 258406 entries to 'gtdb-rs202-entire.csv'
11 entries had only genome (and no protein) files.
(missing files do not cause error exit without --strict)
36642 files had no content (zero size).
10 filenames yielded duplicate identifiers.
for given picklist, found 258406 matches to 258406 distinct values
** Errors were encountered ;(. See details in 'gtdb-rs202-entire.csv.error-report.txt'.

key new message: for given picklist, found 258406 matches to 258406 distinct values

so except for the 11 entries with only genome (and no protein) files, we could now go forth and build 🎉

ctb commented 2 years ago

daaaaaaaamn

% sourmash sketch fromfile gtdb-rs202-entire.csv -p protein

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** 258406 total requested; built 258406, skipped 0

took about 3 seconds to run. (It just checked to see how many to build.)

ctb commented 2 years ago

interesting ~new challenge to consider: I'm looking at /group/ctbrowngrp/irber/data/wort-data/wort-genomes and /group/ctbrowngrp/irber/data/ncbi/genbank and thinking about how to properly deal with a pile of unindexed sigs.

this might be an opportunity to revisit the manifest-of-manifests thing, https://github.com/sourmash-bio/sourmash/issues/1685. The basic idea:

ctb commented 2 years ago

Confronted the horror of too many interconnected things, went for a nice hike, realized that maybe if we just could load manifest CSV files directly on the command line, we could build tooling around creating/maintaining such manifests for large wort collections without adding a whole lot of complexity to sourmash.

See https://github.com/sourmash-bio/sourmash/pull/1891 for code.

ctb commented 2 years ago

wort-genbank, all sigs, in a single manifest, using https://github.com/sourmash-bio/sourmash/pull/1891:

(sourmash-dev) ctbrown@c6-50:~/scratch/fromfile/split.irber-genome-sigs$ sourmash sig summarize entire2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'entire2.csv'

path filetype: StandaloneManifestIndex
location: entire2.csv
is database? no
has manifest? yes
num signatures: 4697901
** examining manifest...
total hashes: 55022206752
summary of sketches:
   1565967 sketches with DNA, k=21, scaled=1000, abund 17260335263 total hashes
   1565967 sketches with DNA, k=31, scaled=1000, abund 18364942000 total hashes
   1565967 sketches with DNA, k=51, scaled=1000, abund 19396929489 total hashes

Per /usr/bin/time -v, took about 1 minute, and ~6 GB of RAM:

        User time (seconds): 56.74
        System time (seconds): 3.74
        Percent of CPU this job got: 94%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03.78
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 6238316
ctb commented 2 years ago

ok, to recap:

there's some support code in https://github.com/ctb/2022-sourmash-sketchfrom that we don't need to integrate into sourmash (or may never need) -

Things we could maybe use at this point, beyond making those experimental PRs mergable:

Things we are punting on, for now; we'll need to make issues for these:

ctb commented 2 years ago

finally realizing that @bluegenes knew what she was talking about all along in #1365 and now https://github.com/sourmash-bio/sourmash/issues/1902, I implemented sig check to check picklists against existing databases in #1891. Combined with the direct manifest loading stuff, it's pretty nice!!

Here we are checking ~100 identifiers against a manifest for 4.7 million wort genbank sigs and pulling out the matching manifest entries:

% sourmash sig check --picklist bacteria.ident:ident:ident split.irber-genome-sigs/entire2.csv -o 1.csv --save-manifest 2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'bacteria.ident'
loaded 98 distinct values into picklist.
notify done
loaded 294 total signatures that matched ksize & molecule type
for given picklist, found 98 matches to 98 distinct values
wrote 294 matching manifest rows to '2.csv'

...which can then be directly summarized, searched, loaded, etc:

sourmash sig summarize 2.csv

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '2.csv'
path filetype: StandaloneManifestIndex
location: 2.csv
is database? yes
has manifest? yes
num signatures: 294
** examining manifest...
total hashes: 1752580
summary of sketches:
   98 sketches with DNA, k=21, scaled=1000, abund     581638 total hashes
   98 sketches with DNA, k=31, scaled=1000, abund     583825 total hashes
   98 sketches with DNA, k=51, scaled=1000, abund     587117 total hashes

(If there had been picklist values that weren't matched, they would have ended up in 1.csv)

ctb commented 2 years ago

building genbank stuff from assembly reports for archaea

Using code from https://github.com/sourmash-bio/sourmash/pull/1885,

download assembly_summary

curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt

construct identifier picklist

echo ident > archaea.assembly_summary.idents.csv
cut -f 1 assembly_summary.txt | grep -v ^# >> archaea.assembly_summary.idents.csv

run sig check

% sourmash sig check --picklist archaea.assembly_summary.idents.csv:ident:ident ../split.irber-genome-sigs/entire2.csv -o archaea.assembly_summary.missing.csv --save-manifest archaea.manifest.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'archaea.assembly_summary.idents.csv'
loaded 1210 distinct values into picklist.
notify done3
loaded 3627 signatures.
for given picklist, found 1209 matches to 1210 distinct values
WARNING: 1 missing picklist values.
saved 1 non-matching rows of 1210 picklist rows to 'archaea.assembly_summary.missing.csv'
wrote 3627 matching manifest rows to 'archaea.manifest.csv'

oh no, one is missing!

% more archaea.assembly_summary.missing.csv
ident
GCF_022489005.1

...but the rest are there:

% sourmash sig summarize archaea.manifest.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'archaea.manifest.csv'
path filetype: StandaloneManifestIndex
location: archaea.manifest.csv
is database? yes
has manifest? yes
num signatures: 3627
** examining manifest...
total hashes: 10739411
summary of sketches:
   1209 sketches with DNA, k=21, scaled=1000, abund   3564584 total hashes
   1209 sketches with DNA, k=31, scaled=1000, abund   3581871 total hashes
   1209 sketches with DNA, k=51, scaled=1000, abund   3592956 total hashes
ctb commented 2 years ago

building gtdb genomic for rs207

Looks like there might be a new GTDB release coming? See rs207 directory.

download files with identifiers

mkdir gtdb
cd gtdb
curl -L -O https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/bac120_taxonomy_r207.tsv.tsv.tar.gz
curl -L -O https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/ar53_taxonomy_r207.tsv.tar.gz

tar xzf ar53_taxonomy_r207.tsv.tar.gz 
tar xzf bac120_taxonomy_r207.tsv.tsv.tar.gz 
echo ident > gtdb-all-rs207.ident.csv
cut -f 1 *.tsv | cut -c 4- >> gtdb-all-rs207.ident.csv
wc gtdb-all-rs207.ident.csv

see how many signatures we can retrieve from wort

then run sig check:

% sourmash sig check --picklist gtdb-all-rs207.ident.csv:ident:ident ../split.irber-genome-sigs/entire2.csv -o missing-idents.csv --save-manifest matching-sigs.csv

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'gtdb-all-rs207.ident.csv'
loaded 317542 distinct values into picklist.
loaded 952626 signatures.
for given picklist, found 317542 matches to 317542 distinct values
(no remaining picklist entries; not saving to 'missing-idents.csv')
wrote 952626 matching manifest rows to 'matching-sigs.csv'

:tada: no missing signatures! wort rulez!

examine output manifest for ksizes etc.

Then examine the output manifest matching-sigs.csv:

% sourmash sig summarize matching-sigs.csv 

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'matching-sigs.csv'
path filetype: StandaloneManifestIndex
location: matching-sigs.csv
is database? yes
has manifest? yes
num signatures: 952626
** examining manifest...
total hashes: 3541080185
summary of sketches:
   317542 sketches with DNA, k=21, scaled=1000, abund 1180828455 total hashes
   317542 sketches with DNA, k=31, scaled=1000, abund 1178655672 total hashes
   317542 sketches with DNA, k=51, scaled=1000, abund 1181596058 total hashes

construct new GTDB release

for k=31:

% sourmash sig cat matching-sigs.csv -o gtdb-rs207.genomic.dna.k31.zip -k 31
ctb commented 2 years ago
(sourmash-dev) ctbrown@c4-68:~/scratch/fromfile/gtdb$ ls -lh *zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 17:04 gtdb-rs207.genomic.dna.k21.zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 14:40 gtdb-rs207.genomic.dna.k31.zip
-rw-rw-r-- 1 ctbrown ctbrowngrp 9.4G Mar 27 19:02 gtdb-rs207.genomic.dna.k51.zip
ctb commented 2 years ago

some documentation and examples for the fromfile stuff has been added here: https://github.com/ctb/2022-sourmash-sketchfrom/

ctb commented 2 years ago

some documentation and examples for the fromfile stuff has been added here: https://github.com/ctb/2022-sourmash-sketchfrom/

renamed to sourmash-bio/database-examples!

And I'm closing this issue now that https://github.com/sourmash-bio/sourmash/pull/1885 has been merged. 🎉