sourmash-bio / branchwater

Searching large collections of sequencing data with genome-scale queries
https://branchwater.sourmash.bio

Build a k=31 SRA metagenomes index #24

Open luizirber opened 2 months ago

luizirber commented 2 months ago

I'll use this issue to document steps to build a k=31,scaled=1000 index for SRA metagenomes. This is the same process used for the current k=21,scaled=1000 index in branchwater.sourmash.bio, but taking into account the changes from #4 and bringing in new SRA datasets added after the cutoff of the current index (2023-08-17).

luizirber commented 2 months ago

The index is built from pre-calculated signatures generated by https://wort.sourmash.bio, so the first step is to pull all the signatures that are currently available[^1].

wort signatures are stored in an S3 bucket; syncing them to a local dir can be done with s5cmd:

s5cmd --stat sync --size-only 's3://wort-sra/*' wort-sra/

This was executed on a local dir (wort-sra/) that already had a partial copy[^2] of the bucket.

Operation       Total   Error   Success
cp              63524   0       63524
sync            1       0       1

This took 6h 17m 36s, mostly because listing the contents of an S3 bucket is terribly slow. We should probably use the wort DB as the source of truth for which sigs are ready, instead of listing the S3 bucket... This will be easier after https://github.com/sourmash-bio/wort/issues/69

[^1]: could also get the runinfo first and download only a subset, like the example dataset does.

[^2]: there are 4,207,322 SRA datasets in wort-sra/
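
A quick sanity check on that local count, assuming the bucket layout uses a sigs/ prefix (as seen in later comments in this issue):

$ find wort-sra/sigs -type f -name '*.sig' | wc -l

which should land in the ballpark of the 4,207,322 datasets mentioned in the footnote above.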

luizirber commented 2 months ago

Select the metagenomes subset of the SRA datasets

The daily query for metagenomes in wort looks like this:

("{day_to_check}"[PDAT] : "{end_date}"[PDAT]) NOT amplicon[All Fields] AND "METAGENOMIC"[Source]

In order to grab all metagenomes up to 2024-04-20, the query is[^2]

("1950/01/01"[PDAT] : "2024/04/20"[PDAT]) NOT amplicon[All Fields] AND "METAGENOMIC"[Source]

[^2]: did you know they did metagenome sequencing in the 50s? Totally true.

There are a couple of ways to execute this query; the easiest one is to go to https://www.ncbi.nlm.nih.gov/sra/, paste the query, and click "Search". In the results page, click "Send to", choose "Destination: File" and "Format: Accession List", and then click "Create File".

[screenshot: NCBI SRA results page, saving the accession list via the "Send to" menu]

wc -l on this file gives us 1,177,012 potential metagenomes.

Next we will use this file to figure out which wort signatures we have and calculate a manifest for them.


Clicking around in a browser is not very satisfying, because it is hard to automate. In this case the download also goes much faster than with entrez-direct. For completeness, here is how to use entrez-direct to get the same file.

All commands in this section were executed on an Ubuntu Linux container started with:

docker run -v $(pwd):/data -it ubuntu /bin/bash

Install dependencies for installing entrez-direct:

apt update
apt install curl perl

Install entrez-direct (official instructions):

sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
export PATH=${HOME}/edirect:${PATH}

Recommended: set up an NCBI API key to get higher rate limits

export NCBI_API_KEY=*****

Finally, create the accession list:

esearch -email **** -db sra -query '("1950/01/01"[PDAT] : "2024/04/20"[PDAT]) NOT amplicon[All Fields] AND "METAGENOMIC"[Source]' | efetch -format acclist -mode text > 2024-04-20.acclist

You may notice that this is only the accession list, and it misses a lot of useful metadata. Previous versions of this index were calculated from a Run Info file, which has way more metadata but is slower to download. This time, entrez-direct started failing once the Run Info file reached ~800k entries[^1], so I went with the accession list instead, especially because the metadata for branchwater is pulled from BigQuery and stored in MongoDB in a separate process/step.

For smaller subsets, you can get a runinfo file by using the -format runinfo flag for efetch in the above command, or by selecting "RunInfo" in the format field when saving results from the Web browser.
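
For example, a runinfo file for just the new datasets could be pulled with the same pipeline, swapping the format flag and narrowing the date range (the range below is illustrative, starting at the previous index cutoff):

esearch -email **** -db sra -query '("2023/08/17"[PDAT] : "2024/04/20"[PDAT]) NOT amplicon[All Fields] AND "METAGENOMIC"[Source]' | efetch -format runinfo -mode text > 2024-04-20.runinfo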

[^1]: likely because the WebEnv expired after a couple of hours?


Finally, a better solution is to use BigQuery/Athena to run the query, and also grab the metadata associated with the accessions. But that potentially has extra costs if it goes over the free-tier limits for GCP/AWS. So :shrug:
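
As a rough sketch of what that could look like (assuming the public nih-sra-datastore.sra.metadata table on BigQuery and its acc/librarysource/assay_type/releasedate columns; the assay_type filter only approximates the NOT amplicon[All Fields] clause, and none of this is verified here):

bq query --use_legacy_sql=false --format=csv --max_rows=2000000 '
  SELECT acc
  FROM `nih-sra-datastore.sra.metadata`
  WHERE librarysource = "METAGENOMIC"
    AND assay_type != "AMPLICON"
    AND releasedate <= TIMESTAMP("2024-04-20")
' > 2024-04-20.acclist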

luizirber commented 2 months ago

Creating a catalog for wort metagenomes in the accession list

In order to build an index we need a sourmash manifest describing the signatures. But before that we need to figure out which SRA metagenomes from the accession list are present in wort. So let's calculate the intersection between the accession list and the local copy of wort from the first step.

This is a modified version of the snakemake rule from the previous time the index was built, and it will be in the PR associated with this issue:

rule catalog_metagenomes:
    output:
        catalog="outputs/20240420-metagenomes-catalog",
    input:
        acclist="inputs/20240420.acclist",
        basepath="/data/wort/wort-sra/"
    run:
        import csv
        from pathlib import Path

        # load all sra IDs
        sraids = set()
        with open(input.acclist) as fp:
            data = csv.DictReader(fp, delimiter=",")
            for dataset in data:
                if dataset['acc'] != 'acc':
                    sraids.add(dataset['acc'])

        path = Path(input.basepath)
        with open(output.catalog, 'w') as out:
            # check if sraids exist on disk
            for sra_id in sraids:
                sig_path = path / "sigs" / f"{sra_id}.sig"
                if sig_path.exists():
                    out.write(f"{sig_path}\n")
                    out.flush()
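
As with the manifest rule further down, this can be triggered by asking snakemake for the output file, something like:

$ snakemake -p -j1 outputs/20240420-metagenomes-catalog

(the core count barely matters here, since the rule is a single Python job).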

The outputs/20240420-metagenomes-catalog file has 1,061,158 entries, so we will be above 1M metagenomes this time =] Because this operates on HDDs, it took 2h 43m 28s to check for existence of these files.


I still have the catalog used last time, so we can make some comparisons!

There are 134,234 new datasets, compared to the previous index.

$ comm -13 <(sort -u outputs/metagenomes-catalog) <(sort -u outputs/20240420-metagenomes-catalog)| wc -l
134234

And there are 3,300 datasets from the previous index that are NOT in the new catalog! :exploding_head:

$ comm -23 <(sort -u outputs/metagenomes-catalog) <(sort -u outputs/20240420-metagenomes-catalog)| wc -l
3300

This is something we observed before: metadata can change, and misclassifications (metagenome -> metatranscriptome, for example) and corrections will exclude datasets that were in the index from new versions. This also includes retracted submissions[^1]. Examples:

and so on. I feel tempted to bring the old ones along to avoid disruption for people that ran searches previously, and if you got an amplicon result it is likely real (even though we didn't validate 16S with scaled=1000).

[^1]: in wort the runinfo is downloaded daily, and saved to disk but not published anywhere. Retracted submissions are usually not available in metadata searches, but are still accessible by accession. This means that 1) wort has signatures for retracted datasets, and 2) there are potentially many more accessions that can be back-harvested from these daily files. What is the right direction to take? :shrug:

[^2]: should we keep amplicon in the future? When we first added metagenomes to wort we decided to exclude 16S, which is why we use NOT amplicon[All Fields] in the query. But they might be useful anyway, so better to add them? This would raise the number of datasets in the index to 5,404,234 :upside_down_face:

luizirber commented 2 months ago

Side-quest: cleaning up empty sigs in wort

With the catalog ready, next step is calculating a manifest.

But as soon as I started calculating the manifest, there were errors due to sourmash not being able to load some signatures. :thinking:

This led to a side quest to investigate what was wrong with these sigs. Turns out they are empty:

$ zcat /data/wort/wort-sra/sigs/ERR12669542.sig

$ ls -lah /data/wort/wort-sra/sigs/ERR12669542.sig
-rw-r--r-- 1 xyz xyz 20 Mar  2 22:05 /data/wort/wort-sra/sigs/ERR12669542.sig

Wanted to check how many of those exist, and ran the following command[^1]:

$ find /data/wort/wort-sra/ -size -21c -type f > inputs/empty_wort

$ wc -l inputs/empty_wort
14190 inputs/empty_wort

yikes.

I checked a couple in S3 with

$ s5cmd ls 's3://wort-sra/sigs/ERR12669542.sig'
2024/02/25 20:03:09                20  sigs/ERR12669542.sig

and they are indeed empty.

Double yikes!

A defensive measure is to check for file size when compressing sigs in wort and only upload if they are not empty.
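
One way to drop them from the catalog before building the manifest (a sketch; it assumes the paths produced by find match the catalog lines exactly, and the .noempty filename is arbitrary):

$ grep -vxFf inputs/empty_wort outputs/20240420-metagenomes-catalog > outputs/20240420-metagenomes-catalog.noempty
$ mv outputs/20240420-metagenomes-catalog.noempty outputs/20240420-metagenomes-catalog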

For now I excluded them from the catalog, so manifest calculation can proceed. Other follow-up tasks:

[^1]: took 2h 59m 24s. Cue comment to use a sqlite DB for these operations...

luizirber commented 2 months ago

Building a manifest

branchwater indices are built from a sourmash manifest [^1] containing some metadata from signatures. They usually look like this:

$ cat manifest
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
sigs/DRR000231.sig,c8afffa8d926eb4975a957783d8cf562,c8afffa8,21,dna,0,1000,74703,1,DRR000231,-
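
For spot-checking the format on a handful of signatures, sourmash itself can generate such a manifest (the output name here is arbitrary); at the scale of the full catalog, the Rust crate below is the way to go:

$ sourmash sig manifest sigs/DRR000231.sig -o single.manifest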

The Rust crate under crates/index can build a manifest from a catalog; here is a snakemake rule to do that:

rule manifest_from_catalog_full:
    output:
        manifest="outputs/20240420-metagenomes-s1000.manifest",
    input:
        catalog="outputs/20240420-metagenomes-catalog",
        removed="inputs/20240420.removed",
    threads: 24
    shell: """
        export RAYON_NUM_THREADS={threads}
        RUST_LOG=info {EXEC} manifest \
            --output {output} \
            --basepath /data/wort/wort-sra \
            <(cat {input.catalog} {input.removed})
    """

Which can be run with

$ snakemake -p -j24 outputs/20240420-metagenomes-s1000.manifest

and generates this command for execution:

export RAYON_NUM_THREADS=24
RUST_LOG=info cargo run -p branchwater-index --release --  manifest \
            --output outputs/20240420-metagenomes-s1000.manifest \
            --basepath /data/wort/wort-sra \
            <(cat outputs/20240420-metagenomes-catalog inputs/20240420.removed)

This took 29h 24m 58s to go through all the SRA metagenome signatures and save a "full" manifest (k={21,31,51}).
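
Since this index targets k=31, the full manifest can later be narrowed down to just the k=31 rows if needed; a minimal sketch based on the column layout shown above (ksize is the 4th field, and the output filename is arbitrary):

$ awk -F',' 'NR <= 2 || $4 == 31' outputs/20240420-metagenomes-s1000.manifest > outputs/20240420-metagenomes-s1000-k31.manifest

where NR <= 2 keeps the version line and the CSV header.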


Two things to note above:

[^1]: which in this case is a CSV file, but could also be a sqlite database. More info