Closed tseemann closed 4 years ago
It would be awesome to get these assemblies!
I'm in contact with our SRA gurus, and I will get back to you soon.
The assemblies are publicly available for practically all Illumina samples of the above bacteria. It will improve in the future, but at this point the process of downloading is not particularly user friendly.
For accessing SKESA assembly in SRA for a single run, say SRR498276, one should first download file
http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign
The command line to extract fasta for assembly from above object is
dump-ref-fasta SRR498276_SRR498276.realign > SRR498276_skesa.fa
Source code and pre-compiled binaries for the SRA toolkit that contain dump-ref-fasta can be downloaded from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Those who want to download multiple assemblies could use a script. For example, the following script will download all available Salmonellae
#! /bin/sh
ORGN=salmonella
TOOLKIT= # / terminated path to toolkit or nothing if in the path
ACACHE= # / terminated path to where fasta assemblies should go. if nothing - dump locally.
# esearch and efetch are ncbi scripts around eutils. https://www.ncbi.nlm.nih.gov/books/NBK179288/
esearch -db sra -query "${ORGN}[orgn]" | efetch -db sra -format acclist |
while read acc ; do
rp=$( ${TOOLKIT}srapath -f names -r ${acc}.realign | awk '-F|' 'NF>8 && $(NF-1)==200 { print $8;}' ) ;
[ "$rp" = "" ] && continue; # skip unassembled runs.
${TOOLKIT}dump-ref-fasta "$rp" >${ACACHE}${acc}.assm.fasta;
done
To make this script work one has to download SRA toolkit and Entrez Direct scripts.
A few notes:
dump-ref-fasta http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276_skesa.fa
@yaschenk @souvorov @andrewjpage @lskatz
It sounds like SRR498276_SRR498276.realign
is an NCBI-style BAM+REF file?
Ok it seems SRR498276_SRR498276.realign
is 257 MB.
So I need to to download 257 MB to get a 3 MB FASTA reference file? I don't have the bandwidth from Australia for that.
Or can dump-ref-fasta
use random indexed access over HTTP ?
Also, I have sratoolkit 2.9.2
and dump-ref-fasta
is not in it?
The only reference to it in the github is in two shell scripts?
https://github.com/ncbi/sra-tools/search?q=dump-ref-fasta&unscoped_q=dump-ref-fasta
I do have the align-ingo
command:
% align-info SRR498276_SRR498276.realign
Contig_10_631.956,Contig_10_631.956,false,local
Contig_11_99.4693,Contig_11_99.4693,false,local
Contig_12_313.163,Contig_12_313.163,false,local
Contig_13_107.79,Contig_13_107.79,false,local
Contig_14_107.268,Contig_14_107.268,false,local
Contig_15_73.025,Contig_15_73.025,false,local
Contig_16_94.8808,Contig_16_94.8808,false,local
Contig_17_177.089,Contig_17_177.089,false,local
Contig_18_128.216,Contig_18_128.216,false,local
Contig_19_84.4582,Contig_19_84.4582,false,local
Contig_1_72.6479,Contig_1_72.6479,false,local
Contig_20_77.156,Contig_20_77.156,false,local
Contig_21_136.682,Contig_21_136.682,false,local
Contig_22_63.2869,Contig_22_63.2869,false,local
Contig_23_70.7443,Contig_23_70.7443,false,local
Contig_24_599.922,Contig_24_599.922,false,local
Contig_25_104.394,Contig_25_104.394,false,local
Contig_26_76.9974,Contig_26_76.9974,false,local
Contig_27_611.956,Contig_27_611.956,false,local
Contig_28_587.411,Contig_28_587.411,false,local
Contig_29_162.703,Contig_29_162.703,false,local
Contig_2_93.1265,Contig_2_93.1265,false,local
Contig_30_67.2612,Contig_30_67.2612,false,local
Contig_31_102.873,Contig_31_102.873,false,local
Contig_32_102.289,Contig_32_102.289,false,local
Contig_33_574.565,Contig_33_574.565,false,local
Contig_34_101.613,Contig_34_101.613,false,local
Contig_35_82.9806,Contig_35_82.9806,false,local
Contig_3_79.4608,Contig_3_79.4608,false,local
Contig_4_105.882,Contig_4_105.882,false,local
Contig_5_68.73,Contig_5_68.73,false,local
Contig_6_137.02,Contig_6_137.02,false,local
Contig_7_97.8996,Contig_7_97.8996,false,local
Contig_8_85.4217,Contig_8_85.4217,false,local
Contig_9_90.6756,Contig_9_90.6756,false,local
I'm also interested in easy ways to download the data. I had a go with the suggested method and it looks like it's only downloading the assembly, but I might be wrong.
I didn't have the SRA toolkit installed so I installed them fresh from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software. I think I found the code for dump-ref-fasta
in https://github.com/ncbi/ngs-tools but I wasn't 100% sure if it downloaded the full .realign
so I tried running it in a docker container.
root@0a3578e0dbc2:/code# time sratoolkit.2.9.2-ubuntu64/bin/dump-ref-fasta http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276_skesa.fa
real 0m7.387s
user 0m0.067s
sys 0m0.058s
root@0a3578e0dbc2:/code# ls -lh
-rw-r--r-- 1 root root 4.7M Oct 20 15:12 SRR498276_skesa.fa
drwxrwxr-x 5 1000 1000 4.0K Jul 23 21:34 sratoolkit.2.9.2-ubuntu64
In another window I ran:
ben@ben-july2017:~/projects/ncbi_assembly$ docker stats --no-stream
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0a3578e0dbc2 0.00% 3.043MiB / 15.55GiB 0.02% 1.58MB / 32.8kB 0B / 0B 1
I think this means that it only downloaded 1.58 MB of data which made 4.7 MB of SRR498276_skesa.fa
The only bit that I'm now not sure about is whether there are any md5 checksums anywhere so that I can check there weren't any issues during the download (or is that also part of dump-ref-fasta
)?
When SRA Tools are used directly on URL, they only download pages necessary to complete the task. dump-ref-fasta only needs contigs, so it only brings parts of SRA file containing 2-bit-per-base-compressed contigs. If you start running sam-dump or fastq-dump then you would see a lot more data downloaded. Read more about download-on-demand feature of SRA Toolkit at: https://github.com/ncbi/sra-tools/wiki/Download-On-Demand
About checksums. SRA format contains a lot of internal checksums to ensure data delivered correctly. All SRA Tools conduct necessary validation and generate error when something goes wrong. It SRA Tool completes the task with return code 0, then everything is correct.
Apologies for inconsistent packaging of dump-ref-fasta You can do the same with a different tools - vdb-dump Use 'vdb-dump -T REFERENCE -f fasta2' in place of 'dump-ref-fasta'
@yaschenk thank you so much! that worked perfectly and did not download the whole .realign
file.
$ time vdb-dump -T REFERENCE -f fasta2 http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276.REFERENCE.fna
real 0m9.888s
user 0m0.057s
sys 0m0.028s
$ echo $?
0
$ fa SRR498276.REFERENCE.fna
no=35 bp=4763249 ok=4763249 Ns=0 gaps=0 min=278 avg=136092 max=645633 N50=313057
$ head -n 4 SRR498276.REFERENCE.fna
>Contig_1_72.6479
AAAAAACGTGGCCTCCAGGACATCCTGATAGCCTGCGTGGACGGTCTTAAGGGCTTCCCGGATGCGATAA
ACAGCGTCTTCCCGCAGACCCATATCCAGCTGTGCATCATCCATATGGTGCGTAACAGCCTGAAATACGT
GTCCTGGAAGGACTACAAAGCCGTCACCAGCGGCTTGAAGACTGTCTATCAGGCCCCGACCAAAGAGGCG
I will need to study more the VDB container format and learn how to list the available containers etc.
Is there a way to do it with just theSRRnnnnnnnn
number and not knowning the URL?
(like fastq-dump
)
Or, how do i configure srapath
to find .realign
VDB files?
Not clear from here: https://github.com/ncbi/ncbi-vdb/wiki/Name-Resolution-Process
All name resolutions used by srapath are now done by SRA name server. If you read the script above in details it is already using srapath but in a bit awkward way. We are working on making SRA name server to resolve realign objects as painless as SRR accessions. This will become available in future releases of SRA Toolkit.
vdb-dump is the tool which you can use to explore SRA format as a list of tables and columns. It is documented here: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=vdb-dump
It may be a bit overwhelming though. For simple programmatic access, we have developed 'ngs' api https://github.com/ncbi/ngs which works in c++, java, and python. The best way to learn it is through examples. If you like python, this is a reference dumper in python: https://github.com/ncbi/ngs/blob/master/ngs-python/examples/DumpReferenceFASTA.py This is java version: https://github.com/ncbi/ngs/blob/master/ngs-java/examples/examples/DumpReferenceFASTA.java etc...
Hi,
I've noticed that most of these techniques no longer work. Perhaps this is because of the name server changes @yaschenk mentioned?
root@caf6559877b0:~# dump-ref-fasta http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276_skesa.fa
2019-02-04T13:43:48 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:43:48 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:43:49 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
Cannot open accession 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign', rc = RC(rcFS,rcDirectory,rcOpening,rcPath,rcInvalid)
I then tried changing the URL to https:
root@196911695bfd:~# dump-ref-fasta https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276_skesa.fa
2019-02-04T13:53:34 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:53:34 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:53:35 dump-ref-fasta.2.9.0 err: path invalid while opening directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
Cannot open accession 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign', rc = RC(rcFS,rcDirectory,rcOpening,rcPath,rcInvalid)
root@caf6559877b0:~# vdb-dump -T REFERENCE -f fasta2 http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276.REFERENCE.fna
2019-02-04T13:44:51 vdb-dump.2.9.2 err: path invalid while opening directory within file system module - error with https open 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:44:51 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - the path 'http://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign' cannot be opened as vdb-database or vdb-table
2019-02-04T13:44:51 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - Maybe it is a legacy table. If so, specify a schema with the -S option
root@196911695bfd:~# vdb-dump -T REFERENCE -f fasta2 https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276.REFERENCE.fna
2019-02-04T13:55:41 vdb-dump.2.9.2 err: path invalid while opening directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign'
2019-02-04T13:55:41 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - the path 'https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign' cannot be opened as vdb-database or vdb-table
2019-02-04T13:55:41 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - Maybe it is a legacy table. If so, specify a schema with the -S option
I thought the URL structure might have changed so I tried:
root@196911695bfd:~# srapath SRR498276
https://sra-download.ncbi.nlm.nih.gov/traces/sra5/SRR/000486/SRR498276
root@caf6559877b0:~# vdb-dump -T REFERENCE -f fasta2 $(srapath SRR498276) > SRR498276.REFERENCE.fna
2019-02-04T13:45:55 vdb-dump.2.9.2 err: path invalid while opening directory within file system module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra5/SRR/000486/SRR498276'
2019-02-04T13:45:55 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - the path 'https://sra-download.ncbi.nlm.nih.gov/traces/sra5/SRR/000486/SRR498276' cannot be opened as vdb-database or vdb-table
2019-02-04T13:45:55 vdb-dump.2.9.2 int: item not found while constructing within virtual database module - Maybe it is a legacy table. If so, specify a schema with the -S option
It looks like it's possible to download the whole file and extract the reference
root@196911695bfd:~# wget https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign
--2019-02-04 13:54:27-- https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign
Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 130.14.250.27, 130.14.250.24, 130.14.250.25, ...
Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|130.14.250.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 269713117 (257M) [application/octet-stream]
Saving to: ‘SRR498276_SRR498276.realign’
SRR498276_SRR498276.realign 100%[================================================================================================================>] 257.22M 22.2MB/s in 15s
2019-02-04 13:54:43 (17.6 MB/s) - ‘SRR498276_SRR498276.realign’ saved [269713117/269713117]
root@196911695bfd:~# ls -lh SRR498276_SRR498276.realign
-rw-r--r-- 1 root root 258M Jun 1 2017 SRR498276_SRR498276.realign
root@196911695bfd:~# dump-ref-fasta SRR498276_SRR498276.realign > SRR498276_skesa.fa
root@196911695bfd:~# ls -lh SRR498276_skesa.fa
-rw-r--r-- 1 root root 4.7M Feb 4 13:58 SRR498276_skesa.fa
root@196911695bfd:~# head SRR498276_skesa.fa
>Contig_1_72.6479:1.81815
AAAAAACGTGGCCTCCAGGACATCCTGATAGCCTGCGTGGACGGTCTTAAGGGCTTCCCGGATGCGATAA
ACAGCGTCTTCCCGCAGACCCATATCCAGCTGTGCATCATCCATATGGTGCGTAACAGCCTGAAATACGT
GTCCTGGAAGGACTACAAAGCCGTCACCAGCGGCTTGAAGACTGTCTATCAGGCCCCGACCAAAGAGGCG
GCACTGATGGCGCTCGATGCGTTCGCGGAAGTCTGGGACGATAAATATCCGCAAATCAGCAAAAGCTGGC
GTGCGCACTGGGAAAACCTCAATACGCTCTTCAGTTATCCGCCGGATATCCGCAAGGCCATCTACACCAC
GAACGAAATCGAATCGCTTAACAGTGTGATCCGGGCTGCGATTAAGAAACGCAAAGTGTTCCCGACGGAT
GACTCGGTACGAAAGGTTATTTATCTGGCGATCAAAGGTGCGTCGAAAAAATGGAGTATGCCGATCCAGA
ACTGGCGGTTAGCAATGAGCCGTTTTATTATCGAGTTCGGTGACCGCCTGAGCGATCACCTTTAATACAT
TGGCAGTTACACAGAATTACTGACAGGCTCAAAAAGTGTGGTGGTAAATGGCCTGGCTGTTCAGGAACGA
I think I am using the latest SRA toolkit:
root@561935ac4ca2:~# dump-ref-fasta --version
dump-ref-fasta : 2.9.0 ( 2.9.2 )
root@561935ac4ca2:~# vdb-dump --version
vdb-dump : 2.9.2
Many thanks,
Ben
Ben, You seem to have some active firewall-ing and proxy-ing system. The fact that wget works but vdb-dump fails suggests that either http ranges (used by SRA Toolkit) are being blocked of some other deep packet inspection is going on. You may ask your IT whether anything has changes recently or you may contact SRA Toolkit (sra-tools@ncbi.nlm.nih.gov) with your problem of accessing regular SRR object. They have a nice set of tools and operating procedures to narrow down the issue.
As for name resolver, it is coded, but not released. we are working on preparing the package.
Thanks @yaschenk I don't seem to be able to reproduce the issue from other machines so I think it's a local environment problem.
How would I verify if an assembly does or does not exist? For example, SRR5424152 did not return an assembly when I used srapath -f names -r
as described. A majority of SRA run accessions are not returning results which is why I think something is wrong on my end.
And this is my perl code to mimic your shell code
sub downloadSkesaAsm{
my($SRR) = @_;
my $asm = "$tempdir/$SRR.fasta";
my $URL="";
open(my $fh, "srapath -f names -r $SRR.realign | ") or die "ERROR: could not run
srapath -f names -r $SRR.realign: $!";
while(<$fh>){
chomp;
my $eighthField = (split(/\|/,$_))[7] || "";
if($eighthField =~ m@(/traces/|^http)@ ){
$URL = $eighthField;
print STDERR "debug: $1 $URL\n";
}
}
close $fh;
if(!$URL){
print STDERR "ERROR: could not find URL for asm for $SRR\n";
return "";
}
system("dump-ref-fasta $URL > $asm");
die "ERROR with dump-ref-fasta $URL ($SRR): $!" if $?;
return $asm;
}
SRA is currently building assemblies only within a "Pathogen Detection Program". Which means only human bacterial pathogens are included. The exact list of organisms you can find following the link, It also matches our confidence in SKESA as an assembler for cultured bacterial WGSs
Pathogen Project is a very small fraction of SRA. Your example SRR5424152 is Douglas Fir.
We do plan to expand "assembly building" in SRA. It does take time since SRA is very heterogeneous.
Is there a standard way to download many files at once? To batch them? I was trying to think of some kind of method where you
@yaschenk i notice that most of the *.metadata.amr.tsv
files in the PDP now refer to GCA_xxxxxxxxxx
assemblies instead of SRR
numbers. Does this mean all those SKESA assemblies are now in Genbank? I count ~240,000 unique GCA_
accessions.
Yes, Pathogen Pipeline is now generating submissions to Genbank and automatically annotating them for submitters who requested this service. The assembly may slightly differ since it targets AMR regions.
@yaschenk can you describe how to use SKESA to target or guide assembly around particular reference genes of importance?
@tseemann, as far as SKESA is concerned, the only way to target it is to preselect reads close to the gene of interest. On the other hand, we have a different assembler which is close to get published (SAUTE -Sequence assembly using target evidence). It does exactly what you asked about. If you have a particular set of references and reads I'll be glad to try it.
Ok, i thought "SAUTE" was a feature being added to SKESA but I now understand it is a separate tool. We look forward to it!
Hi,
The dump-ref-fasta technique is not working for me for quite a while. And wget is now returning a 500 internal server error. I wonder if there is new name resolver or it is a temporary server issue.
Thanks, Tongzhou
The dump-ref-fasta technique is not working for me for quite a while. And wget is now returning a 500 internal server error. I wonder if there is new name resolver or it is a temporary server issue.
I am able to reproduce an error with dump-ref-fasta in testing. We will take a look at the problem, In the interim I was able to use
vdb-dump -T REFERENCE -f fasta2 SRR498276 > SRR498276.REFERENCE.fna
to output the reference for this run.
@stineaj Thanks! I used
vdb-dump -T REFERENCE -f fasta2 SRR498276 > SRR498276.REFERENCE.fna
and got a large file which I suppose is not the assembly but a full realign file.
So I tried
vdb-dump -T REFERENCE -f fasta2 https://sra-download.ncbi.nlm.nih.gov/srapub_files/SRR498276_SRR498276.realign > SRR498276.REFERENCE.fna
as discussed above and got segmentation fault.
Tongzhou
I thought I would contribute here since I was using the methods outlined here but they did stop working. I emailed the SRA help desk and got an answer on downloading the assemblies. Below is their response:
The location of the realign objects can change over time. This one now has a new location:
https://sra-download-nfs.be-md.ncbi.nlm.nih.gov/traces/sra48/SRZ/000498/SRR498276/SRR498276.realign
So a command like this should work:
vdb-dump -T REFERENCE -f fasta2 https://sra-download-nfs.be-md.ncbi.nlm.nih.gov/traces/sra48/SRZ/000498/SRR498276/SRR498276.realign
Please note that you may need to uncheck 'enable remote access' on the first tab of the configuration utility, and save the config in some cases if it is not working.
You can get the current location for a realign file using our SDL service and changing out the run accession or giving a list of comma delimited accessions:
I tested this and it worked for me, so hopefully, this helps others
As of March 2018, SKESA had been used by NCBI to assemble over 272,000 read sets available in SRA including assemblies for Salmonella (131,581 assemblies), Listeria (19,718 assemblies), Escherichia (65,307 assemblies), Shigella (10,942 assemblies), Campylobacter (32,416 assemblies), and Clostridioides (12,042 assemblies). These species are of importance for detecting pathogens in the food supply chain and in hospitals. Assemblies are publicly available in a downloadable object for each read set from the SRA website.
I am a bit unclear on where can we download these assemblies. Can you give an example?
I looked without luck for more info here ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/skesa/datasets