ncbi / sra-tools

SRA Tools

Getting some global information without downloading the file(s) #715

Open anachshon opened 2 years ago

anachshon commented 2 years ago

I wonder if there is a command-line option to get some useful global information on an accession ID, such as how many mates there are (single / paired / technical), how many reads there are, what the range of read lengths is, ...

Thanks, aharon.

stineaj commented 2 years ago

The sra-stat program can give basic information about the run contents. There is a --quick option that will output metadata stored in the SRR itself:

sra-stat --xml --quick --archive-info SRR000001

Or you can get more in-depth information if you ask sra-stat to read through the archive and generate additional statistics. This will take significantly longer, however:

sra-stat --xml --statistics SRR000001
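For scripting, the XML report can be parsed with a few lines of Python. The sketch below is a minimal example and not part of sra-tools. It assumes the report has a `Run` element with `spot_count` and `base_count` attributes and, when `--statistics` is used, a `Statistics` element with one `Read` child per mate carrying `index`, `count`, `average`, and `stdev` attributes. Please check those names against the output of your sra-tools version. The helper names (`run_sra_stat`, `summarize`) and the sample XML are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_sra_stat(accession, extra_args=("--statistics",)):
    """Run sra-stat (must be on PATH) and return its XML report as text."""
    cmd = ["sra-stat", "--xml", *extra_args, accession]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def summarize(xml_text):
    """Extract spot/base counts and per-read length statistics.

    Assumed layout (verify against your own sra-stat output):
      <Run spot_count=".." base_count="..">
        <Statistics nreads="..">
          <Read index=".." count=".." average=".." stdev=".."/>
        </Statistics>
      </Run>
    """
    root = ET.fromstring(xml_text)
    run = root if root.tag == "Run" else root.find(".//Run")
    if run is None:
        raise ValueError("no <Run> element found")
    reads = [
        {
            "index": int(r.get("index")),
            "count": int(r.get("count")),
            "average_length": float(r.get("average")),
            "stdev": float(r.get("stdev")),
        }
        for r in run.findall("./Statistics/Read")
    ]
    return {
        "accession": run.get("accession"),
        "spots": int(run.get("spot_count", 0)),
        "bases": int(run.get("base_count", 0)),
        "reads": reads,
    }


# Made-up sample in the assumed layout, so the parser can be tried offline.
SAMPLE_XML = """<Run accession="SRRXXXXXXX" spot_count="2" base_count="500">
  <Statistics nreads="2" nspots="2">
    <Read index="0" count="2" average="100.00" stdev="0.00"/>
    <Read index="1" count="2" average="150.00" stdev="0.00"/>
  </Statistics>
</Run>"""

if __name__ == "__main__":
    print(summarize(SAMPLE_XML))
```

With sra-tools installed, `summarize(run_sra_stat("SRR000001"))` would run the second command above and return the same kind of dictionary, with one entry per read position.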

These options will require you to have the archives downloaded. We also have content both in eUtils and on our website, either in the Run Selector (best for comparing multiple runs) or the Run Browser: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR000001&display=metadata

You can also get quite a bit of information from BigQuery or Athena if you are able to use those services.
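As a rough illustration of the eUtils route, the Python sketch below builds an efetch request for the SRA "runinfo" CSV and picks out a few columns. It rests on several assumptions that should be checked before relying on it: that efetch accepts `db=sra` with `rettype=runinfo` and a run accession as `id`, and that the CSV has columns named `Run`, `spots`, `bases`, `avgLength`, and `LibraryLayout`. The function names and the sample row are made up. Also, as far as I know, `avgLength` is reported per spot rather than per mate, so it may not answer per-mate questions on its own.

```python
import csv
import io
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def runinfo_url(accession):
    """Build the efetch URL (assumed parameters: db=sra, rettype=runinfo)."""
    query = urllib.parse.urlencode(
        {"db": "sra", "id": accession, "rettype": "runinfo", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_runinfo(accession):
    """Download the runinfo CSV as text (needs network access)."""
    with urllib.request.urlopen(runinfo_url(accession)) as resp:
        return resp.read().decode("utf-8")


def parse_runinfo(
    csv_text, fields=("Run", "spots", "bases", "avgLength", "LibraryLayout")
):
    """Return one dict per run, keeping only the requested columns.

    Columns missing from the CSV come back as None.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{f: row.get(f) for f in fields} for row in rows]


# Made-up row using the assumed column names, for trying the parser offline.
SAMPLE_CSV = (
    "Run,spots,bases,avgLength,LibraryLayout\n"
    "SRRXXXXXXX,1000,250000,250,PAIRED\n"
)

if __name__ == "__main__":
    print(runinfo_url("SRR000001"))
    print(parse_runinfo(SAMPLE_CSV))
```

Calling `parse_runinfo(fetch_runinfo("SRR000001"))` would perform the actual request. Only metadata is transferred, not the archive itself.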

anachshon commented 2 years ago

Thanks for the prompt detailed answer.

I am interested in getting the number of reads and the read length of each of the mates.

With the command

sra-stat --xml -b 1 -e 2 SRR19410996

I can get the number of reads, the number of mates, and the total length of all mates, but not the length of each of the mates. Is there a way to get it (without, of course, downloading the data)? If not, I think it would be nice to add it.

Thanks, aharon.