ncbi / sra-tools

Description of SRA archive file layout #863

Closed noamteyssier closed 2 months ago

noamteyssier commented 8 months ago

Hello

I was just wondering if there is a detailed description of the *.sra file layout?

I am interested in experimenting with building a tool to extract sequencing records from these files, but I can't find a good resource on what this file actually is or how the sequencing data is stored within it.

Apologies if this is obvious, but I would appreciate a link to a resource if one exists.

Cheers

apredeus commented 7 months ago

I think SRA is closed source, unfortunately. Perhaps the SRA Tools team can clarify? I think it would be great to publish a detailed description.

wraetz commented 7 months ago

SRA is not closed source; it is actually public domain and is developed in public here: https://github.com/ncbi/sra-tools and https://github.com/ncbi/ncbi-vdb. You need to understand both. There are C/C++/Java/Python bindings for the library. The physical "file layout" is very complex: it is a compressed columnar data store with its own language. Don't try to access it at that level; use the language bindings if you want to write your own tool. The best way to understand what is inside is to use vdb-dump to explore the data layout, which is the same layout you see through the language bindings.
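As a rough illustration of exploring a run with vdb-dump, here is a minimal Python sketch that builds and runs a few vdb-dump command lines. The helper names (`vdb_dump_args`, `run_vdb_dump`) are hypothetical, the accession is only a placeholder, and the options are taken from vdb-dump's help text, so check `vdb-dump --help` on your installed version before relying on them.

```python
import shutil
import subprocess


def vdb_dump_args(accession, *, list_columns=False, rows=None, columns=None, fmt=None):
    """Build an argument list for vdb-dump (hypothetical helper; runs nothing)."""
    args = ["vdb-dump", accession]
    if list_columns:
        # Enumerate the column names of the default table in short form.
        args.append("--column_enum_short")
    if rows is not None:
        # Row selection, e.g. "1-5"; vdb-dump dumps all rows by default.
        args += ["--rows", rows]
    if columns:
        # Comma-separated column names; all columns by default.
        args += ["--columns", ",".join(columns)]
    if fmt is not None:
        # Output format, e.g. "csv" or "json".
        args += ["--format", fmt]
    return args


def run_vdb_dump(args):
    """Run the command if vdb-dump is on PATH; return its stdout, else None."""
    if shutil.which(args[0]) is None:
        return None
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    accession = "SRR000001"  # placeholder: substitute a run you care about

    # Step 1: see which columns the run exposes.
    print(run_vdb_dump(vdb_dump_args(accession, list_columns=True)))

    # Step 2: look at a few rows of two columns.
    print(run_vdb_dump(vdb_dump_args(accession, rows="1-5",
                                     columns=["READ", "QUALITY"], fmt="csv")))
```

The column names printed in the first step are the ones you would then request through the language bindings.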

sbooeshaghi commented 2 months ago

Hi @wraetz, is there information on the physical file layout, which you describe as "very complex"? Additionally, given a set of FASTQ files, how is the file constructed? Of course, I could try to work this out from various scripts in the repo, but it would be helpful for me to understand the structure from a manual or man page.

durbrow commented 2 months ago

The file layout is complicated and unimportant.

At a high level, it is a normalized database of genomics data organized into tables, with a consistent set of columns with a consistent set of datatypes, conforming to the INSDC SRA data model.

At a low level, it is an archive file.

And at levels in between, there are different abstractions.

We don't expect people who are accessing the data to need to deal with the lower level abstractions.

sbooeshaghi commented 2 months ago

Hi @durbrow,

Referring back to my previous question, could you please point me to documentation that describes the lower level structure of the SRA archive file? This is important to a project I am currently working on.

Thank you! Sina B.

durbrow commented 2 months ago

There is no such document. It isn't necessary for accessing the data, or even useful for that. You are welcome to examine the source code to see how it is written.

stineaj commented 2 months ago

@sbooeshaghi is the physical layout necessary or would the logical layout be sufficient? Or perhaps it would make sense to discuss your project as much as you can to see what a good patch to connect the dots would be. We could converse by email if what you are working on is not ready to be in a public forum like this.

yaschenk commented 2 months ago

@sbooeshaghi: try the following cartoonish description of one of the flavors of SRA, the ones using compression by reference based on BAM input data:

https://ftp-trace.ncbi.nih.gov/sra/doc/csra-fileformat.ppsx

It is more than 10 years old, but still valid enough to give you an idea of the logical and physical layout. The majority of the example commands should still work on any SRA format file, not only the ones produced from BAM.

sbooeshaghi commented 2 months ago

Hi, up-to-date documentation of the SRA file format would help diagnose several reported issues and would facilitate various enhancements related to file parsing. Here are a few examples:

https://github.com/ncbi/sra-tools/issues/452, The perennial problem of supporting gzip in fasterq-dump

https://github.com/ncbi/sra-tools/issues/794, Increasing speed of SRA data conversion ~8.4x

https://github.com/ncbi/sra-tools/issues/889, Potential data corruption for multiple single-cell assays

Are there plans to produce such documentation?

Thanks.

durbrow commented 2 months ago

452 is a feature request for how fasterq-dump formats its output. It has nothing to do with the .sra file format.

794 is a feature request to have an existing API perform like fasterq-dump does. It has nothing to do with the .sra file format.

889 is an issue with what data series are stored. It is a policy issue concerning the SRA data model. It has nothing to do with the .sra file format. This is the code repository for the SRA toolkit, and you are talking to the developers. We do not make policy.

It appears that what you want to know is how to use the same APIs we use (from ncbi-vdb) in writing the tools. I would suggest you start with our python bindings. Here is an example. It almost certainly won't work as-is, but it should give you the idea.