noamteyssier closed this 6 months ago
I think SRA is closed source, unfortunately. Perhaps the SRA tools team can clarify? It would be great if a detailed description were published.
SRA is not closed source; it is in the public domain and is developed in the open at https://github.com/ncbi/sra-tools and https://github.com/ncbi/ncbi-vdb. You need to understand both. There are C, C++, Java, and Python bindings for the library. The physical file layout is very complex: it is a compressed columnar data store with its own language. Don't try to access it at that level; use the language bindings if you want to write your own tool. The best way to understand what is inside is to explore the data layout with vdb-dump, which shows the same layout you can use from the language bindings.
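To make the vdb-dump suggestion concrete, here is a minimal sketch of driving it from Python. It is an illustration under assumptions, not an official recipe: the helper names `build_vdb_dump_cmd` and `run_vdb_dump` are made up for this example, the accession `SRR000001` is only a placeholder, and the flags used (`-T` for table, `-C` for columns, `-R` for rows) and the `READ`/`QUALITY` column names should be checked against `vdb-dump --help` for your installed version.

```python
import shutil
import subprocess


def build_vdb_dump_cmd(accession, columns=None, rows=None, table=None):
    """Build an argument list for vdb-dump.

    Assumed flags (verify with `vdb-dump --help`):
      -T <table>, -C <comma-separated columns>, -R <row range>
    """
    cmd = ["vdb-dump", accession]
    if table:
        cmd += ["-T", table]
    if columns:
        cmd += ["-C", ",".join(columns)]
    if rows:
        cmd += ["-R", rows]
    return cmd


def run_vdb_dump(cmd):
    """Run the command if the executable is on PATH.

    Returns the captured stdout, or None when the tool is not installed.
    """
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder accession; substitute a run or local .sra file of your own.
    cmd = build_vdb_dump_cmd("SRR000001", columns=["READ", "QUALITY"], rows="1-2")
    print("command:", " ".join(cmd))
    output = run_vdb_dump(cmd)
    if output is None:
        print("vdb-dump not found on PATH; install sra-tools to run this.")
    else:
        print(output)
```

Separating command construction from execution keeps the flag assumptions in one place, so they are easy to correct once checked against the tool's help output.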
Hi @wraetz, is there any information on the physical file layout, which you describe as "very complex"? Additionally, given a set of FASTQ files, how is the file constructed? Of course I could try to work this out from the various scripts in the repo, but it would be helpful to understand the structure from a manual or man page.
The file layout is complicated and unimportant.
At a high level, it is a normalized database of genomics data organized into tables, with a consistent set of columns with a consistent set of datatypes, conforming to the INSDC SRA data model.
At a low level, it is an archive file.
And at levels in between, there are different abstractions.
We don't expect people who are accessing the data to need to deal with the lower level abstractions.
Hi @durbrow,
Referring back to my previous question, could you please point me to documentation that describes the lower level structure of the SRA archive file? This is important to a project I am currently working on.
Thank you! Sina B.
There is no such document. It isn't necessary for accessing the data, or even useful for that. You are welcome to examine the source code to see how it is written.
@sbooeshaghi is the physical layout necessary or would the logical layout be sufficient? Or perhaps it would make sense to discuss your project as much as you can to see what a good patch to connect the dots would be. We could converse by email if what you are working on is not ready to be in a public forum like this.
@sbooeshaghi: try the following cartoonish description of one flavor of SRA, the one that uses reference-based compression of BAM input data: https://ftp-trace.ncbi.nih.gov/sra/doc/csra-fileformat.ppsx It is more than 10 years old, but still valid enough to give you an idea of the logical and physical layout. Most of the example commands should still work on any SRA-format file, not only those produced from BAM.
Hi, up-to-date documentation of the SRA file format will help diagnose multiple reported issues for the SRA file format and will help facilitate various enhancements related to file parsing. Here are a few examples:
https://github.com/ncbi/sra-tools/issues/452, The perennial problem of supporting gzip in fasterq-dump
https://github.com/ncbi/sra-tools/issues/794, Increasing speed of SRA data conversion ~8.4x
https://github.com/ncbi/sra-tools/issues/889, Potential data corruption for multiple single-cell assays
Are there plans to produce such documentation?
Thanks.
It appears that what you want to know is how to use the same APIs we use (from ncbi-vdb) in writing the tools. I would suggest you start with our python bindings. Here is an example. It almost certainly won't work as-is, but it should give you the idea.
Hello
I was just wondering if there is a detailed description of the *.sra file layout?
I am interested in experimenting with building a tool to extract sequencing records from these files, but I can't find a good resource explaining what this file actually is or how the sequencing data is stored within it.
Apologies if this is obvious, but I would appreciate a link to a resource if one exists.
Cheers