samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Request for clearer documentation to explain bitwise FLAG values, ideally with documentation #55

Open kbradnam opened 10 years ago

kbradnam commented 10 years ago

If you are new to bioinformatics, and are asked to work with any SAM file, then you might reasonably turn to this documentation to help to better understand how the format works.

I feel that many people have trouble understanding what is meant by bitwise FLAG values. The documentation is very technical and not very transparent to people who may be new to bioinformatics.

Many people might be turning to the documentation after looking at their SAM output file. Maybe they see that their output file has a range of integer values in column 2 and are puzzled by the explanation in the documentation (this is very likely if you have no familiarity with bit patterns).

I think this section would be greatly helped by the following:

  1. A reminder that the SAM file itself stores an integer value (this is mentioned in the overview section for what all the columns mean, but it is not as obvious as it could be)
  2. An explicit description that a bitwise value of zero means that your read has mapped to the forward strand of the reference (it is not intuitive to work this out from the fact that the 'segment unmapped' bit has not been set)
  3. Some specific examples that explain what various integer values correspond to.
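As an illustration of the kind of example requested above, here is a minimal Python sketch that turns the integer in column 2 into a list of properties. The bit values come from the FLAG table in the SAM specification; the function name `explain_flag` and the short labels are made up for this sketch and are not part of any library.

```python
# Bit values follow the FLAG table in the SAM specification.
# The labels are informal names chosen for this sketch.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def explain_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG integer."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(explain_flag(0))   # no bits set: mapped, forward strand
print(explain_flag(16))  # only the reverse-strand bit is set
print(explain_flag(4))   # only the unmapped bit is set
```

A FLAG of 0 returns an empty list: neither the 'unmapped' bit nor the 'reverse strand' bit is set, which is why 0 means an unpaired read mapped to the forward strand.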

peterjc commented 10 years ago

I agree the SAM/BAM specification isn't novice-friendly, but maybe it doesn't need to be? It should be a dry, developer-centric technical document, supplemented by separate user-facing documentation provided by people using SAM/BAM.

kbradnam commented 10 years ago

That would be acceptable, but you would still ideally want to direct people to those more user-friendly sources of documentation from the main SAM documentation.

Most people will come across the current SAM documentation from a Google search for 'SAM format'.

dbolser commented 10 years ago

There is space on SEQwiki for user-created format information.

PeteHaitch commented 10 years ago

FWIW, I've always found http://broadinstitute.github.io/picard/explain-flags.html a very handy calculator for SAM flags.

LutzFr commented 10 years ago

The incorporation of these binary flags in an otherwise "readable" format lets me mischievously suspect that they were intended as an obstacle in the first place.

lh3 commented 10 years ago

A bit flag is a succinct way to encapsulate rich information. At the time of the first draft, it was not obvious how to represent multiple pieces of information in a readable style without greatly complicating the format. In the absence of an acceptable alternative, we kept the bit flag.

A few years later, I realized that we could use one character for each bit. This was the old samtools view -X output. In this representation, 99=0x63 becomes pP1R. It is more readable while maintaining a simple 1-to-1 translation to the bit flag. Nonetheless, the proposal was rejected by the consensus. Most considered this change too late as SAM was fairly mature.
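To make the arithmetic behind 99=0x63 concrete, here is a small Python sketch that builds the value from its individual bits. The bit values are those in the SAM specification's FLAG table; the constant names are chosen for this sketch only.

```python
# Bit values from the SAM specification; constant names are illustrative.
PAIRED = 0x1
PROPER_PAIR = 0x2
REVERSE = 0x10
MATE_REVERSE = 0x20
FIRST_IN_PAIR = 0x40

# OR-ing the individual bits together yields the integer stored in the FLAG column.
flag = PAIRED | PROPER_PAIR | MATE_REVERSE | FIRST_IN_PAIR

print(flag, hex(flag), bin(flag))  # 99 0x63 0b1100011

# Testing a single bit: the read itself is not on the reverse strand.
print(bool(flag & REVERSE))  # False
```

Because each property occupies its own bit, the integer and the set of properties determine each other exactly, which is what makes a one-to-one translation to any other notation possible.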

LutzFr commented 10 years ago

Thanks, Heng, for the comment! Today I found what is, for me, the best explanation so far of these flags, along with some tips on how to deal with them in Python and Perl scripts: http://blog.nextgenetics.net/?e=18 I spent a long time searching for this info. For a simple-minded biologist, a simple letter code has its advantages.