lh3 / miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
MIT License

Explanation of GFA file format #71

Open nhartwic opened 5 years ago

nhartwic commented 5 years ago

I have two primary questions here...

  1. What information does the 'x' type line contain and where is it documented?
  2. What information does the SD tag, found in 'L' type lines, contain? Example: "SD:i:2198268"

This file contains information about your version of GFA but it doesn't actually bother to explain the x line in any kind of depth.

Just as an example for the type of answer I'm after, here is my current understanding of the x line...

x seg_name seg_len golden_path_count ? ? read_1:rstart-rend read_1_ori read_2:rstart-rend read_2_ori

Lingering questions here are...

  1. What do the 3rd and 4th columns represent?
  2. Why are these two reads chosen to represent the unitig?

rchikhi commented 4 years ago

According to the manual: An 'x' line gives a brief summary of each unitig, which can be inferred from 'S' and 'a' lines.

Regarding your lingering question 1, I'll attempt an answer, using the source code. The 'x' line may be shorter if the unitig is circular (p->start == UINT32_MAX). Otherwise, columns 3 and 4 likely indicate a number of in/out-going edges (asg_arc_n() doesn't have a comment but reading that part is helpful). Read1 and read2 are chosen in this function, and they're actually called 'start' and 'end'. It seems to me that they're reads at start/end extremities of the unitig.
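
For anyone who wants to inspect these lines programmatically, here is a minimal sketch of a parser for the layout discussed in this thread. It is not taken from miniasm. The field names (`seg_name`, `seg_len`, `golden_path_count`, `unknown_a`, `unknown_b`) follow the column guess in the question above, the two unknown columns are only *possibly* arc counts, and the example lines are made up for illustration. It assumes whitespace-separated columns and treats a line with fewer columns as the circular case.

```python
from typing import NamedTuple, Optional


class XLine(NamedTuple):
    """One 'x' line, using the column names guessed in this thread."""
    seg_name: str
    seg_len: int
    golden_path_count: int
    unknown_a: Optional[int]      # possibly an arc count (unconfirmed)
    unknown_b: Optional[int]      # possibly an arc count (unconfirmed)
    start_read: Optional[str]     # e.g. "read_1:rstart-rend", kept as raw text
    start_ori: Optional[str]
    end_read: Optional[str]
    end_ori: Optional[str]


def parse_x_line(line: str) -> XLine:
    """Parse a single 'x' line; missing trailing columns become None."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != "x":
        raise ValueError("not an 'x' line: %r" % line)
    seg_name, seg_len, count = fields[1], int(fields[2]), int(fields[3])
    rest = fields[4:]
    if len(rest) < 6:
        # Shorter line: assumed to be the circular-unitig case.
        return XLine(seg_name, seg_len, count, None, None, None, None, None, None)
    return XLine(seg_name, seg_len, count,
                 int(rest[0]), int(rest[1]),
                 rest[2], rest[3], rest[4], rest[5])


# Hypothetical example lines, not real miniasm output.
LINEAR_EXAMPLE = "x\tutg000001l\t51234\t12\t0\t1\tread_1:10-9000\t+\tread_2:5-8000\t-"
CIRCULAR_EXAMPLE = "x\tutg000002c\t4000\t3"

if __name__ == "__main__":
    print(parse_x_line(LINEAR_EXAMPLE))
    print(parse_x_line(CIRCULAR_EXAMPLE))
```

The start/end read fields are kept as raw strings because the exact coordinate convention is not confirmed in this thread.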