iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks source link

SAM file discussion #300

Open leoisl opened 1 year ago

leoisl commented 1 year ago

Intro

In the racon_denovo_p1 branch, we are producing a pseudo SAM file. The pseudo comes from some fields breaking them SAM format. This issue is to discuss the SAM file we produce now, and if there are fields we should add/change/remove. It would be nice to have input from you @iqbal-lab , @mbhall88 , @Danderson123 . This is a snippet of the current SAM we produce:

@PG ID:pandora  PN:pandora  VN:0.9.2
@CO LF: left flank sequence, the sequence before the first mapped kmer, soft-clipped
@CO RF: right flank sequence, the sequence after the last mapped kmer, soft-clipped
@CO PPCH: Prg Paths of the Cluster of Hits: the PRG path of each hit in considered cluster of hits
@CO NM: Total number of mismatches in the quasi-alignment
@CO AS: Alignment score (number of matches)
@CO nn: Number of ambiguous bases in the quasi-alignment
@CO cm: Number of minimizers in the quasi-alignment
simulated_read_0[5:97]  0   GC00010897  1{[343, 358)}   255 5S92=3S *   0   0   TGGCACGGCATGGGGGAGGTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGC    *   LF:Z:CGATC  RF:Z:TGA    PPCH:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}-> NM:i:0  AS:i:92 nn:i:92 cm:i:12
simulated_read_1[11:97] 0   GC00006032  3{[161, 169)[172, 173)[180, 186)}   255 11S86=3S    *   0   0   TGGCTAATCACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCACCAAAATGAAACACTGCGAC  *   LF:Z:TTCCGCCTCCC    RF:Z:ATT    PPCH:Z:3{[161, 169)[172, 173)[180, 186)}->2{[172, 173)[180, 194)}->1{[186, 201)}->1{[200, 215)}->1{[214, 229)}->1{[218, 233)}->1{[222, 237)}->3{[229, 237)[240, 241)[249, 255)}->1{[250, 265)}->2{[253, 267)[271, 272)}->   NM:i:0  AS:i:86 nn:i:86 cm:i:10
simulated_read_2[8:87]  0   GC00006032  1{[93, 108)}    255 8S79=13S    *   0   0   AGGCAAAGAGAGATGATCTCCTTGTATGTCTGTATTAGCATGATGTTGCTTGTTAATAATTAAAAATATCAACGCGCTT *   LF:Z:TAAAAAAC   RF:Z:ATATAAGCGCGGG  PPCH:Z:1{[93, 108)}->1{[86, 101)}->1{[82, 97)}->1{[79, 94)}->1{[67, 82)}->1{[66, 81)}->1{[61, 76)}->1{[58, 73)}->1{[52, 67)}->1{[43, 58)}->1{[29, 44)}->    NM:i:0  AS:i:79 nn:i:79 cm:i:11
simulated_read_3[5:95]  0   GC00006032  1{[200, 215)}   255 5S90=5S *   0   0   AATATTGTGATGTGCTCCATAAATGCCAATGTGGTGATTAGCCAGGGAGGCGGAAAATGTGTTTACACTCCTGAAATAATAAAAAACAGG  *   LF:Z:ATTGA  RF:Z:CAAAG  PPCH:Z:1{[200, 215)}->1{[186, 201)}->2{[172, 173)[180, 194)}->3{[161, 169)[172, 173)[180, 186)}->3{[158, 169)[172, 173)[180, 183)}->3{[144, 145)[152, 153)[156, 169)}->3{[137, 145)[152, 153)[156, 162)}->1{[129, 144)}->1{[125, 140)}->1{[115, 130)}->1{[105, 120)}->  NM:i:0  AS:i:90 nn:i:90 cm:i:11

We first have some header lines saying that pandora created this SAM file, and some explanations about the SAM extra fields we add. In particular, LF, RF and PPCH are pandora-specific SAM extra fields, and can get quite large - I think we will by default remove these, but we can add them back with a pandora --extra-SAM-fields flag, as they can be quite useful for debugging. The other SAM extra fields, NM, AS, nn, and cm are some of the standard ones we can produce - minimap2 also outputs these.

In summary, pandora matches reads and PRG minimisers, then filters these mappings, obtaining clusters of hits. Each cluster of hits can be thought of as a region of the read that maps to a single PRG. Each such cluster of hits -> PRG pair becomes a SAM record in the SAM file. Therefore, a single read can map to several PRGs and have several SAM records (this is expected for long reads, for example).

SAM fields details

Some comments about the SAM fields: Field 1. QNAME, query name: we include the query name, and in brackets the specific region of the query we quasimap. This is intentionally added as pandora expects to map different regions of a read to different floating loci. This is one crucial difference WRT linear mappers, and as such I think it needs to be well highlighted and put in the query name; Field 2. FLAG: right now the only flags we set are 0x0 (i.e. read mapped) and 0x4 (i.e. read did not map). @Danderson123 is working on setting flag 0x10: SEQ being reverse complemented, so we know the strand of the quasi-mapping (this is not as trivial as it looks like). As said previously, we can consider our query as being composed of several segments, mapping to potentially different loci. We could thus set all these other flags, which I currently simply ignore, as I have never used any in SAM post-filterings:

0x1: template having multiple segments in sequencing
0x2: each segment properly aligned according to the aligner
0x8: next segment in the template unmapped
0x20: SEQ of the next segment in the template being reverse complemented
0x40: the first segment in the template
0x80: the last segment in the template

These fields could be useful to infer for e.g. the order of genes a read maps to more easily, but it can also be inferred from the region described in QNAME. Should we set these flags?

Besides these flags, these two are currently ignored because we choose the best cluster of hits for each region of the read, and ignore secondary/supplementary clusters of hits:

0x100 secondary alignment
0x800 supplementary alignment

And these last two are ignored because they are not applicable:

0x200 not passing filters, such as platform/vendor quality controls
0x400 PCR or optical duplicate
  1. RNAME: this is just the PRG name the region of the query mapped to;

  2. POS: this is very different from the standard SAM file format, as here we describe the PRG path of the first kmer hit, instead of the position on the linear reference the alignment starts. However, the abstract concept this fields represents is the same: it is where the mapping starts;

  3. MAPQ: this is just being ignored for now, we always output 255, which means not available;

  4. CIGAR: we output a simple CIGAR, composed of soft clipping (S, the region of the read before the first mapped kmer and after the last mapped kmer), equal bases (=, when there is a minimiser kmer hit covering the region), mismatch (X, when there is no minimiser kmer hit covering the region). This is an example of a CIGAR we produce: 5S52=10X30=3S: we had the first kmer matching in the 6th kmer (first 5 bases are soft clipped - 5S), then for 52 bases we had exact matches, then we missed 10 bases, then 30 bases of exact matches and the last 3 bases we did not hit (3S). There is not much to improve with just quasi-mapping here. We could improve on the soft-clipped regions (S) and on the mismatched regions (X), but for that we would need proper alignment, which we are not doing.

  5. SEQ: we output the sequence of the region we quasi-mapped. This is just QNAME in sequence itself;

  6. RNEXT, PNEXT, TLEN, QUAL: we output for these either * or 0, which means not available. Please tell me if you think we should output any of these fields properly;

  7. Extra fields: we output the following extra SAM fields, explained in the header of the SAM file: LF, RF, PPCH, NM, AS, nn, cm.

Conclusion

In general, I am OK with what we are producing right now, I think is very useful for future pandora analyses. I can list two TODOs:

  1. Finish implementing flag 0x10;
  2. Decide if we want to be compliant with the SAM file format.
    • The utility of complying to the standard is that several tools would then be able to process pandora SAM files. For example right now we can't even run samtools to filter/sort/etc the SAM file, as it errors out. I can see two things that has to be changed to comply to the standard:
      1. Our references (PRGs) need a length in the header. This could be: length of the longest path, number of kmers in the PRG, number of minimising kmers in the PRG, etc;
      2. Position (POS) needs to be an integer. If I were to choose, I would choose the length of PRGs being the number of minimising kmers in the PRG and POS being the index of the first minimising kmer the query hits (this is possible as we have a total order of the minimising kmers due to the PRG being a DAG).

After this full explanation of the current implementation of SAM files in pandora, I'd need your input on: what to change, add, remove, and when (i.e. now, or in a later version).

mbhall88 commented 1 year ago

This is brilliant Leandro. I have a couple of questions/suggestions.

Firstly, I am very keen to make it SAM-compliant if we can. As you say, it would be make things nicer for using samtools etc.

For QNAME do you think it might be better to either add the specific region as either a pandora-specific tag, or alternatively it could be encoded in the CIGAR string no (i.e. hard/soft clip)?

We could use the TLEN field too right? It could solve your POS problem? POS could be the position the index of the firt minimizer and TLEN could be the number of minimizers in the alignment or something?

I don't have any opinion on flags.

If we are going to use things like number of minimixers etc. then we should definitely encode w and k in the header.

leoisl commented 1 year ago

For QNAME do you think it might be better to either add the specific region as either a pandora-specific tag, or alternatively it could be encoded in the CIGAR string no (i.e. hard/soft clip)?

That is true, but the idea of adding the region in brackets in QNAME is not only to give this information, but also to differentiate it from possible further mappings. For example, let's say we have a ONT read ONT_read_1 mapping to 3 different PRGs. Then we would have 3 SAM records starting as:

ONT_read_1 ...
ONT_read_1 ...
ONT_read_1 ...

I think by the standard, one of these records must be the primary alignment and the other two secondary/supplementary. But that is not the information we want to convey, as we are not a linear reference aligner, it is normal for us to map different parts of the reads to different "references". By adding the region to QNAME, we create a unique id of (read, region). However, I do agree that encoding this inside QNAME is definitely not the optimal way to deliver this information, but is good to guarantee unique SAM QNAMEs. I think we can keep this in QNAME but also add it in the CIGAR as clipping as you suggested, as this it makes sense (we actually clipped the ends of the read, and just mapped a region).

We could use the TLEN field too right? It could solve your POS problem? POS could be the position the index of the firt minimizer and TLEN could be the number of minimizers in the alignment or something?

Isn't POS a position on the reference only (i.e. the query and alignment do not matter)?

POS: 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.

This is going to be ill-defined anyway, as it seems to me that the CIGAR string (removing clippings) applied to the sequence REF[POS:] should describe the alignment, but that works only for linear references. We can transform our graph to a linear representation by DAGazing the nodes, but it is still not a linear reference. This might point that we actually need to use GAM (the graph version of SAM? I guess this is the name?) instead of SAM, but in a future version...

My preference to use minimisers in the PRG as length of the reference and POS as the index of the minimiser is because this info is easy to get when writing the SAM file.

If we are going to use things like number of minimixers etc. then we should definitely encode w and k in the header.

Well remembered!

leoisl commented 1 year ago

In particular, LF, RF and PPCH are pandora-specific SAM extra fields, and can get quite large - I think we will by default remove these, but we can add them back with a pandora --extra-SAM-fields flag, as they can be quite useful for debugging.

Just changed this on https://github.com/rmcolq/pandora/commit/54812b005c4e35f92d44d1dde423189ee45c7eab : flank size is always 2*kmer_size now, so this field never gets large and can be always included. It would not get very large anyway before, as it was capped at 50 bps, but 50 just looked like a random number. It seems to me 2k on each side should be enough for racon to anchor kmers. And SAM files are actually requiredto run racon, together with these two extra SAM fields, LF and RF, as that is what we use to infer which parts of which reads mapped to each PRG: https://github.com/rmcolq/pandora/blob/3c3dba0819e96b4ce364da5bca19923d063f4b31/src/denovo_discovery/denovo_utils.cpp#L3-L52

mbhall88 commented 1 year ago

I think by the standard, one of these records must be the primary alignment and the other two secondary/supplementary. But that is not the information we want to convey, as we are not a linear reference aligner, it is normal for us to map different parts of the reads to different "references". By adding the region to QNAME, we create a unique id of (read, region). However, I do agree that encoding this inside QNAME is definitely not the optimal way to deliver this information, but is good to guarantee unique SAM QNAMEs. I think we can keep this in QNAME but also add it in the CIGAR as clipping as you suggested, as this it makes sense (we actually clipped the ends of the read, and just mapped a region).

We could just take the minimap2 approach (I feel like I say that a lot 😅) and mark them, arbitrarily, as supplementary, but have the tp:A tag to designate them all as primary? Or we potentially don't even need that if we are saying we only ever output primary alignments.

I'm just being a bit paranoid about an edge case where a read has [ and/or ] characters in it already. But I guess in that case we would just need to ensure when we parse that region out of a read, we do an r-search (from right hand side).

Anyway I'm fine to leave it if you think it's best there. We can always revisit later.

Regarding the POS stuff, how about we do this. Using POS 1{[343, 358)} from your example SAM file, we make POS 343, we make TLEN 15 and then we add a tag for 1?

leoisl commented 1 year ago

We could just take the minimap2 approach (I feel like I say that a lot ) and mark them, arbitrarily, as supplementary, but have the tp:A tag to designate them all as primary? Or we potentially don't even need that if we are saying we only ever output primary alignments.

I didn't know a single query could have multiple primary alignments. I agree with removing the region from the query name, as it makes more sense semantically. The information will now be encoded in the CIGAR, as the first and last soft clipping events. We could also describe the interval as extra SAM fields if needed, although it seems we don't need this for now.

Regarding the POS stuff, how about we do this. Using POS 1{[343, 358)} from your example SAM file, we make POS 343, we make TLEN 15 and then we add a tag for 1?

This looks ok for me, i.e. having 343 as POS in this example, we are putting as POS the position of the first kmer in the string representation of the PRG, which is a linear representation.

I am not sure however if we need to add the TLEN and tag for 1. With TLEN we can infer the number of minimizers in the alignment (i.e. the size of the aligned path in the PRG), and with this tag for 1, we can infer the exact position of the first alignment (e.g. 1{[343, 358)}). However, we do already have the PPCH extra SAM field that describes exactly the aligned path through the PRG, e.g.: PPCH:Z:1{[343, 358)}->3{[347, 360)[369, 370)[374, 375)}->3{[353, 360)[369, 370)[374, 381)}->1{[376, 391)}->1{[379, 394)}->1{[391, 406)}->1{[399, 414)}->1{[407, 422)}->1{[410, 425)}->1{[415, 430)}->1{[426, 441)}->1{[433, 448)}->. So the TLEN and tag for 1 can be completely inferred from PPCH, and looks like redundant info. But happy to add them if you think having them will be useful anyway.

leoisl commented 1 year ago

With your suggestions, I inferred that the length of a PRG should be the length of its string representation (as POS is a position in this representation). This is required to compose the SAM's @SQ headers. Besides these, soft clipping does not actually work for our use case: if we soft clip the ends of a long read, in the SEQ field we have to put the entire read (samtools complained about this). Hard clipping works though.

With these changes we now have a compliant SAM file :tada: ... well at least I can run samtools sort on it

mbhall88 commented 1 year ago

Awesome. Let's keep this open as I am sure we are going to want to make changes as we start using these files more deeply.

mbhall88 commented 1 year ago

FWIW I just used this SAM file for the first time to debug some things and it was super helpful in its current form.