maickrau / GraphAligner

MIT License

Incomplete edits in alignments #4

Closed tomokveld closed 5 years ago

tomokveld commented 5 years ago

Hi, I am encountering some unexpected behavior: not all bases seem to be accounted for in the edit fields of certain alignments.

I have been aligning onto a single-genome "graph" and comparing the results with vg. In most cases the alignments look fine, but in some cases the edits don't cover all the bases of a read. To illustrate, consider the alignments below of one 100 bp read with GraphAligner (92 bp accounted for) and vg (100 bp accounted for). Is this supposed to happen?

I also tried doing the alignments on an actual variation graph, and the bases of this particular read then seem to be properly accounted for in the edit field (although there are other reads that still have missing bases).
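The coverage check described here can be sketched in a few lines of Python. This is a minimal sketch, not part of GraphAligner or vg: the function name `read_bases_covered` is hypothetical, and it assumes the alignment JSON has the `path.mapping[].edit[]` layout shown in this report, with `to_length` counting the read bases each edit consumes.

```python
def read_bases_covered(alignment):
    """Sum to_length over all edits of all mappings (hypothetical helper).

    Assumes to_length is the number of read bases an edit consumes and
    that a missing to_length means zero read bases (e.g. a deletion).
    """
    mappings = alignment.get("path", {}).get("mapping", [])
    return sum(
        int(edit.get("to_length", 0))
        for mapping in mappings
        for edit in mapping.get("edit", [])
    )


# Edit lists copied from the two alignments in this report.
graphaligner_alignment = {"path": {"mapping": [{"edit": [
    {"from_length": 76, "to_length": 76},
    {"sequence": "C", "to_length": 1},
    {"from_length": 2, "to_length": 2},
    {"sequence": "TT", "to_length": 2},
    {"from_length": 2, "to_length": 2},
    {"sequence": "GG", "to_length": 2},
    {"from_length": 7, "to_length": 7},
]}]}}

vg_alignment = {"path": {"mapping": [{"edit": [
    {"from_length": 76, "to_length": 76},
    {"sequence": "TCACTTTACGGTTGGGGAAGCTGG", "to_length": 24},
]}]}}

READ_LENGTH = 100  # length of the read in this report

for label, aln in [("GraphAligner", graphaligner_alignment), ("vg", vg_alignment)]:
    covered = read_bases_covered(aln)
    print(label, covered, "of", READ_LENGTH, "bases accounted for")
```

Running a check like this over a whole alignment file would flag every read whose edits sum to fewer bases than the read length.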

Thanks,

Tom


GraphAligner

{
  "identity": 0.945054945054945,
  "name": "ERR239432.3455336",
  "path": {
    "mapping": [
      {
        "edit": [
          {
            "from_length": 76,
            "to_length": 76
          },
          {
            "sequence": "C",
            "to_length": 1
          },
          {
            "from_length": 2,
            "to_length": 2
          },
          {
            "sequence": "TT",
            "to_length": 2
          },
          {
            "from_length": 2,
            "to_length": 2
          },
          {
            "sequence": "GG",
            "to_length": 2
          },
          {
            "from_length": 7,
            "to_length": 7
          }
        ],
        "position": {
          "is_reverse": true,
          "node_id": "40236",
          "offset": "181"
        }
      }
    ]
  },
  "score": 13,
  "sequence": "CCCTGCAGCTGAGGCATCTGCTCTCTTTCAACATACTCTCTTCTGGAAAGAGGGAGAGCAGGAAGCATCATGGGTTTCACTTTACGGTTGGGGAAGCTGG"
}

vg

{
  "identity": 0.76,
  "mapping_quality": 60,
  "name": "ERR239432.3455336",
  "path": {
    "mapping": [
      {
        "edit": [
          {
            "from_length": 76,
            "to_length": 76
          },
          {
            "sequence": "TCACTTTACGGTTGGGGAAGCTGG",
            "to_length": 24
          }
        ],
        "position": {
          "is_reverse": true,
          "node_id": "40236",
          "offset": "181"
        },
        "rank": "1"
      }
    ]
  },
  "quality": "Ix8gHx4mICQiGCImICYWJR8fJh8gICMnIR8mIh8nHhwhJhkeHSEdJCYkIyAmIxwbHB8nHRodDxEdDx0ZIRwMIiUgJCMdICUiJiImICYhGSImJScmJCMnIiUZFhklJSYeBwYXFg==",
  "refpos": [
    {
      "is_reverse": true,
      "name": "6",
      "offset": "40219069"
    }
  ],
  "score": 81,
  "sequence": "CCCTGCAGCTGAGGCATCTGCTCTCTTTCAACATACTCTCTTCTGGAAAGAGGGAGAGCAGGAAGCATCATGGGTTTCACTTTACGGTTGGGGAAGCTGG",
  "time_used": 1345
}

maickrau commented 5 years ago

Hi, could you upload the read and the graph as well?

tomokveld commented 5 years ago

Yes, sure. It's too big for GitHub, so I put it on Dropbox: https://www.dropbox.com/sh/zw8dc4z1cbpz03y/AABrPTzxSeGQu1GkZvu0iHCBa?dl=0

maickrau commented 5 years ago

Fixed in commit 8e37ecb. Will be included in the next bioconda release.

The output edit fields were missing indels in some cases; here, the first 8 bp of a 9 bp insertion were missing. Because of this, the identity field also reported values that were too high.
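To illustrate why dropped insertion bases inflate identity, here is a rough Python sketch. It assumes identity is defined as matching columns divided by total alignment columns, which is an assumption for illustration only and not necessarily the formula GraphAligner uses (it does not reproduce the exact identity value in the output above). The helper `identity_from_edits` and the added 8 bp edit are hypothetical; where the missing bases actually belong in the edit list is not stated in this thread.

```python
def identity_from_edits(edits):
    """Matches / alignment columns under an ASSUMED identity definition.

    An edit counts as a match when from_length == to_length and it
    carries no sequence; every edit spans max(from_length, to_length)
    alignment columns.
    """
    matches = 0
    columns = 0
    for edit in edits:
        from_len = int(edit.get("from_length", 0))
        to_len = int(edit.get("to_length", 0))
        columns += max(from_len, to_len)
        if from_len == to_len and not edit.get("sequence"):
            matches += from_len
    return matches / columns if columns else 0.0


# Edits as reported by GraphAligner in this issue (92 read bases).
reported_edits = [
    {"from_length": 76, "to_length": 76},
    {"sequence": "C", "to_length": 1},
    {"from_length": 2, "to_length": 2},
    {"sequence": "TT", "to_length": 2},
    {"from_length": 2, "to_length": 2},
    {"sequence": "GG", "to_length": 2},
    {"from_length": 7, "to_length": 7},
]

# Hypothetical: the same edits plus the 8 missing inserted bases.
# The position of the extra edit is arbitrary; only its length matters here.
with_missing_bases = reported_edits + [{"to_length": 8}]

print("identity without the missing bases:", identity_from_edits(reported_edits))
print("identity with the missing bases:   ", identity_from_edits(with_missing_bases))
```

Under this definition the number of matches stays the same while the column count grows, so accounting for the missing inserted bases can only lower the identity.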

tomokveld commented 5 years ago

Thanks, I will have a look at it.