pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

ensure that the optimal sum_of_path_node_distances is actually 1 #466

Closed: subwaystation closed this 1 year ago

subwaystation commented 1 year ago

Running the 1D sum_of_path_node_distances on this very simple graph

S   1   ACTACAGTA
S   2   CTGG
S   3   AAGTA
P   Genome1 1+,2+
P   Genome2 1+,3+
L   1   +   2   +   
L   1   +   3   +

with

odgi stats -i phd2.gfa -p -y -s
---
sum_of_path_node_distances:
  - distance:
      path: Genome1
      in_node_space: 0.5
      in_nucleotide_space: 0.692308
      nodes: 2
      nucleotides: 13
      num_penalties: 0
  - distance:
      path: Genome2
      in_node_space: 1
      in_nucleotide_space: 0.928571
      nodes: 2
      nucleotides: 14
      num_penalties: 0
  - distance:
      path: all_paths
      in_node_space: 0.75
      in_nucleotide_space: 0.814815
      nodes: 4
      nucleotides: 27
      num_penalties: 0

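I noticed that the metric drops below 1 even for an optimally laid out path (Genome1 scores 0.5 in node space), which does not make sense: the optimal value should be exactly 1. This PR fixes this by always adding the contribution of the last node of a path to the sum.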

subwaystation commented 1 year ago

The new metrics:

---
sum_of_path_node_distances:
  - distance:
      path: Genome1
      in_node_space: 1
      in_nucleotide_space: 1
      nodes: 2
      nucleotides: 13
      num_penalties: 0
  - distance:
      path: Genome2
      in_node_space: 1.5
      in_nucleotide_space: 1.28571
      nodes: 2
      nucleotides: 14
      num_penalties: 0
  - distance:
      path: all_paths
      in_node_space: 1.25
      in_nucleotide_space: 1.14815
      nodes: 4
      nucleotides: 27
      num_penalties: 0
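
Both reports above can be reproduced by hand. The following is a minimal sketch, not odgi's actual implementation: it assumes the graph is laid out in the order of the S lines, that the distance between two consecutive path steps is the absolute difference of their node ranks (node space) or of their start offsets (nucleotide space), and that penalties can be ignored because num_penalties is 0 here. The names parse_gfa, layout and report are hypothetical helpers introduced for illustration only.

```python
# Sketch: recompute the per-path averages reported in this thread.
# Assumptions (not taken from the odgi source): nodes are laid out in S-line
# order, step distance = |difference of positions|, no penalties are applied.

GFA = """\
S 1 ACTACAGTA
S 2 CTGG
S 3 AAGTA
P Genome1 1+,2+
P Genome2 1+,3+
L 1 + 2 +
L 1 + 3 +
"""


def parse_gfa(text):
    """Return ({node: length}, {path: [node, ...]}) from S and P lines."""
    lengths, paths = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            lengths[fields[1]] = len(fields[2])
        elif fields[0] == "P":
            # Drop the trailing orientation sign from each step.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return lengths, paths


def layout(lengths):
    """Rank and nucleotide start offset of every node in S-line order."""
    rank, offset, pos = {}, {}, 0
    for i, (node, length) in enumerate(lengths.items()):
        rank[node] = i
        offset[node] = pos
        pos += length
    return rank, offset


def report(add_last_node):
    """Per-path (node space, nucleotide space) averages, plus all_paths."""
    lengths, paths = parse_gfa(GFA)
    rank, offset = layout(lengths)
    result = {}
    total_node_sum = total_nuc_sum = total_nodes = total_nucs = 0
    for name, steps in paths.items():
        pairs = list(zip(steps, steps[1:]))
        node_sum = sum(abs(rank[b] - rank[a]) for a, b in pairs)
        nuc_sum = sum(abs(offset[b] - offset[a]) for a, b in pairs)
        if add_last_node:
            # The fix described above: the last node also contributes.
            node_sum += 1
            nuc_sum += lengths[steps[-1]]
        n_nodes = len(steps)
        n_nucs = sum(lengths[s] for s in steps)
        result[name] = (node_sum / n_nodes, nuc_sum / n_nucs)
        total_node_sum += node_sum
        total_nuc_sum += nuc_sum
        total_nodes += n_nodes
        total_nucs += n_nucs
    result["all_paths"] = (total_node_sum / total_nodes,
                           total_nuc_sum / total_nucs)
    return result


if __name__ == "__main__":
    for label, flag in (("before", False), ("after", True)):
        for path, (node_avg, nuc_avg) in report(flag).items():
            print(f"{label} {path}: {node_avg:.6g} {nuc_avg:.6g}")
```

Under these assumptions, Genome1 without the last node sums to 1 step over 2 nodes (0.5) and 9 bp over 13 nucleotides (0.692308). Adding the last node gives 2/2 and 13/13, so the optimally laid out path scores exactly 1 in both spaces.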