iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file

GFA format and ML path format #23

Open bricoletc opened 3 years ago

bricoletc commented 3 years ago

@leoisl @mbhall88 could you paste here an example of the ML-path format that pandora uses to describe an ML path with respect to the linearised PRG?

Then I can implement it in gramtools.

Or, we could move to expressing this ML path on a GFA; this means pandora and gramtools would need to use the GFA produced by make_prg.

leoisl commented 3 years ago

This is the current output for a toy example in the make_prg branch:

2 samples
Sample toy_sample_1
1 loci with denovo variants
GC00010897
9 nodes 
(0 [0, 110) ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCACCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC)
(1 [113, 114) G)
(3 [121, 171) GAAATGCCCACGACCGGGCTGGATGAGCTGACAGAGGCCGAACGCGAGAC)
(4 [174, 175) G)
(6 [182, 301) CTGGCCGAATGGCTGGCCAAGCGCCGGGAAGCCTCGCAGAAGTCGCAGGAGGCCTACACGGCCATGTCTGCGGATCGGTGGCTGGTCACGCTGGCCAAGGCCATCAGGGAAGGGCAGGA)
(7 [304, 308) ACTG)
(9 [319, 360) CGCCCCGAACAGGCGGCCGCGATCTGGCACGGCATGGGGGA)
(10 [364, 365) G)
(12 [374, 491) GTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGCTGATCCCAAGGATCAAGGGGAGGGGGCACCAAAGGGGAAATGA)
2 denovo variants for this locus
toy_sample_1.GC00010897 44  .   C   T   10.7923 .   DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42   GT:PL:GP:GQ 1:40,0:-2.14748e+09,0:127
toy_sample_1.GC00010897 422 .   A   T   10.7923 .   DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42   GT:PL:GP:GQ 1:40,0:-2.14748e+09,0:127
Sample toy_sample_2
1 loci with denovo variants
GC00006032
11 nodes 
(0 [0, 145) TTGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG)
(2 [152, 153) T)
(3 [156, 169) CTCCCTGGCTAAT)
(5 [176, 177) A)
(6 [180, 237) ACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCA)
(8 [245, 246) T)
(9 [249, 267) CAAAATGAAACACTGCGA)
(11 [276, 277) T)
(12 [281, 290) ATTAAAATT)
(14 [299, 300) A)
(15 [304, 312) TTTCAATT)
1 denovo variants for this locus
toy_sample_2.GC00006032 49  .   A   G   10.7923 .   DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42   GT:PL:GP:GQ 1:40,0:-2.14748e+09,0:127

Variants are now described as VCF records, but the ML path representation is still a proprietary internal format. Taking one node as an example: (3 [156, 169) CTCCCTGGCTAAT)

3 is an internal id that pandora gives to this node and should be completely ignored; [156, 169) is the interval that this node's sequence, CTCCCTGGCTAAT, spans in the textual representation of the PRG.
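
As a rough illustration, a node line like the one above could be parsed as follows in Python. This is only a sketch that assumes the layout seen in the example output, `(id [start, end) sequence)`; the names `PathNode` and `parse_node_line` are hypothetical and not part of pandora or gramtools.

```python
import re
from typing import NamedTuple


class PathNode(NamedTuple):
    """One node of an ML path (hypothetical type, not pandora's)."""
    node_id: int   # pandora-internal id; consumers should ignore it
    start: int     # inclusive start in the linearised PRG string
    end: int       # exclusive end in the linearised PRG string
    sequence: str  # sequence spelled by this node


# Layout assumed from the example output: (id [start, end) SEQUENCE)
NODE_LINE = re.compile(r"^\((\d+) \[(\d+), (\d+)\) ([ACGTN]*)\)$")


def parse_node_line(line: str) -> PathNode:
    """Parse a single node line, raising ValueError if it does not match."""
    match = NODE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an ML path node line: {line!r}")
    node_id, start, end, sequence = match.groups()
    return PathNode(int(node_id), int(start), int(end), sequence)


node = parse_node_line("(3 [156, 169) CTCCCTGGCTAAT)")
```

In this example the half-open interval is as long as the sequence (169 - 156 = 13 bases), which a consumer could use as a sanity check.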

One issue is that we use this sequence interval in the textual representation to match PRG nodes to nodes in the recursion tree. I guess the proper solution would be for make_prg to give an id to each node in the PRG, so that any tool processing a PRG can refer to a node by its id.
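
To make the issue concrete, here is a hypothetical Python sketch of what matching by interval amounts to: the `[start, end)` interval is the lookup key. The dictionary and its label values are invented for illustration and do not reflect the actual data structures in pandora or make_prg.

```python
# Three nodes from the toy_sample_2 path above, as (start, end, sequence).
path_nodes = [
    (152, 153, "T"),
    (156, 169, "CTCCCTGGCTAAT"),
    (176, 177, "A"),
]

# Hypothetical index: interval in the linearised PRG -> label of the
# corresponding recursion-tree node. The labels are placeholders.
interval_to_tree_node = {
    (152, 153): "tree_node_a",
    (156, 169): "tree_node_b",
    (176, 177): "tree_node_c",
}

# Matching is a lookup keyed on the interval, not on a stable node id.
matched = [interval_to_tree_node[(start, end)] for start, end, _ in path_nodes]
```

A key like this presumably only works while every tool reads exactly the same linearised PRG string, whereas an explicit node id would not depend on the coordinates.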