rvaser / spoa

SIMD partial order alignment tool/library
MIT License

parameter setting to print out POA #13

Closed ksahlin closed 3 years ago

ksahlin commented 5 years ago

Hi Robert,

Spoa seems like a great tool/library that we're looking to use in our next project. Would it be possible for you to add a parameter setting (or add a setting to the -r) to print the POA graph? Regarding the format, maybe it could be an adjacency list or similar, with weights in each edge tuple.

I'm assuming all the necessary information is in the structure graph = spoa::createGraph(); but I'm unfortunately not proficient in C/C++ and am looking to call spoa from Python.

rvaser commented 5 years ago

Hi Kristoffer, sure can do :) The current implementation has the dot format (https://en.wikipedia.org/wiki/DOT_(graph_description_language)), will that suffice?

Best regards, Robert

rvaser commented 5 years ago

The sample data provided in test/ yields the following image (gold nodes are the consensus sequence):

image

ksahlin commented 5 years ago

Great, thanks!

Dot format should work great and is actually easy to import into python/networkx. The important thing is that all the information in the graph is present in the output (e.g., directionality, weights).

rvaser commented 5 years ago

Could you please specify which information you want in node labels and which in edge labels? The current version has node_id and character for nodes, and weight for edges. Mismatches are connected with directed edges, but the style is changed to -- and arrows are removed (because I do not know how to mix and match directed and undirected edges). Do you want to only plot the graph or do something else with it?

ksahlin commented 5 years ago

Your expertise on what is needed would actually be helpful here. The goal is to investigate alternative ways to derive consensus sequences from this graph (not necessarily the ML path).

Regarding the two different edge types: as long as they are distinguishable it should suffice. It actually makes sense not to have arrows on the mismatches, so that's preferred. I can easily implement my own parser for this format if it becomes complicated and doesn't follow standards.

Weights on edges and directionality (not for mismatches) are definitely crucial. However, is there any additional information used for construction of the consensus? Things like weights on nodes come to mind.

ksahlin commented 5 years ago

Oh, I misread your post. It doesn't matter that mismatches are directed, I'll take that into account in my parser!
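A parser along the lines discussed above could look roughly like the sketch below. The DOT snippet is hypothetical: it only assumes what the thread states (node labels carry node_id and character, weighted edges carry a weight label, mismatch links are styled without arrows). The exact attribute layout of the real spoa output may differ, so the regular expressions and the rule "an edge without a weight label is a mismatch link" are assumptions.

```python
import re

# Hypothetical DOT text; the real spoa output may use a different layout.
SAMPLE_DOT = '''digraph example {
  0 [label = "0 - T"]
  1 [label = "1 - C"]
  2 [label = "2 - T"]
  0 -> 1 [label = "8"]
  0 -> 2 [label = "2"]
  1 -> 2 [style = dotted, arrowhead = none]
}'''

NODE_RE = re.compile(r'^\s*(\d+)\s*\[(.*)\]\s*;?\s*$')
EDGE_RE = re.compile(r'^\s*(\d+)\s*->\s*(\d+)\s*\[(.*)\]\s*;?\s*$')
LABEL_RE = re.compile(r'label\s*=\s*"([^"]*)"')


def parse_dot(text):
    """Return (nodes, edges, mismatches) from a simple line-oriented DOT text.

    nodes:      node_id -> label string
    edges:      (tail, head) -> integer weight (directed)
    mismatches: set of frozenset({a, b}) pairs, treated as undirected
    """
    nodes, edges, mismatches = {}, {}, set()
    for line in text.splitlines():
        m = EDGE_RE.match(line)
        if m:
            tail, head, attrs = int(m.group(1)), int(m.group(2)), m.group(3)
            label = LABEL_RE.search(attrs)
            if label:
                edges[(tail, head)] = int(label.group(1))
            else:
                # Assumption: an edge without a weight label is a mismatch link,
                # so its direction is ignored.
                mismatches.add(frozenset((tail, head)))
            continue
        m = NODE_RE.match(line)
        if m:
            label = LABEL_RE.search(m.group(2))
            nodes[int(m.group(1))] = label.group(1) if label else ""
    return nodes, edges, mismatches


nodes, edges, mismatches = parse_dot(SAMPLE_DOT)
```

Storing mismatch links as unordered pairs means it does not matter which way they are written in the file.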

rvaser commented 5 years ago

No additional info is used in the heaviest path algorithm. I think the current format will suffice then. I'll push the update to master in the morning.
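For reference, a heaviest path over edge weights alone can be sketched as a generic dynamic program on a DAG. This is not spoa's actual implementation, only a minimal illustration that needs nothing beyond the directed edges and their weights; the function name and the (tail, head) -> weight input format are made up for the example, and it assumes a non-empty acyclic graph with positive weights.

```python
from collections import defaultdict


def heaviest_path(edges):
    """Return the node list of a maximum-weight path in a DAG.

    edges: dict mapping (tail, head) -> positive weight.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    order = []
    ready = sorted(n for n in nodes if indeg[n] == 0)
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # best[n] is the weight of the heaviest path ending at n.
    best = {n: 0 for n in nodes}
    prev = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Backtrack from the node with the largest accumulated weight.
    node = max(order, key=lambda n: best[n])
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Toy graph: two branches from node 0 that rejoin at node 3.
toy_path = heaviest_path({(0, 1): 8, (0, 2): 2, (1, 3): 10, (2, 3): 2})
```

Alternative consensus strategies could swap the scoring rule in the inner loop while keeping the same traversal.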

ksahlin commented 5 years ago

Excellent, thanks!

rvaser commented 5 years ago

I pushed an optional parameter -d <file> to master branch (v1.1.4). Let me know if you need anything else.

ksahlin commented 5 years ago

I tried it out and it seems to work great, thanks! I noticed, however, that all edges have double the weight of the number of sequences that pass through them. Is this expected?

For example,

>1
TCCGAC
>2
TCCGGC
>3
TCCGCC
>4
TCCGAC
>5
TTCGAC

Gives

image

rvaser commented 5 years ago

When using phred qualities I wanted both bases to somehow contribute to the edge weight, so I decided to use averages but left out the division/shift. When you are not using any weighting scheme, the edge weights will be equal to the number of sequences passing through times two.

I see now that I am printing node characters in integer form. Fixed it in the latest commit.
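The doubling described above can be reproduced with a small model. The sketch below is not spoa code: it assumes the five example sequences align column by column (they have equal length and differ only by substitutions), identifies a node by (column, base), and adds the weights of both bases to each edge without dividing by two. With no qualities, every base gets weight 1.

```python
from collections import Counter


def column_edge_weights(seqs, quals=None):
    """Sum per-base weights onto edges between consecutive columns.

    seqs:  equal-length sequences, assumed to align column by column.
    quals: optional list of per-base weight lists; defaults to 1 per base.
    """
    weights = Counter()
    for i, seq in enumerate(seqs):
        w = quals[i] if quals is not None else [1] * len(seq)
        for pos in range(len(seq) - 1):
            tail = (pos, seq[pos])
            head = (pos + 1, seq[pos + 1])
            # Both bases contribute; the "average" is left undivided.
            weights[(tail, head)] += w[pos] + w[pos + 1]
    return weights


seqs = ["TCCGAC", "TCCGGC", "TCCGCC", "TCCGAC", "TTCGAC"]
weights = column_edge_weights(seqs)
print(weights[((0, "T"), (1, "C"))])  # 8: four sequences, each contributing 1 + 1
print(weights[((0, "T"), (1, "T"))])  # 2: one sequence
```

Halving each weight recovers the number of sequences passing through an edge in the unweighted case.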

ksahlin commented 5 years ago

Ok that makes sense, and great to know it supports phred values in the graph.

I consider this feature request solved, but I will explore this over the coming weeks and let you know if something comes up.

Thanks a lot for your help!

ksahlin commented 5 years ago

Hi again,

I noticed a strange behavior in the graph representation. To me, it looks like one of the sequences is missing from the graph. I ran spoa seqs.fasta -l 2 -g -2 -d seqs.dot. I have attached two examples showing this:

>1
TCCGAC
>2
TCCGGC

gives image

and

>1
TCCGAC
>2
TCCGGC
>3
TCCGGC

gives

image

rvaser commented 5 years ago

I guess you are missing a newline at the end of your file. I have updated the parser for this but didn't pull it here yet. Will fix it in a moment.
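For anyone stuck on a version without the parser fix, one possible workaround is to make sure the input ends with a newline before calling spoa. A minimal sketch follows; the helper name is made up for illustration, and it assumes the missing final newline really is the cause.

```python
def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False otherwise.
    """
    with open(path, "rb+") as handle:
        data = handle.read()
        if data and not data.endswith(b"\n"):
            handle.seek(0, 2)  # move to end of file before writing
            handle.write(b"\n")
            return True
    return False
```

Calling it on seqs.fasta before running the command above leaves a correctly terminated file untouched.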

rvaser commented 5 years ago

Should be fixed in 1.1.5. I enabled compressed input files as well.