maickrau / GraphAligner

MIT License

Completely turn off new clipping behaviour #40

Open subwaystation opened 3 years ago

subwaystation commented 3 years ago

Hi @maickrau, since the new release, and as you mentioned in https://github.com/maickrau/GraphAligner/issues/28, --precise-clipping is turned on by default. For my evaluation of reconstruction accuracy, I would like to turn it off completely. However, the lower limit is 0.501:

precise clipping identity cutoff must be between 0.501 and 0.999

Is there a specific reason? Thanks for any feedback!

maickrau commented 3 years ago

The reason for the lower limit is that random alignments have about 50% identity, so a lower cutoff would treat random alignments as valid alignments. Can you say a bit more about your evaluation? Do you want to have the entire sequences aligned end-to-end?

subwaystation commented 3 years ago

Ah, that's where it's coming from. In https://github.com/pangenome/pgge I am measuring the reconstruction accuracy of a pangenome graph. I want to find out how well the sequences from which we created the pangenome graph are preserved in the actual graph. For this I use a so-called query sequence containment metric: https://github.com/pangenome/rs-peanut#query-sequence-containment-qsc. Unfortunately, GitHub images are somehow broken at the moment, so here is the idea: I count the number of nucleotide matches across all queries and divide it by the summed length of all queries. If the cutoff is at 0.501, I will miss some nucleotide matches of the query, so I won't get the complete picture. I understand the need to prevent random alignments, but for us it would be helpful to take a look at everything. Does this make sense to you?
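As a rough illustration of the metric described above, here is a minimal Python sketch. It is not the rs-peanut implementation: the function name `query_sequence_containment` and the input layout (one matched-base count per alignment record, plus a dict of query lengths) are assumptions made for this example, and it assumes that no query base is counted by more than one alignment.

```python
def query_sequence_containment(matches_per_alignment, query_lengths):
    """Fraction of query nucleotides that are matched in the graph.

    matches_per_alignment: iterable of (query_name, matched_bases) pairs,
        one per alignment record (hypothetical layout for this sketch).
    query_lengths: dict mapping query_name -> query length in bases.
    Assumes alignments of the same query do not overlap, so no base
    is counted twice.
    """
    total_length = sum(query_lengths.values())
    if total_length == 0:
        raise ValueError("total query length must be positive")
    total_matches = sum(matched for _, matched in matches_per_alignment)
    return total_matches / total_length


# Two queries of 100 and 300 bases with 90 and 270 matched bases.
example_matches = [("q1", 90), ("q2", 270)]
example_lengths = {"q1": 100, "q2": 300}
qsc = query_sequence_containment(example_matches, example_lengths)
```

Any query bases removed by clipping contribute nothing to the numerator but still count in the denominator, which is why a clipping cutoff lowers the reported value.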

subwaystation commented 3 years ago

To give a concrete example: one sequence in the graph is the full chm13 chr8 sequence. When aligning this sequence back to the created pangenome graph, we split it into 100 kb chunks to avoid running out of memory and to map more efficiently. So clipping does not make sense here.
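A minimal sketch of such a chunking step in Python, assuming fixed-size, non-overlapping pieces with a shorter final piece. The function name `split_into_chunks` is hypothetical, and the actual pgge pipeline may split sequences differently.

```python
def split_into_chunks(sequence, chunk_size=100_000):
    """Split a sequence into consecutive, non-overlapping chunks.

    The last chunk may be shorter than chunk_size. The 100 kb default
    mirrors the chunk size mentioned above; everything else here is an
    assumption for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        sequence[start:start + chunk_size]
        for start in range(0, len(sequence), chunk_size)
    ]


# Toy example with a tiny chunk size so the result is easy to inspect.
chunks = split_into_chunks("ACGTACGTAC", chunk_size=4)
```

Because every chunk is an exact substring of a sequence that is itself a path in the graph, each chunk is expected to align end-to-end.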

subwaystation commented 3 years ago

@maickrau I could open a PR from https://github.com/subwaystation/GraphAligner/tree/precise_clipping_min_0.001 that lowers the minimum allowed --precise-clipping value to 0.001, as was the case in the old GraphAligner version. I am not sure whether 0.0 would slow down GraphAligner.

Anyhow, @ekg and I would be very happy if we could have a chat with you about sequence-to-graph alignment :) It is hard to find your email address, so here is mine: simon.heumos@qbic.uni-tuebingen.de. I can arrange things, or feel free to contact me. Cheers!