maickrau / GraphAligner

MIT License
255 stars 30 forks source link

How to get the unique matching path for each read #95

Open xujialupaoli opened 8 months ago

xujialupaoli commented 8 months ago

Hi, Thanks for the clear documentation and awesome tool. I would like to get the unique matching path for each read.But through the following code, I saw that in the output gaf file, many reads correspond to multiple matches. Is there any way to filter out the best matching results? I hope you can help me to address it effectively. My command GraphAligner -g all_merge_W.gfa -f reads.fq.gz -a LR.gaf -x vg -t 64

image

maickrau commented 8 months ago

Hi, the output GAF has the mapping quality column which can be used to compare alignments which span the same area. You can filter by mapping quality for example with awk -F '\t' '$12>20' < alignments.gaf > filtered_alignments.gaf which will only select alignments with mapping quality > 20. You can also filter by alignment length, for example awk -F '\t' '$4-$3>1000' to select only alignments with length at least 1000bp. Combining them should filter out most spurious alignments: awk -F '\t' '$12>20 && $4-$3>1000'. However if the read has multiple equally good alignments (eg. the read is entirely contained within a repeat) then this will filter out all of them, since none of them are uniquely better than others.

In that specific picture it seems like that read doesn't have any good alignments. All of the alignments are very short and likely spurious.

xujialupaoli commented 8 months ago

Thank you for your reply, I will try the method you gave me. I would like to ask if I can determine the best match based on the maximum value of identiy marked by the red box in my picture? Is it reasonable to filter by the "identity" column?

image

maickrau commented 8 months ago

You can filter by identity to remove spurious alignments but I would suggest against using it to pick the best match among the secondary alignments. The issue is that the identity does not consider alignment length, so a shorter alignments might have a high identity if the sequence just happens to match some location by chance, even if a longer, slightly lower identity alignment might be biologically more realistic. It looks like your reads might be HiFi reads? In that case I suggest using the parameter --precise-clipping 0.95 for GraphAligner which will perform alignment identity based filtering and read clipping during alignment. The parameter should be set to somewhat lower than the expected alignment identity, eg. HiFi expected to be >99% identical but the parameter uses the threshold 95% instead.

xujialupaoli commented 8 months ago

Thank you !

xujialupaoli commented 7 months ago

Hi ! Thank you for your previous explanation. I still have a problem. In order to subsequently analyze the abundance information of different nodes, I can only use json file to obtain this information. But I still need to extract the unique matching path for each read from the gaf file. How can I output gaf and json files at the same time?

maickrau commented 7 months ago

You can use multiple output formats at the same time in GraphAligner, for example -a alignments.gaf -a alignments.json will write both. If you filter the alignments afterwards then you would need to do the same filtering on both files. The two outputs have the same information but in a different format, so getting the abundance from gaf should also be possible, but needs a different script than json.