maickrau / GraphAligner

MIT License

Index for graph genome #7

Closed RenzoTale88 closed 4 years ago

RenzoTale88 commented 4 years ago

Hello, my name is Andrea, and I have a question concerning GraphAligner. I have a mammalian-size graph genome in the VG format. How can I specify the indexes to use for the alignment step? Will it simply look for the xg/gcsa files with the same name as the graph provided and use them? Or do I have to specify them instead of the vg graph?

Thank you in advance, Andrea

maickrau commented 4 years ago

Hi Andrea,

The aligner will automatically build the index when you run it. You don't need to (and can't) pre-build the index and it won't use the xg/gcsa indices. The graph should be given in vg format.
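Based on the answer above, a minimal invocation would pass only the vg graph and the reads; no pre-built index is needed. A sketch (the file names are placeholders, and the `-g`/`-f`/`-a` options reflect GraphAligner's documented CLI, which may differ slightly between versions):

```shell
# Align long reads directly against a vg-format graph.
# GraphAligner builds its own index on the fly; xg/gcsa files are ignored.
GraphAligner -g graph.vg -f reads.fastq -a alignments.gam
```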

RenzoTale88 commented 4 years ago

Thank you very much for your quick answer!

All the best,

Andrea

RenzoTale88 commented 4 years ago

Let me take the chance to ask an additional question. Is it possible to use the aligner for paired-end short reads as well, or is it specifically designed to work only with long reads?

Andrea

maickrau commented 4 years ago

It's only designed to work with long reads. The current indexing strategy won't work for short read alignment and it doesn't use paired-end information.

RenzoTale88 commented 4 years ago

Ok, thank you very much for your answer.

Andrea

kevyin commented 4 years ago

Hi @maickrau , thank you for GraphAligner. I was just hoping to clarify whether the indexing limitation for short reads applies only during the seeding stage, or whether the extension stage is limited as well.

I was wondering if it would be possible to load external seeds to increase short-read mapping performance. Thanks

maickrau commented 4 years ago

I haven't benchmarked the extension on short reads. The extension is optimized for long reads but it might work for short reads as well. You should use the hidden parameter "--precise-clipping", which clips the read ends more accurately, along with "--try-all-seeds". It might still give you some alignments with ~90% identity because it expects long-read error rates, so you should filter based on identity afterwards.
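Putting the suggestions above together, a short-read run might look like the sketch below. The `--precise-clipping` and `--try-all-seeds` flags come from the answer; the other options and file names are assumptions, and the exact syntax (e.g. whether `--precise-clipping` takes a value) may depend on the GraphAligner version:

```shell
# Experimental short-read alignment: clip ends precisely and extend
# from every seed instead of clustering them. Expect spurious ~90%
# identity hits that should be filtered out downstream.
GraphAligner -g graph.vg -f short_reads.fastq -a alignments.gam \
    --precise-clipping \
    --try-all-seeds
```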