samtools / htslib

C library for high-throughput sequencing data formats
Other
800 stars 446 forks source link

Sam to gfa? #1824

Closed Axze-rgb closed 3 weeks ago

Axze-rgb commented 3 weeks ago

Hello,

especially with the advent of phasing, I find myself needing more and more gfa instead of fasta, or in complement. I wonder if there is a tool in samtools to convert a sam to gfa? If not, I think it would be a cool addition and I think not difficult to implement (so far we use in house python scripts). a samtools gfa would be cool, unless you think it's too trivial (sorry I feel a bit uncomfortable suggesting something).

Thanks for samtools anyway, it's invaluable.

Cheers

jkbonfield commented 3 weeks ago

What exactly is it you wish? GFA is a graph, while SAM data is aligned against a linear reference. Are you asking for alignment records in GFA too? This feels like a really open-ended request so something more concrete (ideally with a worked example) would help us judge whether it is useful to the wider community and how much work is involved.

Axze-rgb commented 3 weeks ago

Well, gfa is extremely useful to inspect phasing in genome assemblies, find rare variants in meta genomics ,etc. here's an example of why converting an alignment to gfa might be useful

https://doi.org/10.1186/s13059-020-02157-2

jkbonfield commented 3 weeks ago

I don't deny genome graphs and pangenome references are very useful. They're far better at recording varying than a linear reference.

However I simply don't understand how you can convert a SAM file representing alignments against a linear reference to a graph. It feels like inventing something out of nothing, which therefore can't have any use that I can see. The correct solution, if you want a graph alignment, would be to discard the SAM file and realign everything to the graph. That doesn't feel like something for htslib to do though.

That said there is clearly a long-term need for better tooling and representation of file formats and manipulation of graph objects, but this issue is too vague for us to understand what is being asked for. Sorry.

Axze-rgb commented 3 weeks ago

This is ok, I a agree with your answer and understand your point of view. I was looking for something that could easily convert an alignment in a graph, using the information in the Sam about what contigs align to what reference and reconstructing the paths like that. I agree with your statement it would be reinventing something.

I understand where you are coming from and somehow I realize I would love a htslib for graphs. I think the future might have workflows with mapping/phasing. But it's a feeling and it would be reinventing a whole set of tools, actually.

Thanks for your time

jkbonfield commented 3 weeks ago

Ah so this is a bit clearer. If I understand it, you already have a graph alignment (presumably in .gaf or similar) as well as an alignmenr to a linear reference (in SAM equivalent). Is that correct?

If so, what is it you then want? You presumably have a third piece of knowledge somewhere about how the pangenome graph and the linear reference match. What format is that in? I'm somewhat new to the graph world and it seems to be a diverse set of competing standards so I'm really not sure what formats are in play at present.

One thing I have played with though is GraphAligner. It uses GAF format (confusingly named and not to be misread as GFA). One of the fields in this is a path, like >node1>node2>node2>node2>node7. It also has a CIGAR string. If we follow the named path in the GFA we can construct a sequence by walking that path, which we then align our query sequence against, generating the CIGAR string. It even has NM tags counting the number of discrepancies. It could even be used as a better reference for reference-based encoding schemes like CRAM.

This does feel like it's a SAM variant already, with the classical linear reference being replaced by a path. Some form of convergence would probably be good, but this is a huge project.

Axze-rgb commented 3 weeks ago

Yes I realise there was some sort of misunderstanding. I think now we are on the same page. Let's say I am convinced more and more we will have phased genomes or metagenomes and instead or having two or more fasta files, which is impractical, everything could be in a gfa, and indeed sequencing reads could be aligned on the gfa.

What interests me is the discovery of variants but we are somehow in a between worlds with some references being the classical haploid fasta on which you align reads, and some with an assembly graph on which I can directly map the reads. And actually no real method to compare the two formats.

In theory, I think gfa with mapped reads would better represent genomes and their natural variations (you might disagree with that). But indeed transitioning from one representation to the other is not a trivial task.

Let's say I can imagine there might be in the future a need for a htslib for reads mapped on a gfa, to manipulate easily a gaf, or any other format that would become a standard (like Sam/Bam have become). But as you said it's likely a colossal programming project, more than I initially envisioned.

Thanks for your patience, I really appreciate because my initial query was not sufficiently clear, and it's 100% my fault. I appreciate your feedback.