jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549

graph-concordance documentation unclear #168

Closed d-cameron closed 2 years ago

d-cameron commented 2 years ago

Hello

I am attempting to validate whether sga is able to assemble a set of known truth variants, and it is unclear whether sga graph-concordance is the appropriate utility for this.

The usage documentation ("check called variants for representation in the assembly graph") describes exactly what I'm after, so I was expecting to be able to do something like

sga graph-concordance reference.fa reads.ec.filter.pass.asqg.gz variants.vcf

and have it tell me whether the variants I've fed to sga correspond to valid paths through the assembly graph.

Unfortunately, it segfaults when run without arguments:

$ sga graph-concordance
graph-concordance: missing arguments
graph-concordance: a reference file must be given.
graph-concordance: a germline variants file must be given.
Segmentation fault

I eventually got it to run with:

sga graph-concordance --reference=$ref -r reads.ec.fastq -b reads.ec.fastq ../*.vcf -g ../*.vcf

and every output variant was annotated with MaxUniqueVariantKmers=0;KmerClassification=GERMLINE.

It looks very much like sga graph-concordance is a kmer-based tumour/normal variant classifier and not a general purpose utility for evaluating whether a given set of variants has a corresponding path in the assembly graph.

What does sga graph-concordance actually do?

jts commented 2 years ago

Hi @d-cameron,

Sorry for the lack of documentation, this program was never completely finished. graph-concordance is indeed a somatic classifier. IIRC it constructs the expected haplotype from the input variant, then checks whether the k-mers of this haplotype are found in the tumour and normal reads. If there is sufficient k-mer support in the tumour reads, but not the normal, the variant is annotated as being somatic. The code could be modified fairly easily to count k-mer coverage in a single sample only, if you (say) wanted to check whether a variant has a path through a de Bruijn graph for a given k.
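
For anyone reading later, the k-mer counting idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not sga's actual code: the function names, the min_support threshold, and the UNSUPPORTED label are hypothetical, positions are 0-based, and reverse-complement k-mers are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_haplotype(ref, pos, ref_allele, new_allele, flank):
    """Substitute new_allele for ref_allele at 0-based pos, keeping `flank` bases each side."""
    assert ref[pos:pos + len(ref_allele)] == ref_allele, "REF allele does not match reference"
    start = max(0, pos - flank)
    end = min(len(ref), pos + len(ref_allele) + flank)
    return ref[start:pos] + new_allele + ref[pos + len(ref_allele):end]


def variant_kmers(ref, pos, ref_allele, alt_allele, k):
    """k-mers found on the variant haplotype but not on the reference haplotype."""
    flank = k - 1  # enough flank for every k-mer that overlaps the variant
    alt_hap = build_haplotype(ref, pos, ref_allele, alt_allele, flank)
    ref_hap = build_haplotype(ref, pos, ref_allele, ref_allele, flank)
    return kmers(alt_hap, k) - kmers(ref_hap, k)


def count_supported(variant_kmer_set, reads, k):
    """Number of variant-specific k-mers seen in at least one read."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(variant_kmer_set & read_kmers)


def classify(ref, pos, ref_allele, alt_allele, tumour_reads, normal_reads,
             k=5, min_support=1):
    """Hypothetical decision rule: tumour-only support is called somatic."""
    vk = variant_kmers(ref, pos, ref_allele, alt_allele, k)
    tumour = count_supported(vk, tumour_reads, k)
    normal = count_supported(vk, normal_reads, k)
    if normal >= min_support:
        return "GERMLINE"
    if tumour >= min_support:
        return "SOMATIC"
    return "UNSUPPORTED"  # label invented for this sketch
```

The single-sample variant mentioned above would amount to calling count_supported on one read set and comparing the result against the number of variant-specific k-mers.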

I hope that helps, happy to chat if this points you in a useful direction.

Jared

d-cameron commented 2 years ago

I was looking for something that takes a VCF + reference as input and, for each variant (SVs in my case), determines whether there is a path in the .asqg graph that supports the variant. K-mer counting isn't particularly useful in my case as I was evaluating SGA graph construction in STR/VNTR regions for which de Bruijn graph assembly has failed. That is, I'm trying to validate to what extent OLC is viable in repeat regions. We know the contigs are broken at the loops/branches, but is the true path in the graph at all?

I'm in the process of writing my own version with very limited capabilities. It takes a VCF, asqg, and bam containing (minimap2) sga contig alignments, and outputs the path with length closest to that of the input variant.
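
The "path with length closest to that of the input variant" step could look something like the sketch below. It is not the tool described above, and it does no VCF, ASQG, or BAM parsing. It assumes the graph has already been reduced to a dict of vertex sequence lengths plus a dict of directed edges with overlap lengths, and that the source and target vertices flanking the variant are already known (for example from the contig alignments). All names here are hypothetical. Vertices may be revisited so that paths through repeat loops are considered, which makes the search exponential in the worst case; max_len bounds it.

```python
def closest_path(edges, seq_len, source, target, expected_len, max_len):
    """Find the source-to-target walk whose spelled length is closest to expected_len.

    edges:   dict mapping (u, v) -> overlap length between u and v
    seq_len: dict mapping vertex -> sequence length
    Returns (path, length), or None if no walk of length <= max_len exists.
    """
    # Every step must add at least one base, otherwise loops would never terminate.
    for (u, v), ovl in edges.items():
        assert ovl < seq_len[v], "overlap must be shorter than the next sequence"

    adj = {}
    for (u, v) in edges:
        adj.setdefault(u, []).append(v)

    best = None  # (abs difference, path, length)
    stack = [(source, [source], seq_len[source])]
    while stack:
        node, path, length = stack.pop()
        if length > max_len:
            continue  # prune: this walk is already too long
        if node == target:
            diff = abs(length - expected_len)
            if best is None or diff < best[0]:
                best = (diff, path, length)
            continue  # simplification: do not walk through the target
        for nxt in adj.get(node, []):
            # Extending by nxt adds its sequence minus the overlap with node.
            step = seq_len[nxt] - edges[(node, nxt)]
            stack.append((nxt, path + [nxt], length + step))

    return None if best is None else (best[1], best[2])
```

For a toy repeat where B overlaps itself, closest_path can pick the number of loop traversals that best matches the expected allele length, which is the kind of question that arises in STR/VNTR regions.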