vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

VG STATS: 2000 Properly Paired Inquiry #4434

Open AbdelF6 opened 3 weeks ago

AbdelF6 commented 3 weeks ago

1. What were you trying to do? Run vg stats on 3 files after aligning them with vg giraffe against the T2T-CHM13 reference.

2. What did you want to happen? I wanted vg giraffe to use paired-end mapping and the properly paired values to be accurate.

3. What actually happened? vg giraffe printed these warnings:
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
warning[vg::giraffe]: Falling back on single-end mapping

For vg stats, the properly paired value was 2000 for all 3 files. After removing duplicates from the files using samtools markdup, no warnings about clustering reads came up. I was wondering what the 2000 value means (is this a default output when single-end mapping occurs?). Also, is there a command in vg that can remove duplicates or notify the user of duplicates in a file?

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:


5. What data and command can the vg dev team use to make the problem happen? The files can be shared for download, along with the command used, so the problem can be reproduced.

6. What does running vg version say? vg version v1.60.0 "Annicco"

adamnovak commented 3 weeks ago

I think the feature request for a vg tool for marking/removing duplicates might already be tracked in https://github.com/vgteam/vg/issues/2662? The request is also mentioned in https://github.com/vgteam/vg/issues/3283. So people definitely want it.

Giraffe insists that your paired-end fragment distribution (which you can set manually with --fragment-mean and --fragment-stdev) produces a maximum fragment size (of mean + 2 standard deviations) that is longer than the first read length + 50 bp, and also longer than the clustering --distance-limit (default 200 bp). If you have unusually long reads, or unusually short fragments (so that e.g. your reads always sequence the whole fragment), this might not be the case.
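
As a rough sketch of that constraint in Python (a paraphrase of the description above, not vg's actual implementation; the function name and exact comparisons are assumptions):

```python
def cluster_constraint_ok(frag_mean, frag_stdev, read_length, distance_limit=200):
    """Approximate check of the clustering constraint described above (assumed, not vg's code)."""
    max_fragment = frag_mean + 2 * frag_stdev  # the "fragment distance limit"
    return max_fragment > read_length + 50 and max_fragment > distance_limit
```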

It sounds like the fragment length distribution, as learned from the non-deduplicated input reads, didn't satisfy this constraint, which is why Giraffe produced that Cannot cluster reads with a fragment distance smaller than read distance message. I'm not sure why that problem would go away when you removed duplicate read pairs from the input; maybe that also changed the order of reads in the file and made the statistics come out in a way Giraffe can handle? (Are you sure the correct pairs of reads are being fed in as pairs?)

What was the full warning message from Giraffe? It should have printed a couple more lines giving statistics about the fragment length distribution it observed: Fragment length distribution: mean=XXX, stdev=XXX and Fragment distance limit: XXX, read distance limit: XXX.

What is the fragment length distribution for this sequencing run supposed to be? If you actually are trying to run data where the fragments are meant to be very short relative to the reads, either we have to change the clusterer to not have this constraint, or you need to lie to Giraffe about your fragment length distribution with --fragment-mean and --fragment-stdev.

As for the statistics on the result showing 2000 reads marked properly paired, I think what is happening is that the initial 1000 pairs that we map with loose constraints, to try and learn the fragment length distribution, are being marked properly paired. But then all the reads mapped after that are falling back to single-ended mapping because of the problem that the warning indicates, and none of those are being marked properly paired. So we are actually producing output with 2000 reads that are marked properly paired.

AbdelF6 commented 3 weeks ago

@adamnovak Hi, thank you for your response!

The full warning message was:
Cannot cluster reads with a fragment distance smaller than read distance
Fragment length distribution: mean: 114.141, stdev = 33.3627
Fragment distance limit: 180, read distance limit: 200
Falling back on single-end mapping
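
Plugging those numbers into the constraint described earlier (a rough check, assuming the fragment distance limit is computed as mean + 2 standard deviations):

```python
mean, stdev = 114.141, 33.3627
fragment_distance_limit = mean + 2 * stdev  # about 180.87, reported as 180
read_distance_limit = 200                   # default --distance-limit

# 180 < 200, so the clusterer refuses and Giraffe falls back to single-end mapping
print(fragment_distance_limit > read_distance_limit)  # False
```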

I am not setting a fragment length distribution for this sequencing run, just using the default. Thank you for this suggestion!

adamnovak commented 3 weeks ago

@xchang1 I feel like we might be foggy on whether the "fragment length" is between the inner or outer ends of the reads in the fragment. Because it makes sense to have 150bp reads on a fragment with 114 bp between their inner ends, and it's weird that the clusterer wouldn't like that.

xchang1 commented 2 weeks ago

The fragment length should always be the distance between the outer ends.

The problem might be that we find the minimum distance between the outer ends, rather than the minimum distance between the inner ends plus the lengths of the reads. So the topologically minimum distance might end up being smaller than the lengths of the reads if there's a big deletion edge. I wouldn't expect that to happen a lot though.
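
To illustrate with made-up coordinates (purely hypothetical numbers, not from this dataset): if a big deletion edge lets the shortest graph path between the two outer read ends bypass most of the sequence the reads actually cover, the measured distance can drop below the combined read lengths:

```python
# Hypothetical example: 75 bp reads whose outer ends are 200 bp apart on the
# reference path, but a deletion edge starts 10 bp after read 1's outer end
# and lands 10 bp before read 2's outer end.
read1_len = read2_len = 75
outer_span_on_reference = 200                 # the true fragment length
shortest_path_via_deletion = 10 + 10          # path that takes the deletion edge
print(shortest_path_via_deletion < read1_len + read2_len)  # True: 20 < 150
```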

@AbdelF6 what graph are you using? Is it just the CHM13 reference?

AbdelF6 commented 2 weeks ago

Hi @xchang1, yes I am using the T2T-CHM13 reference to run on vg. I downloaded the indexes from this GitHub link: https://github.com/human-pangenomics/hpp_pangenome_resources/blob/main/hprc-v1.0-mc.md.

xchang1 commented 2 weeks ago

Ah ok. I wouldn't expect that problem on the HPRC minigraph-cactus graph. Are you shuffling your reads? Since we use the first thousand reads to figure out the fragment length distribution, we need those to be representative of the distribution we expect, but some sequencers put the bad reads first and mess up the distribution finder. Usually if that's the problem, giraffe will complain about bad read mappings, and it's pretty weird that it found such a tight distribution, but maybe your reads are in a weird order. If that's not the problem, are you able to share your reads with me? Just the first thousand should be enough for me to try to diagnose the problem.
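
If it helps, here is a minimal sketch of one way to shuffle paired FASTQ files while keeping mates together (hypothetical file names; this loads everything into memory, so a streaming tool would be preferable for large runs):

```python
import gzip
import random

def read_fastq_records(path):
    """Yield 4-line FASTQ records from a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                break
            yield record

def shuffle_pairs(r1_in, r2_in, r1_out, r2_out, seed=42):
    """Apply the same random permutation to R1 and R2 so pairs stay in sync."""
    pairs = list(zip(read_fastq_records(r1_in), read_fastq_records(r2_in)))
    random.seed(seed)
    random.shuffle(pairs)
    with open(r1_out, "w") as out1, open(r2_out, "w") as out2:
        for rec1, rec2 in pairs:
            out1.writelines(rec1)
            out2.writelines(rec2)

# Hypothetical usage:
# shuffle_pairs("sample_R1.fastq.gz", "sample_R2.fastq.gz",
#               "sample_R1.shuffled.fastq", "sample_R2.shuffled.fastq")
```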

xchang1 commented 2 weeks ago

Oh also, how long are your reads? Maybe they're just short and --distance-limit 200 is too big.

AbdelF6 commented 2 weeks ago

Hi @xchang1, the reads are paired-end, around 75 nucleotides each. But we have run vg giraffe with other files, all with the same read lengths (around 75 nucleotides, paired-end), and we did not get this 2000 properly paired statistic or the single-end mapping warning for those runs.

If the read length is around 75, what would you recommend we set the distance limit to?

xchang1 commented 2 weeks ago

Maybe try 100? I would still recommend shuffling your reads if they aren't already. It's a bit weird that the fragment length it's finding is smaller than the sum of the read lengths. I guess they could be overlapping, though.
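
As a rough sanity check of that suggestion (assuming the constraint works as described earlier in the thread, i.e. the fragment distance limit of mean + 2 standard deviations must exceed both the read length + 50 and --distance-limit): with the observed distribution and 75 bp reads, lowering --distance-limit to 100 should make the clusterer accept it:

```python
mean, stdev, read_len = 114.141, 33.3627, 75
fragment_distance_limit = mean + 2 * stdev             # about 180.87

for distance_limit in (200, 100):
    ok = (fragment_distance_limit > read_len + 50      # 180.87 > 125
          and fragment_distance_limit > distance_limit)
    print(distance_limit, ok)                           # 200 -> False, 100 -> True
```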