lh3 / wgsim

Reads simulator

read simulation for reverse and forward offset is not consistent. #2

Open jozerffer opened 13 years ago

jozerffer commented 13 years ago

Hi,

As mentioned in the title, the mutation offset is not consistent between simulated forward and reverse reads. Please check.

Example:

gi|224589813|ref|NC_000021.8|_9440728_9441261_0:0:1_0:0:0_42167/2 (forward)
ATGTCAAGATAATGTCAGAAATTCTTTACAATTGCTTCCAGAAGGAGTAGCCTTTTGATCTAGTGCACAGGTGTCCAGTC (TTTTA) GGCTTCTTAGGGCCA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

@gi|224589813|ref|NC_000021.8|_9440248_9440824_0:0:0_0:0:1_93e94/1 (reverse)
GCCCTAAGAAGCC (ATAAA) GACTGGACACCTGTGCACTAGATCAAAAGGCTACTCCTTCTGGAAGCAATTGTAAAGAATTTCTGACATTATCTTGACATGA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

simulated insert gi|224589813|ref|NC_000021.8| 9440811 - A +

lh3 commented 13 years ago

What is the problem? I do not see it.

jozerffer commented 13 years ago

Look at the forward read, which was simulated as C(TTTTA)GGC, while the reverse read is GCC(ATAAA)G. The reverse complement of the forward segment would be GCC(TAAAA)G.

It is inconsistent.
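The mismatch can be checked mechanically. Below is a minimal Python sketch, assuming the two parenthesised segments above cover the same reference positions on opposite strands. The helper name `reverse_complement` is illustrative and is not part of wgsim.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]


forward_segment = "CTTTTAGGC"  # C(TTTTA)GGC, taken from the forward read
reverse_segment = "GCCATAAAG"  # GCC(ATAAA)G, taken from the reverse read

# If both reads carried the insertion at the same reference position,
# the reverse segment would equal the reverse complement of the forward one.
expected_reverse = reverse_complement(forward_segment)
print(expected_reverse)                     # GCCTAAAAG
print(expected_reverse == reverse_segment)  # False
```

The inserted T sits one base further into the reverse read than the reverse complement predicts.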

jozerffer commented 13 years ago

The wgsim output, in pileup-like format, shows: 21 9440811 - A +

jamesls79 commented 13 years ago

Hi, I tried wgsim and ran into exactly the problem reported by jozerffer. Perhaps I can give a more visual description of the problem, as follows (a monospaced font illustrates it much better):

Read 1: forward, has an A inserted at the 85th base of the read (shown below in uppercase)

     80 cttttAggctt 90
9440807 ctttt-ggctt 9440816

Read 2: reverse, has a T inserted at the 15th base of the read (shown below in uppercase)

     10 agccaTaaaga 20
9440815 agcca-aaaga 9440806

Referring to the simulated list of insertions, I would think that Read 2 should be

10 agccTaaaaga 20
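One way to obtain a reverse read that agrees with the forward one is to apply the insertion in forward-strand coordinates first and reverse-complement afterwards. The Python sketch below does that for the 10-base reference window shown for Read 1. It assumes the listed position 9440811 means "insert after this 1-based reference base", which is how the Read 1 alignment reads. The function names are hypothetical, and this is not a description of how wgsim itself works.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_insertion(ref_window: str, window_start: int,
                    after_pos: int, inserted: str) -> str:
    """Insert `inserted` right after 1-based reference position `after_pos`.

    `window_start` is the reference position of ref_window[0].
    """
    offset = after_pos - window_start + 1  # window bases kept before the insertion
    if not 0 <= offset <= len(ref_window):
        raise ValueError("insertion position lies outside the window")
    return ref_window[:offset] + inserted + ref_window[offset:]


ref_window = "cttttggctt"  # forward-strand reference, 9440807..9440816
forward = apply_insertion(ref_window, 9440807, 9440811, "A")
reverse = reverse_complement(forward)

print(forward)  # cttttAggctt
print(reverse)  # aagccTaaaag
```

In the reverse-complemented window the T lands between `agcc` and `aaaa`, one base earlier than in the simulated Read 2.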

Your prompt reply to this post is much appreciated.

Thanks, James

bredeson commented 12 years ago

I'm seeing something similar. All (-)-strand reads have two or more nts upstream of their indel, reversed with respect to those on the (+)-strand:

In the following alignment:

GC_C_CG
.. . ..
CGC_CACG
CGC_CACG
CGC_CACG
CGC_CACG
cgcac_cg
cgcac_cg
cgcac_cg
cgcac_cg

the forward -CA becomes AC- on the reverse strand.

I also have a feature request: output a SAM/BAM file with the true alignments (CIGAR + MD tag).
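To illustrate what such a true alignment could look like, here is a minimal Python sketch that builds the CIGAR string and MD tag for a read containing a single insertion and no mismatches. The function names and the 1-based `ins_start` convention are assumptions made for this example, not an existing wgsim feature. Positions are counted along the read as it would be stored in SAM, that is, in forward-strand orientation.

```python
def cigar_with_insertion(read_len: int, ins_start: int, ins_len: int) -> str:
    """CIGAR for a read whose only difference from the reference is one insertion.

    ins_start is the 1-based read position of the first inserted base.
    """
    if ins_len < 1 or ins_start < 1 or ins_start + ins_len - 1 > read_len:
        raise ValueError("insertion does not fit inside the read")
    before = ins_start - 1                 # matched bases before the insertion
    after = read_len - before - ins_len    # matched bases after the insertion
    ops = [(before, "M"), (ins_len, "I"), (after, "M")]
    # Zero-length operations are skipped.
    return "".join(f"{n}{op}" for n, op in ops if n > 0)


def md_without_mismatches(read_len: int, ins_len: int) -> str:
    """MD tag when every aligned base matches; insertions do not appear in MD."""
    return str(read_len - ins_len)


# The forward read from this thread: 100 bases, A inserted at the 85th base.
print(cigar_with_insertion(100, 85, 1))  # 84M1I15M
print(md_without_mismatches(100, 1))     # 99
```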

jkbonfield commented 9 years ago

Duh, I noticed this too late. It may be the same bug I just fixed in the samtools version:

https://github.com/samtools/samtools/pull/428