lh3 / wgsim

Reads simulator
258 stars 91 forks source link

__gen_read routine does not reverse insertions on reverse strand #3

Open bredeson opened 12 years ago

bredeson commented 12 years ago

I'm not a C programmer, but (perl is similar enough and) I produced a 'git diff' of the changes I made to fix the issue. Perhaps there is a better way of doing this (insertion size is limited to length of longest end read), but tested it and it works:

git diff wgsim.old.c wgsim.c

diff --git a/wgsim.c b/wgsim.c index 5c82192..faf31c1 100644 --- a/wgsim.c +++ b/wgsim.c @@ -312,11 +312,20 @@ void wgsim_core(FILE fpout1, FILE fpout2, const char *fn, int is_hap, uint64_t tmp_seq[x][k++] = c & 0xf; \ if (mut_type == SUBSTITUTE) ++n_sub[x]; \ } else { \

delocalizer commented 11 years ago

Thanks for finding this problem. Another obvious (not neatest, nor most performant) fix in a few lines is to generate both reads in forward direction and reverse & complement tmp_seq[1] afterwards:

diff --git a/wgsim.c b/wgsim.c
index 55b8daf..85eef7d 100644
--- a/wgsim.c
+++ b/wgsim.c
@@ -323,8 +323,10 @@ void wgsim_core(FILE *fpout1, FILE *fpout2, const char *fn, int is_hap, uint64_t
                        } while (0)

                        __gen_read(0, pos, ++i);
-                       __gen_read(1, pos + d - 1, --i);
-                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = tmp_seq[1][k] < 4? 3 - tmp_seq[1][k] : 4; // complement
+                       __gen_read(1, pos + d - s[1], ++i);
+                       uint8_t revcomp[s[1]];
+                       for (k = 0; k < s[1]; ++k) revcomp[k] = tmp_seq[1][s[1] - k - 1] < 4? 3 - tmp_seq[1][s[1] - k - 1] : 4; 
+                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = revcomp[k];
                        if (ext_coor[0] < 0 || ext_coor[1] < 0) { // fail to generate the read(s)
                                --ii;
                                continue;
delocalizer commented 11 years ago

Addition to the fix proposed above to make sure ext_coor is correct:

@@ -319,12 +319,15 @@ void wgsim_core(FILE *fpout1, FILE *fpout2, const char *fn, int is_hap, uint64_t
                                                        tmp_seq[x][k++] = ins & 0x3;                            \
                                        }                                                                                                       \
                                }                                                                                                               \
+                               if (x) ext_coor[x] = i - 1;                                                                     \
                                if (k != s[x]) ext_coor[x] = -10;                                               \
                        } while (0)

                        __gen_read(0, pos, ++i);
-                       __gen_read(1, pos + d - 1, --i);
-                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = tmp_seq[1][k] < 4? 3 - tmp_seq[1][k] : 4; // complement
+                       __gen_read(1, pos + d - s[1], ++i);
+                       uint8_t revcomp[s[1]];
+                       for (k = 0; k < s[1]; ++k) revcomp[k] = tmp_seq[1][s[1] - k - 1] < 4? 3 - tmp_seq[1][s[1] - k - 1] : 4; 
+                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = revcomp[k];
                        if (ext_coor[0] < 0 || ext_coor[1] < 0) { // fail to generate the read(s)
                                --ii;
                                continue;
delocalizer commented 11 years ago

There is a subtle problem (or really 2 problems) with bredeson's original fix as it stands - reverse reads that terminate at or within insertions are incorrect. A reverse read that terminates exactly at the location of an insertion will have a last base equal to the reference rather than the first base of the (r.c.) of the insertion, and reverse reads that terminate in the middle of an insertion will have a truncated insertion.

e.g. with the following input

>test
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT

the following command (bredeson.wgsim compiled from wgsim.c with bredeson's
changes)

bredeson.wgsim -e 0 -d 200 -S 1367125828 -N500 -R 1.0 -X 0.7  -r 0.005 test.fa br.1.fq br.2.fq

generates the following single heterozygous insertion:

[wgsim] seed = 1367125828
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 400
test    213 -   TC  -

and the following reads with (nominally) no errors:

[crl@IMB10-02998 sim]$ grep -A1 _282_0: *.fq
br.1.fq:@test_52_282_0:0:0_0:0:1_116/1
br.1.fq-TAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTA
--
br.1.fq:@test_85_282_0:0:0_0:0:1_1c5/1
br.1.fq-GTAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTT
--
br.2.fq:@test_52_282_0:0:0_0:0:1_116/2
br.2.fq-GTAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTT
--
br.2.fq:@test_85_282_0:0:0_0:0:1_1c5/2
br.2.fq-AACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGG

note that the second and third reads are in reverse direction from 282, and
the last 3 bases in the reads are TTT. However their last base is at position
213, the location of the insertion, and so the last base should be the first
base of the r.c. of the insertion i.e. a 'G'. It isn't, it's the reference
base of T.

compare some reads 1 bp upstream from the same output:

[crl@IMB10-02998 sim]$ grep -A1 _281_0: *.fq
br.1.fq:@test_127_281_0:0:0_0:0:1_19/1
br.1.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT
--
br.1.fq:@test_130_281_0:0:0_0:0:1_f1/1
br.1.fq-GTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTT
--
br.1.fq:@test_61_281_0:0:0_0:0:1_12a/1
br.1.fq-GTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGG
--
br.2.fq:@test_127_281_0:0:0_0:0:1_19/2
br.2.fq-CCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGG
--
br.2.fq:@test_130_281_0:0:0_0:0:1_f1/2
br.2.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT
--
br.2.fq:@test_61_281_0:0:0_0:0:1_12a/2
br.2.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT

The first, fifth and sixth reads are in reverse direction from 281 and here we
see that the last 4 bases are TTAT, which is also incorrect, they should be
TTGA. This is actually the simplest way to tell that something is wrong in this
case - given the input and the single mutation, the motif TTAT should never
appear in any reads.

finally, compare some reads 1 more bp upstream:

[crl@IMB10-02998 sim]$ grep -A1 _280_0: *.fq
br.1.fq:@test_66_280_0:0:0_0:0:1_92/1
br.1.fq-AAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAA
--
br.1.fq:@test_78_280_0:0:0_0:0:1_11d/1
br.1.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.1.fq:@test_26_280_0:0:0_0:0:1_191/1
br.1.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.2.fq:@test_66_280_0:0:0_0:0:1_92/2
br.2.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.2.fq:@test_78_280_0:0:0_0:0:1_11d/2
br.2.fq-TTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAA
--
br.2.fq:@test_26_280_0:0:0_0:0:1_191/2
br.2.fq-AAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAA

and we see here at last the reads over the insertion (2nd, 3rd, 4th) are good,
ending in TTGAT.

The trickiness is representation in wgsim of insertion as sequence *after* a coordinate doesn't work as easily in reverse direction.