lh3 / wgsim

Reads simulator

missing the nucleotides of the start or stop codons #11

Closed qingl0331 closed 8 years ago

qingl0331 commented 8 years ago

Hi, sorry to bother you! When I use wgsim to simulate reads from a reference transcriptome and then assemble the simulated reads with Trinity, many of the assembled sequences are missing one, two, or all three nucleotides of the start or stop codon. It happens every time I run the simulation, across different datasets. I wonder whether there is a bug. Thank you! Best, Qing

lh3 commented 8 years ago

No, this is not a bug. If your reference genome is CDS only, the first and last few bases will always have very low coverage. You should consider changing your simulation.
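The edge effect can be illustrated with a minimal sketch. It assumes a simplified model (fixed-length fragments placed uniformly at random along the reference, one read taken from each end) and is not wgsim's actual code. The function name `simulate_depth` and all parameter values are illustrative assumptions. Under this model the first base is covered only when a fragment starts exactly at position 0, while an interior base can be covered by many different fragment placements.

```python
import random

def simulate_depth(ref_len, n_fragments, frag_len=200, read_len=70, seed=0):
    """Per-base depth from paired-end reads on uniformly placed fragments.

    Simplified illustrative model, not wgsim's implementation.
    """
    rng = random.Random(seed)
    depth = [0] * ref_len
    for _ in range(n_fragments):
        # The fragment start is uniform over every position where the
        # whole fragment still fits inside the reference.
        start = rng.randint(0, ref_len - frag_len)
        # One read from each end of the fragment.
        for read_start in (start, start + frag_len - read_len):
            for i in range(read_start, read_start + read_len):
                depth[i] += 1
    return depth

depth = simulate_depth(ref_len=1000, n_fragments=5000)
middle = sum(depth[400:600]) / 200
print("first 3 bases:", depth[:3])
print("last 3 bases: ", depth[-3:])
print("mean depth in the middle:", middle)
```

With these settings the depth at the two ends should come out far below the interior depth, which is consistent with an assembler dropping the terminal bases of a short reference.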

qingl0331 commented 8 years ago

Thanks! But I simulated from a reference transcriptome with UTRs... Is there a reason for the low coverage at the first and last few bases of the coding sequence? Is there any simulation strategy to get around it? Thank you!

lh3 commented 8 years ago

Many UTRs are incomplete. I think it is just the nature of your simulation procedure.

qingl0331 commented 8 years ago

No... at least one of the datasets is manually curated with complete UTRs, and it still missed the first and last few bases. Do you mean there is no way to get around this in the simulation?

lh3 commented 8 years ago

There is no way for wgsim to bias against CDS, so I don't see how this can be a bug in any way. Maybe RNA-seq simulation is different from DNA-seq due to the extra adapters and poly-A tails. As I said, it is just the nature of your simulation procedure.

qingl0331 commented 8 years ago

Hi, I found an example: this simulation dataset contains only ORFs (no UTRs and no poly-A), so there shouldn't be any difference between RNA and DNA. The complete sequence that was originally entered is returned after reassembly minus one or two nucleotides from the 3' or 5' end. This is the case for most sequences entered into the simulation. Should I add some nucleotides to the front or end, then? For example:

entered sequence:
ATGGGCATGCGGATGATGTTCACCGTGTTTCTGTCGGTTGTCTTGGCAACCACTCTTGTTTCCTTCACTTCAGGTCGCCGTGATAAAGCCAGTCACCAGAAGCGCGACTGTCCAGTGACTGGAGGCCCTAACCCCTTCCACCATTGCAAGATAGCCTGCATGAGCACCGGCACGGAAGAGTATTGTAACTGTGTCTACTGCAAGGATTGCGTCAATAGCAACGGGGAGAAGCCGGCGTGCTGA

returned sequence:
ATGGGCATGCGGATGATGTTCACCGTGTTTCTGTCGGTTGTCTTGGCAACCACTCTTGTTTCCTTCACTTCAGGTCGCCGTGATAAAGCCAGTCACCAGAAGCGCGACTGTCCAGTGACTGGAGGCCCTAACCCCTTCCACCATTGCAAGATAGCCTGCATGAGCACCGGCACGGAAGAGTATTGTAACTGTGTCTACTGCAAGGATTGCGTCAATAGCAACGGGGAGAAGCCGGCGTGCTG
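A comparison like the one above can be automated, and the idea of adding nucleotides to each end can be tried out, with a sketch along these lines. Both helpers, `missing_ends` and `pad_reference`, are hypothetical names introduced here for illustration. Padding with random bases is an assumption rather than a validated fix: the flanks would have to be trimmed from the assembled contigs afterwards, and nothing in this thread confirms that it resolves the problem. The sequences in the example are short toy strings, not the ones posted above.

```python
import random

def missing_ends(original, assembled):
    """Count bases lost from each end of `original`.

    Returns (missing_at_5prime, missing_at_3prime), or None when
    `assembled` is not an exact substring of `original`.
    """
    offset = original.find(assembled)
    if offset < 0:
        return None
    return offset, len(original) - offset - len(assembled)

def pad_reference(seq, flank_len=100, seed=0):
    """Add random A/C/G/T flanks to both ends of `seq` (hypothetical workaround).

    The flanks move the low-coverage edges away from the real sequence;
    they must be trimmed from any assembled contig afterwards.
    """
    rng = random.Random(seed)
    left = "".join(rng.choices("ACGT", k=flank_len))
    right = "".join(rng.choices("ACGT", k=flank_len))
    return left + seq + right

# Toy example: the assembled sequence lost the final base of the stop codon.
original = "ATGGGCATGCGGTGA"
assembled = "ATGGGCATGCGGTG"
print(missing_ends(original, assembled))  # (0, 1)

padded = pad_reference(original, flank_len=20)
print(len(padded))  # 55
```

Random flanks are used here instead of runs of `N` so that the padded reference contains only unambiguous bases.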
