jorvis / biocode

Bioinformatics code libraries and scripts
MIT License

Script needed for assembly evaluation #36

Open jorvis opened 8 years ago

jorvis commented 8 years ago

We need a script that simulates fragmented sequences from more complete input sequences. This is perhaps best illustrated with a current use case.

We are using unsheared, paired-end reads aligned to transcriptome assemblies to determine the real evidence for each, and possibly to group them further. We expect overlapping transcripts like these to be assembled:

5'---------------------3'
               5'----------------------------------3'

But paired-end grouping might also be able to pull these together, even inserting Ns given a known library insert size, if read mate pairs span the gap between them:

5'---------------------3'
                                          5'----------------------------------3'

The proposed script would let me take a known set of transcripts and artificially fragment them, generating some fragments that overlap and others that are separated from one another. This could be controlled with user-configurable options such as:

--min_overlap_distance=-200 --max_overlap_distance=100 --fragmentation_factor=6

Note the negative value, which allows for the second case above, where sequence fragments do not overlap. With these options, the script would transform a FASTA file of 1000 sequences into one with around 6000 sequences, where adjacent fragments from the same parent sequence overlap by up to 100 bp or lie as far as 200 bp apart.

Data should be appended to the header description of each output sequence to indicate its source sequence and its coordinates on that source.
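
To make the request concrete, here is a minimal Python sketch of the core fragmentation step for a single sequence. It is only one possible reading of the options above, not an existing biocode script: the function name `fragment_sequence`, the even spacing of cut points, the uniform draw of the overlap distance, and the `source=/start=/end=` header format are all assumptions made for illustration. Coordinates in the header are 1-based and inclusive.

```python
import random


def fragment_sequence(seq_id, seq, min_overlap=-200, max_overlap=100,
                      fragmentation_factor=6, rng=None):
    """Split one sequence into fragments that overlap or are separated by gaps.

    Hypothetical helper: returns a list of (header, fragment) tuples.
    A positive overlap distance means adjacent fragments share that many
    bases; a negative one means they are that many bases apart.
    """
    rng = rng or random.Random()
    length = len(seq)
    # Never ask for more fragments than there are bases.
    n = min(fragmentation_factor, length)
    if n < 1:
        return []

    # Evenly spaced cut points: bounds[k] .. bounds[k + 1] is piece k.
    bounds = [k * length // n for k in range(n + 1)]

    fragments = []
    prev_start = -1
    for k in range(n):
        start, end = bounds[k], bounds[k + 1]
        if k > 0:
            # Shift the start left (overlap) or right (gap) relative to
            # the end of the previous fragment, which sits at bounds[k].
            start -= rng.randint(min_overlap, max_overlap)
        # Keep fragments ordered, inside the parent, and non-empty.
        start = max(start, prev_start + 1, 0)
        start = min(start, end - 1)

        # Assumed header format: source ID plus 1-based inclusive coordinates.
        header = f"{seq_id}_frag{k + 1} source={seq_id} start={start + 1} end={end}"
        fragments.append((header, seq[start:end]))
        prev_start = start
    return fragments
```

Applied to every record in a FASTA file with `fragmentation_factor=6`, this would turn 1000 input sequences into roughly 6000 output sequences (fewer for very short inputs). A real script would also need FASTA parsing and command-line handling for `--min_overlap_distance`, `--max_overlap_distance` and `--fragmentation_factor`, and might prefer random rather than evenly spaced cut points.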

jorvis commented 8 years ago

As a side note, we should compare this with the approach described here: http://www.ncbi.nlm.nih.gov/pubmed/22962361