tk2 / RetroSeq

RetroSeq is a bioinformatics tool that searches for mobile element insertions using aligned reads in a BAM file and a library of reference transposable elements. Please read the wiki page (link below) for usage instructions. The wiki also has a page describing how the analysis of the 1000 Genomes CEU trio was carried out, with the files and parameters used for the various steps.

Coordinate system query #2

Open cbergman opened 9 years ago

cbergman commented 9 years ago

In the RetroSeq VCF file, the position of a TE insertion relative to the reference is given in 1-based coordinates in the POS column. In addition, there is a set of two consecutive coordinates in the INFO field, the first of which corresponds to the POS column and the second of which corresponds to the next base in the genome. Does this imply that the predicted insertion would integrate between the first and second positions in the INFO field? In other words, to convert RetroSeq predictions to 0-based coordinates, do we (i) use the two coordinates in the INFO field, or (ii) subtract 1 from the POS column to make a new start position in 0-based coordinates?

tk2 commented 9 years ago

Yes, that is correct. But to be honest, I never consider the breakpoints to be accurate to the exact bp. Some mini local assembly and realignment could get them to bp accuracy; I just never got around to implementing that.

cbergman commented 9 years ago

Thanks and sorry for the slow reply.

We are assuming that "that is correct" refers to "Does this imply that the predicted insertion would integrate between the first and second positions in the INFO field?".

This means that RetroSeq is using the INFO field to represent the TE insertion location (which is in reality inter-base) in 1-based coordinates by annotating a consecutive span of 2 nucleotides, with the insertion site lying between the first and second nucleotide. This 2-nucleotide span cannot be represented directly in the POS column of the VCF file, which only allows a single 1-based nucleotide position to be annotated.

To convert RetroSeq output to 0-based BED format in https://github.com/bergmanlab/mcclintock, we will maintain the 2-nucleotide framework, and thus annotate POS-1 for the start and POS+1 for the end of the 2-nucleotide interval.
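For anyone following along, here is a minimal sketch of the conversion described above. It assumes only that POS is the 1-based coordinate of the first of the two flanking nucleotides, as discussed in this thread. The function name `retroseq_pos_to_bed` and the example chromosome and position are hypothetical, chosen for illustration; this is not code taken from RetroSeq or mcclintock.

```python
from typing import Tuple


def retroseq_pos_to_bed(chrom: str, pos: int) -> Tuple[str, int, int]:
    """Convert a 1-based VCF POS into a 0-based, half-open BED interval
    covering the two nucleotides that flank the predicted insertion site.

    In 1-based closed coordinates the flanking nucleotides are POS and POS+1.
    BED uses 0-based starts and exclusive ends, so the same two nucleotides
    are written as start = POS-1 and end = POS+1.
    """
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    return (chrom, pos - 1, pos + 1)


# Hypothetical example: an insertion reported at POS 100 on chr2L.
chrom, start, end = retroseq_pos_to_bed("chr2L", 100)
print(f"{chrom}\t{start}\t{end}")
```

The resulting BED interval has length 2 (end minus start), which matches the 2-nucleotide span reported in the INFO field, with the insertion site falling between the two bases.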