tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0

repeat unit in the reference? #52

Closed by shukwong 6 years ago

shukwong commented 6 years ago

I was wondering whether it would be possible to also provide the repeat unit in the reference, or a script to add it. I'm looking to aggregate all the STRs with the same repeat unit and couldn't find an easy way to do this.

Thanks! Wendy

tfwillems commented 6 years ago

Hi Wendy,

This is a great suggestion! However, I'm hesitant to add the repeat unit to the reference BED files, as they'll no longer be backward compatible. Would it be helpful if I added the repeat unit to the output VCF produced by HipSTR? It should be fairly straightforward to infer from the reference allele and so I'd be happy to incorporate it in the genotyping process and report it as an INFO field in the VCF.

Or for your analysis would it be easier to annotate the reference BED with this info?
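
As a rough illustration of how a repeat unit could be inferred from a reference allele, here is a minimal Python sketch that picks the most frequent k-mer of the locus's period length. This is an assumption made for illustration, not HipSTR's actual method: `infer_motif` is a hypothetical helper, the period is assumed to be known already (for example from the reference BED), and ties are broken alphabetically.

```python
from collections import Counter

def infer_motif(ref_allele, period):
    """Return the most frequent k-mer of length `period` in the reference allele.

    Hypothetical helper for illustration only; not taken from HipSTR.
    Returns None if the allele is shorter than the period.
    """
    seq = ref_allele.upper()
    # Count every overlapping window of length `period`.
    counts = Counter(seq[i:i + period] for i in range(len(seq) - period + 1))
    if not counts:
        return None
    best = max(counts.values())
    # Break ties alphabetically so the result is deterministic.
    return min(motif for motif, n in counts.items() if n == best)

print(infer_motif("ACACACAC", 2))
print(infer_motif("GATAGATAGATA", 4))
```

Note that this sketch does not merge rotations (AC vs. CA) or reverse complements (AC vs. GT) of the same unit.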

shukwong commented 6 years ago

Hi Thomas,

Thanks for your help! I think adding it to the INFO field in the VCF is definitely very helpful!

For my current analysis, since all the VCFs have already been generated, it's easier to get this information from somewhere else. It looks like Tandem Repeats Finder (TRF) will give this information. I was wondering if you have it handy, to save me from figuring out how to run TRF :) If adding the repeat unit to the reference BED files isn't backward compatible, would it be possible to have another file with just the STR name and the repeat unit?

Thanks, Wendy

tfwillems commented 6 years ago

Hi @shukwong ,

I've gone ahead and updated all of the HipSTR references to include an additional column that contains the repeat motif on the 5' strand. If you redownload the appropriate reference, the information should be available now.

I've had to change the ID names of the STRs in the reference, as modifications to the reference-building process slightly changed the number of repeats in the human genome. As a result, you won't be able to match your STRs to the new reference using the STR_.... id column, as they're now labeled HumanSTR..... and the numbers don't line up. However, you should be able to match them up by chromosome, start and stop coordinates.
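
The coordinate-based matching described above could be sketched roughly as follows in Python. The column layout is an assumption (tab-separated, with chromosome, start and stop in the first three columns and the STR name in the sixth), the helpers `load_bed` and `match_ids` are hypothetical, and the rows are made up for illustration.

```python
import csv
import io

def load_bed(handle, name_col=5):
    """Map (chrom, start, stop) -> STR name for each row of a tab-separated BED.

    `name_col` is an assumed 0-based index of the name column; adjust it
    to match the actual reference file.
    """
    names = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        names[(row[0], int(row[1]), int(row[2]))] = row[name_col]
    return names

def match_ids(old_names, new_names):
    """Pair old and new STR names that share identical coordinates."""
    shared = old_names.keys() & new_names.keys()
    matched = {old_names[key]: new_names[key] for key in shared}
    unmatched = sorted(old_names[key] for key in old_names.keys() - shared)
    return matched, unmatched

# Made-up rows for illustration only (not real reference entries).
old_bed = io.StringIO("chr1\t100\t120\t2\t10\told_A\nchr1\t300\t330\t3\t10\told_B\n")
new_bed = io.StringIO("chr1\t100\t120\t2\t10\tnew_A\tAC\n")

matched, unmatched = match_ids(load_bed(old_bed), load_bed(new_bed))
print(matched)
print(unmatched)
```

With real files, the `io.StringIO` objects would be replaced by opened file handles. Old entries with no identical coordinates in the new reference end up in `unmatched`.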

Most repeats will have a single motif reported in the column. However, some repeats have multiple repeat motifs delimited by a "/" character. For these loci, the STR is actually made up of several merged Tandem Repeats Finder entries that all share the same repeat unit length. For instance, a repeat that looks like ACACACACAGTGTGTGTATATATAT might have a repeat motif column like AC/GT/AT.
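
For aggregating STRs by repeat unit, a minimal Python sketch of handling the "/"-delimited column might look like this. The column positions (name in the sixth column, motif in the seventh) are assumptions, `group_by_motif` is a hypothetical helper, and the rows are made up; a locus with several motifs is listed under each of them.

```python
from collections import defaultdict

def group_by_motif(rows, name_col=5, motif_col=6):
    """Group STR names by repeat motif, splitting multi-motif entries on "/".

    `name_col` and `motif_col` are assumed 0-based column indices.
    """
    groups = defaultdict(list)
    for row in rows:
        for motif in row[motif_col].split("/"):
            groups[motif].append(row[name_col])
    return dict(groups)

# Made-up rows for illustration only (not real reference entries).
rows = [
    ["chr1", "100", "125", "2", "12.5", "locus_A", "AC/GT/AT"],
    ["chr1", "200", "220", "2", "10", "locus_B", "AC"],
]

groups = group_by_motif(rows)
print(groups)
```

Whether a merged locus should count toward every one of its motifs, or be handled separately, depends on the analysis; the sketch simply assigns it to all of them.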

Let me know if you have any questions or issues

Best, Thomas

shukwong commented 6 years ago

Hi Thomas,

Thank you very much for generating the reference with the repeat units for me - it helped tremendously! I was able to map the new IDs to the old ones, with the exception of 7 STRs (which is not a problem).

Best, Wendy

tfwillems commented 6 years ago

Great, glad to hear it was helpful! Those 7 repeats had inconsistent motifs between hg19 and hg38 and so were filtered out.

Best, Thomas