plagnollab / DNASeq_pipeline

Pipeline in place at the UGI for DNA level analysis

chromosome name in reference fasta and -L option #7

Closed by pontikos 9 years ago

pontikos commented 9 years ago

The chromosome name differs depending on the reference. For example, the 1kg reference doesn't have the "chr" prefix, whereas the hg19 reference does:

hg19:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

1kg:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

This is problematic because the -L option used by GATK tools has to match the naming in the reference: either -L chr1 or -L 1. We need a way of accounting for this. One idea is to simply look at the first fasta header line of the reference and see whether the sequence name starts with "chr".
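A rough Python sketch of that header check, in case it helps. The function names `prefix_from_header` and `detect_chr_prefix` are made up for illustration and are not part of the pipeline; it also assumes the reference is an uncompressed FASTA whose first header is representative of the whole file.

```python
def prefix_from_header(header_line):
    """Return "chr" if a FASTA header names its sequence chr*, else ""."""
    # The sequence name is the first whitespace-delimited token after ">".
    tokens = header_line.lstrip(">").split()
    name = tokens[0] if tokens else ""
    return "chr" if name.startswith("chr") else ""


def detect_chr_prefix(fasta_path):
    """Read only as far as the first header line of the reference FASTA."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                return prefix_from_header(line)
    raise ValueError("no FASTA header found in " + fasta_path)
```

With the two headers quoted above, `prefix_from_header` gives "chr" for the first and "" for the second.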

vplagnol commented 9 years ago

Just add a "prefix" variable when the reference is set. If the reference uses the convention without the chr, set prefix to ""; otherwise set prefix to "chr".
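Something along these lines, as a sketch only. `interval_args` is a hypothetical helper, not code from the pipeline, and it assumes the GATK command is assembled as a list of arguments.

```python
def interval_args(prefix, chromosome):
    """Build the -L arguments for one chromosome under the given prefix."""
    name = str(chromosome)
    # Normalise first, so that both "1" and "chr1" are accepted as input.
    if name.startswith("chr"):
        name = name[len("chr"):]
    return ["-L", prefix + name]


# prefix is set once, at the point where the reference is chosen
prefix = "chr"
print(" ".join(interval_args(prefix, 1)))
```

The calling code then never has to care which reference is in use; only the place where the reference is set knows about the naming convention.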

pontikos commented 9 years ago

Fixed with commit 8888a488505690a91736727d98e98c7ad439d60d