mdozmorov / genome_runner

Academic Free License v3.0

dbcreator: Filtering spurious chromosome names #18

Closed mdozmorov closed 10 years ago

mdozmorov commented 10 years ago

Tables often contain genomic coordinates on chromosomes outside the standard set, e.g.,

chr17_ctg5_hap1 chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chrUn_gl000248

This leads to incorrect p-value calculations in some cases, making random overlaps appear significant.

How can we filter out such genomic features? One option is to call awk from within dbcreator, but is there another way?

mdozmorov commented 10 years ago

The following regular expression selects only the standard chromosomes:

zcat gwasCatalog.bed.gz | grep "\bchr[0-9XYM][^_]\b" > gc_clean.bed
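One caveat with the grep pattern above is that it is tested against the whole line, so a record on a non-standard chromosome could still pass if a later column (for example a name field) happens to contain something like "chr1 ". A minimal sketch of a stricter filter, assuming tab-delimited BED input with "chr"-prefixed names, anchors the test to the first column only. The function name filter_std_chroms is hypothetical and not part of dbcreator.

```shell
# Hypothetical helper: keep only records whose FIRST column is a
# numbered chromosome or chrX / chrY / chrM. Reads stdin, writes stdout.
filter_std_chroms() {
    awk -F'\t' '$1 ~ /^chr([0-9]+|[XYM])$/'
}

# Example: the haplotype and unplaced records are dropped.
printf 'chr1\t100\t200\nchr6_cox_hap2\t5\t10\nchrUn_gl000248\t1\t2\nchrX\t7\t9\n' \
    | filter_std_chroms
```

It could replace the grep stage in the pipeline, e.g. zcat gwasCatalog.bed.gz | filter_std_chroms > gc_clean.bed.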

Additionally, ensure that each end coordinate is larger than its start coordinate; this step also needs to be automated.

awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' gc_clean.bed | sort -k1,1 -k2,2n -k3,3n | uniq > 3gwasCatalog+.bed && bgzip 3gwasCatalog+.bed && tabix 3gwasCatalog+.bed.gz
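Note that the awk command above reprints exactly six columns for every corrected record, so a file with more than six columns loses the extra ones on those rows, and a file with fewer gains empty trailing fields. A possible variant, sketched here under the assumption that the input is tab-delimited, changes only the third field and prints the record otherwise untouched. The name fix_end_coords is hypothetical.

```shell
# Hypothetical helper: if end <= start, set end = start + 1.
# All other columns are kept as they are, however many there are.
fix_end_coords() {
    awk 'BEGIN { FS = OFS = "\t" } { if ($3 <= $2) $3 = $2 + 1; print }'
}

# Example: the first record has end == start and gets end = start + 1;
# the second record is already valid and passes through unchanged.
printf 'chr1\t100\t100\trs1\nchr1\t5\t9\n' | fix_end_coords
```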

mdozmorov commented 10 years ago

Or, combining the two:

zcat snp138.bed.gz | grep "\bchr[0-9XYM][^_]\b" | awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' | sort -k1,1 -k2,2n -k3,3n | uniq > 2snp138+.bed && bgzip 2snp138+.bed && tabix 2snp138+.bed.gz

mdozmorov commented 10 years ago

Processing the whole database:

for file in $(find /path/to/database/ -type f -name "*.bed.gz"); do f=$(basename "$file"); d=$(dirname "$file"); echo "$file"; zcat "$file" | grep "\bchr[0-9XYM][^_]\b" | awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' | sort -k1,1 -k2,2n -k3,3n | uniq > "$d/${f%???}" && rm "$file"; bgzip "${file%???}" && tabix "$file"; done
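The one-liner above can also be written as a small script, which may be easier to rerun and to debug. The following is only a sketch: the function names clean_bed_gz and clean_database, and the BGZIP and TABIX override variables, are hypothetical and not part of genome_runner. It assumes tab-delimited BED files, matches chromosome names against the first column only, and reads paths line by line so that file names containing spaces work (names containing newlines are not handled).

```shell
#!/bin/sh
# Clean one .bed.gz file in place: keep standard chromosomes, force
# end > start, sort, de-duplicate, then recompress and index.
clean_bed_gz() {
    file=$1
    tmp="${file%.gz}"    # x.bed.gz -> x.bed
    gzip -dc "$file" \
        | awk 'BEGIN { FS = OFS = "\t" }
               $1 ~ /^chr([0-9]+|[XYM])$/ { if ($3 <= $2) $3 = $2 + 1; print }' \
        | sort -k1,1 -k2,2n -k3,3n | uniq > "$tmp" || return 1
    # Only remove the original once the cleaned file has been written.
    # BGZIP and TABIX are hypothetical overrides for the two commands.
    rm "$file" && ${BGZIP:-bgzip} "$tmp" && ${TABIX:-tabix -p bed} "$file"
}

# Apply clean_bed_gz to every *.bed.gz file under the directory in $1.
clean_database() {
    find "$1" -type f -name '*.bed.gz' | while IFS= read -r file; do
        echo "$file"
        clean_bed_gz "$file" || echo "failed: $file" >&2
    done
}
```

After sourcing the script, a run would look like clean_database /path/to/database/. Since plain sh has no pipefail, the "|| return 1" check only catches a failure of the last command in the pipeline, so it is worth testing on a copy of the database first.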

lkscara commented 10 years ago

This sounds good! It would need to be done before the optimizer is run, correct?

lkscara commented 10 years ago

To keep the stats consistent, would we not need to run this on the background and all uploaded files as well?

mdozmorov commented 10 years ago

This is the last step after database creation, which includes "custom_data/fois, backgrounds, gfs". After filtering everything, one can run the optimizer.

mdozmorov commented 10 years ago

We keep it as a separate database post-processing step, performed from the command line.