mdozmorov closed this issue 10 years ago
The following regular expression selects only the standard chromosomes (chr1-chr22, chrX, chrY, chrM):
zcat gwasCatalog.bed.gz | grep "\bchr[0-9XYM][^_]\b" > gc_clean.bed
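As a quick check of what the pattern keeps, here is a minimal sketch run on a few made-up rows (the sample data and variable names are illustrative, not taken from the database). It also shows one possible stricter alternative, assuming the chromosome name is always in the first column: an awk test anchored to that column, so it cannot match text in a later column such as a feature name. Note that `\b` is a GNU grep extension.

```shell
# Made-up BED-like rows: two standard chromosomes, three non-standard ones.
sample=$(printf 'chr1\t100\t200\nchr10\t5\t50\nchr1_gl000191_random\t1\t2\nchr17_ctg5_hap1\t3\t4\nchrUn_gl000248\t7\t9\n')

# The pattern from the command above, applied to whole lines.
kept=$(printf '%s\n' "$sample" | grep "\bchr[0-9XYM][^_]\b" | cut -f1)

# Alternative sketch: test only column 1, anchored at both ends.
kept_awk=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr([0-9]+|[XYM])$/ { print $1 }')

echo "$kept"
echo "$kept_awk"
```

On this sample both variants keep only the chr1 and chr10 rows; they can differ on real data if a later column happens to contain something that looks like a chromosome name.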
Additionally, ensure the end coordinates are larger than the start coordinates; this step also needs to be automated.
awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' gc_clean.bed | sort -k1,1 -k2,2n -k3,3n | uniq > 3gwasCatalog+.bed && bgzip 3gwasCatalog+.bed && tabix -p bed 3gwasCatalog+.bed.gz
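To illustrate what the awk step does, here is a small sketch on three made-up six-column rows (the rs1-rs3 names and coordinates are invented for the example). Rows whose end is not larger than the start get their end set to start + 1; other rows pass through unchanged.

```shell
# Made-up rows: end == start, end < start, and a valid interval.
fixed=$(printf 'chr1\t100\t100\trs1\t0\t+\nchr1\t200\t150\trs2\t0\t-\nchr1\t300\t400\trs3\t0\t+\n' |
  awk 'BEGIN {OFS="\t"} { if ($3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }')

echo "$fixed"
```

One thing to keep in mind: a corrected row is always printed with exactly six columns, so a file with fewer columns gains empty trailing fields on those rows, and a file with more columns loses the extras.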
Or, combining the two:
zcat snp138.bed.gz | grep "\bchr[0-9XYM][^_]\b" | awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' | sort -k1,1 -k2,2n -k3,3n | uniq > 2snp138+.bed && bgzip 2snp138+.bed && tabix -p bed 2snp138+.bed.gz
Processing the whole database:
for file in $(find /path/to/database/ -type f -name "*.bed.gz"); do
  f=$(basename "$file")   # file name, e.g. x.bed.gz
  d=$(dirname "$file")    # directory containing the file
  echo "$file"
  # Filter, fix coordinates, sort, deduplicate; ${f%???} strips the ".gz" suffix
  zcat "$file" | grep "\bchr[0-9XYM][^_]\b" | awk 'BEGIN {OFS="\t"} { if ( $3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' | sort -k1,1 -k2,2n -k3,3n | uniq > "$d/${f%???}" && rm "$file"
  bgzip "${file%???}" && tabix -p bed "$file"
done
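Because the loop removes each original file, it may be worth trying the pipeline on a small sample first. A minimal sketch, assuming the same filter and coordinate fix as above, wraps the steps in a shell function that reads stdin and writes stdout (the name `clean_bed` and the sample rows are hypothetical):

```shell
# Hypothetical helper: filter chromosomes, fix coordinates, sort, deduplicate.
clean_bed() {
  grep "\bchr[0-9XYM][^_]\b" |
    awk 'BEGIN {OFS="\t"} { if ($3 <= $2) { print $1, $2, $2+1, $4, $5, $6 } else { print $0 } }' |
    sort -k1,1 -k2,2n -k3,3n | uniq
}

# Made-up input: unsorted, one duplicate, one bad interval, one haplotype contig.
cleaned=$(printf 'chr2\t50\t60\ta\t0\t+\nchr1\t300\t200\tb\t0\t-\nchr1\t10\t20\tc\t0\t+\nchr1\t10\t20\tc\t0\t+\nchr6_cox_hap2\t1\t2\td\t0\t+\n' | clean_bed)

echo "$cleaned"
```

Inside the loop, the same function could then be used as `zcat "$file" | clean_bed > "$d/${f%???}"`.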
This sounds good! It would need to be done before the optimizer is run, correct?
To keep the stats consistent, would we not need to run this on the background and all uploaded files as well?
This is the last step after database creation, which includes "custom_data/fois, backgrounds, gfs". After filtering everything, one can run the optimizer.
We keep it as a separate database post-processing step, performed from the command line.
Often tables contain genomic coordinates on chromosomes other than the standard set, e.g.,
chr17_ctg5_hap1 chr1_gl000191_random chr1_gl000192_random chr4_ctg9_hap1 chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chrUn_gl000248
This leads to incorrect p-value calculations in some cases, making random overlaps appear significant.
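To see whether a given table is affected, one could tally the chromosome names that fall outside the standard set. The sketch below does this on a few made-up rows; for a real file the `printf` would be replaced by `zcat` on that file. The assumption here is that "standard" means chr followed by digits, X, Y, or M.

```shell
# Count rows per non-standard chromosome name (made-up input rows).
counts=$(printf 'chr1\t1\t2\nchr6_cox_hap2\t5\t9\nchr1\t7\t8\nchrUn_gl000248\t3\t4\n' |
  cut -f1 |
  grep -v -E '^chr([0-9]+|[XYM])$' |
  LC_ALL=C sort | uniq -c |
  awk '{ print $2 "\t" $1 }')   # reorder to: name <TAB> count

echo "$counts"
```

An empty result would mean the table contains only standard chromosomes.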
How can we filter out such genomic features? I can think of calling awk from within dbcreator, but I wonder if there is another way.