mdozmorov / genome_runner

Academic Free License v3.0

dbcreator_encode: K562 bug #93

Open mdozmorov opened 9 years ago

mdozmorov commented 9 years ago

K562 is the cell line that is parsed out of the file names. But sometimes it is labeled K562b or K562E (URLs below). These are deprecated cell line names and refer to the same cell line as K562.

Can we hard-code these exceptions, so that the files are downloaded with their original names but processed specially? E.g.:

wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz - should be K562-H3k27me3-Sydh
wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz - should be K562-Fos-Uchicago

All noted bugs:

hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me1UcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k9acbUcdPk.narrowPeak.gz

hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Egata2UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjunbUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjundUniPk.narrowPeak.gz

mdozmorov commented 9 years ago

At least for the K562E bug, we need to hard-code these particular files. Simply replacing K562E with K562 will break the "K562Ezh2" files, where Ezh2 is a legitimate factor.

So, in the mislabeled files, not only should K562E become K562, but the factor name should also be capitalized, e.g., fos becomes Fos.
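One way to hard-code these exceptions is sketched below as a shell function. `fix_k562e_name` is a hypothetical helper, not part of dbcreator; it works on base names without the `.narrowPeak.gz` extension, and the corrected names are assumptions that follow the capitalization rule above. Because the five known files are listed explicitly, a name containing K562Ezh2 passes through unchanged.

```shell
# Hypothetical helper (not part of dbcreator): map the five known mislabeled
# K562E base names to corrected ones. Any other name is echoed unchanged.
fix_k562e_name() {
  case "$1" in
    wgEncodeAwgTfbsUchicagoK562EfosUniPk)   echo "wgEncodeAwgTfbsUchicagoK562FosUniPk" ;;
    wgEncodeAwgTfbsUchicagoK562Egata2UniPk) echo "wgEncodeAwgTfbsUchicagoK562Gata2UniPk" ;;
    wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk) echo "wgEncodeAwgTfbsUchicagoK562Hdac8UniPk" ;;
    wgEncodeAwgTfbsUchicagoK562EjunbUniPk)  echo "wgEncodeAwgTfbsUchicagoK562JunbUniPk" ;;
    wgEncodeAwgTfbsUchicagoK562EjundUniPk)  echo "wgEncodeAwgTfbsUchicagoK562JundUniPk" ;;
    *)                                      echo "$1" ;;
  esac
}

# A known exception is rewritten.
fix_k562e_name wgEncodeAwgTfbsUchicagoK562EfosUniPk
# An Ezh2-style name (made up for this example) is left alone.
fix_k562e_name wgEncodeExampleK562Ezh2Pk
```

An explicit list is more verbose than a pattern substitution, but it cannot touch legitimate Ezh2 files by accident.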

mdozmorov commented 9 years ago

In re K562b - we can rename the files. There are no factors that start with a lowercase "b". But there are some factors starting with a capital "B", e.g., Btf3. So, if the split algorithms in the dbcreator are case-sensitive (to my knowledge, yes), we should be good with renaming K562b to K562.
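A minimal sketch of such a case-sensitive rename, assuming it is applied to a file's base name before splitting. `fix_k562b_name` is a hypothetical helper, not part of dbcreator. `sed` substitution is case-sensitive by default, so K562b is rewritten while a factor such as Btf3 directly after K562 is not.

```shell
# Hypothetical helper (not part of dbcreator): replace the deprecated
# "K562b" label with "K562". The match is case-sensitive, so "K562B..." is kept.
fix_k562b_name() {
  printf '%s\n' "$1" | sed 's/K562b/K562/'
}

# One of the files listed in this issue: the "b" after K562 is dropped.
fix_k562b_name wgEncodeSydhHistoneK562bH3k27me3bUcdPk
# A made-up Btf3-style name: capital "B" does not match, so it is unchanged.
fix_k562b_name wgEncodeExampleK562Btf3Pk
```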

mdozmorov commented 9 years ago

Won't fix. Temporary solution - delete the 'K562b' folders, as they contain the same files as the regular 'K562' folders. The 'K562E' files will be processed into the correct 'K562' folder, but the factor will be named like 'Efos' - this can be corrected manually in the 'gf_description' file.

mdozmorov commented 8 years ago

Check for duplicates:

for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | sort | uniq | wc -l

should be equal to

for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | wc -l

The "K562E" error is fixed. The "K562b" folders should be manually deleted using:

  find . -type d -name "K562b" -exec rm -r {} \;

For hg19, 19,776 GFs become 19,771 after removing duplicates.

The duplicates are:

  2 K562-H3k9acb-SydhHistone.bed.gz
  2 K562-H3k4me3b-SydhHistone.bed.gz
  2 K562-H3k4me1-SydhHistone.bed.gz
  2 K562-H3k27me3b-SydhHistone.bed.gz
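Counts like those in the list above can be produced with `uniq -c`. The sketch below wraps this in a hypothetical function, `list_duplicate_basenames`, which takes the database directory as its argument and prints only the base names that occur more than once.

```shell
# Hypothetical helper: print "<count> <basename>" for every *.bed.gz base name
# that appears more than once under the directory given as $1.
list_duplicate_basenames() {
  find "$1" -type f -name "*.bed.gz" -exec basename {} \; |
    sort |
    uniq -c |
    awk '$1 > 1'   # keep only names counted more than once
}

# Usage, assuming the database lives in grsnp_db/ as in the commands above:
# list_duplicate_basenames grsnp_db/
```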

Filter the K562b entries out of the grsnp_db/hg19/gf_descriptions.txt file:

  grep -v "/K562b/" gf_descriptions.txt  > gf_descriptions_nodups.txt