Open mdozmorov opened 9 years ago
At least for the K562E bug, we need to hard-code these particular files. Simply replacing K562E with K562 will break the "K562Ezh2" files, where Ezh2 is a legitimate factor.
So, even in the wrong files, not only should K562E become K562, but the factors should also be capitalized, e.g., Fos.
In re K562b - we can rename the files. There are no factors that start with lowercase "b", but there are some factors starting with capital "B", e.g., Btf3. So, if the split algorithms in the dbcreator are case-sensitive (to my knowledge, yes), we should be good with renaming K562b to K562.
Won't fix. Temporary solution - delete the 'K562b' folders, as they contain the same files as the regular 'K562' folders. The 'K562E' files will be processed into the correct 'K562' folder, but the factor will look like 'Efos' - this can be corrected manually in the 'gf_description' file.
Check for duplicates:
for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | sort | uniq | wc -l
should be equal to
for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | wc -l
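The same check can also be sketched in Python, which reports which base names collide instead of only comparing two counts. This is a minimal sketch, not part of dbcreator: the function name `duplicate_basenames` is hypothetical, and the `grsnp_db` root and `*.bed.gz` pattern are taken from the shell commands above.

```python
from collections import Counter
from pathlib import Path


def duplicate_basenames(root, pattern="*.bed.gz"):
    """Return {basename: count} for file names found more than once under root."""
    # Count every matching file by its base name, ignoring the directory part.
    counts = Counter(p.name for p in Path(root).rglob(pattern) if p.is_file())
    # Keep only the names that occur in more than one place.
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicate_basenames("grsnp_db")` returns an empty dictionary when every base name is unique, which corresponds to the two shell counts being equal.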
The "K562E" error is fixed. The "K562b" folders should be manually deleted using:
find . -type d -name "K562b" -exec rm -r {} \;
For hg19, 19,776 GFs become 19,771 after removing duplicates.
The duplicates are:
2 K562-H3k9acb-SydhHistone.bed.gz
2 K562-H3k4me3b-SydhHistone.bed.gz
2 K562-H3k4me1-SydhHistone.bed.gz
2 K562-H3k27me3b-SydhHistone.bed.gz
Filter grsnp_db/hg19/gf_descriptions.txt file:
grep -v "/K562b/" gf_descriptions.txt > gf_descriptions_nodups.txt
K562 is the cell line that is parsed out of the file names, but sometimes it is labeled as K562b or K562E (URLs below). These are deprecated cell line names and are the same as K562.
Can we hard-code these exceptions, so the files are downloaded with the original names but processed specially? E.g.:
wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz - should be K562-H3k27me3b-Sydh
wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz - should be K562-Fos-Uchicago
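One way such hard-coded exceptions could look is sketched below in Python. It is an illustration only and does not reproduce the actual dbcreator code: the names `CELL_ALIASES`, `K562E_FILES`, `camel_tokens`, and `normalize` are hypothetical, and the sketch assumes that file names are split case-sensitively at capital letters, which is why "K562Efos" yields the factor "Efos". The K562E fix is keyed on the exact file names so that legitimate factors such as Ezh2 are left alone.

```python
import re

# Hypothetical exception tables (not taken from dbcreator).
# Deprecated cell labels that map onto the regular cell line name.
CELL_ALIASES = {"K562b": "K562"}

# Files in which "K562E" is a deprecated cell label, so the parsed factor
# carries a spurious leading "E". Listed explicitly to protect e.g. Ezh2.
K562E_FILES = {
    "wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562Egata2UniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562EjunbUniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562EjundUniPk.narrowPeak.gz",
}


def camel_tokens(stem):
    """Split a CamelCase string into tokens that each start with a capital letter."""
    return re.findall(r"[A-Z][a-z0-9]*", stem)


def normalize(filename, cell, factor):
    """Apply the hard-coded exceptions to an already parsed (cell, factor) pair."""
    cell = CELL_ALIASES.get(cell, cell)
    if filename in K562E_FILES and factor.startswith("E"):
        # Drop the spurious "E" and capitalize the real factor name.
        factor = factor[1:].capitalize()
    return cell, factor
```

Under the splitting assumption, `camel_tokens("K562Efos")` gives `["K562", "Efos"]`, and `normalize` turns that pair into `("K562", "Fos")` for the listed file, while a "K562Ezh2" file, which is not in `K562E_FILES`, keeps the factor `Ezh2`.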
All noted bugs:
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me1UcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k9acbUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Egata2UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjunbUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjundUniPk.narrowPeak.gz