mdozmorov / genome_runner

Academic Free License v3.0
Uniqueness of file names #74

Closed mdozmorov closed 9 years ago

mdozmorov commented 9 years ago

Each file name in the database has to be unique, because it is used as the "key" to look up Cell/Factor information in the gfAnnot table.

Right now, split genomic features have the same name across multiple cell types. E.g., for Encode chromState, '1ActivePromoter.bed.gz' exists for Gm12878, K562, etc. We need to make them unique by explicitly adding the cell type, e.g., 'Gm12878-1ActivePromoter-BroadHMM.bed.gz'. The same applies to Roadmap, e.g., 'E033-1TssA-chromStates18.bed.gz'. Note all the parts that make the name unique!
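A minimal sketch of the naming scheme described above, assuming the name is built as cell type, feature, and source joined by hyphens. The function `unique_name` and its parameters are hypothetical and are not part of genome_runner.

```python
def unique_name(cell, feature_file, source, ext=".bed.gz"):
    """Build a file name that is unique across cell types.

    cell, feature_file and source are illustrative parameters;
    the pattern <cell>-<feature>-<source><ext> follows the examples
    given in this issue.
    """
    # Strip the extension so the source tag can go before it.
    if feature_file.endswith(ext):
        base = feature_file[: -len(ext)]
    else:
        base = feature_file
    return f"{cell}-{base}-{source}{ext}"


encode_name = unique_name("Gm12878", "1ActivePromoter.bed.gz", "BroadHMM")
roadmap_name = unique_name("E033", "1TssA.bed.gz", "chromStates18")
print(encode_name)   # Gm12878-1ActivePromoter-BroadHMM.bed.gz
print(roadmap_name)  # E033-1TssA-chromStates18.bed.gz
```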

For DNase, we have 'E092-DNase_hotspot_all_peaks.bed.gz' in both 'processed_broadPeak' and 'processed_gappedPeak'. We need to make these names unique, e.g., 'E092-DNase_hotspot_all_peaks_bPk-processed.bed.gz'. The same goes for Histone_processed: attach the peak type to the file names.
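One way to spot such collisions before loading the database is to group paths by their base name. This is only a sketch: `find_duplicate_names` is a hypothetical helper, and the directory layout in the example is assumed from the folder names mentioned above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_duplicate_names(paths):
    """Return {file name: [paths]} for names that occur more than once."""
    by_name = defaultdict(list)
    for p in paths:
        # Only the base name is used as the key, so the folder is ignored.
        by_name[PurePosixPath(p).name].append(p)
    return {name: locs for name, locs in by_name.items() if len(locs) > 1}


# Assumed layout, for illustration only.
paths = [
    "processed_broadPeak/E092-DNase_hotspot_all_peaks.bed.gz",
    "processed_gappedPeak/E092-DNase_hotspot_all_peaks.bed.gz",
    "processed_broadPeak/E092-DNase_hotspot_all_peaks_bPk-processed.bed.gz",
]
dups = find_duplicate_names(paths)
for name, locs in dups.items():
    print(name, "->", locs)
```

An empty result means every file name can safely serve as a key.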