morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

Collapse redundant overlapping aSHM regions #147

Closed rdmorin closed 8 months ago

rdmorin commented 1 year ago

There are redundant aSHM regions in our bundled set of regions. These need to be collapsed to the most representative one for each locus to avoid double-counting and similar problems. Redundant regions should be flagged in the comments of this issue, and someone needs to remove them. This should all be done using the "master" list that now lives in this repo.

To get things started, here are a few examples:

chr_name hg19_start  hg19_end gene    region     regulatory_comment name
chr3      186783199 186784291 ST6GAL1 intronic-1 strong_enhancer    ST6GAL1-intronic-1 # fully contained within ST6GAL1-intronic-2
chr3      186739628 186740875 ST6GAL1 TSS-1      active_promoter    ST6GAL1-TSS-1      # fully contained within ST6GAL1-TSS-2
chr3      186781067 186784816 ST6GAL1 intronic-2 strong_enhancer    ST6GAL1-intronic-2
chr3      186737482 186741455 ST6GAL1 TSS-2      strong_enhancer    ST6GAL1-TSS-2
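Going purely by the coordinates listed above, containment can be checked mechanically. The following is a minimal Python sketch, not GAMBLR code: the `Region` tuple and the `contained_pairs` helper are hypothetical names introduced here for illustration, and the only inputs are the four ST6GAL1 rows from this comment.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


# Coordinates copied from the ST6GAL1 rows above (hg19).
regions = [
    Region("ST6GAL1-intronic-1", "chr3", 186783199, 186784291),
    Region("ST6GAL1-TSS-1", "chr3", 186739628, 186740875),
    Region("ST6GAL1-intronic-2", "chr3", 186781067, 186784816),
    Region("ST6GAL1-TSS-2", "chr3", 186737482, 186741455),
]


def contained_pairs(regions):
    """Return (inner, outer) name pairs where inner lies entirely within outer."""
    pairs = []
    for inner in regions:
        for outer in regions:
            # Skip self-comparisons and regions on different chromosomes.
            if inner is outer or inner.chrom != outer.chrom:
                continue
            if outer.start <= inner.start and inner.end <= outer.end:
                pairs.append((inner.name, outer.name))
    return pairs


redundant = contained_pairs(regions)
for inner, outer in redundant:
    print(f"{inner} is fully contained within {outer}")
```

With these four rows, the check reports ST6GAL1-intronic-1 inside ST6GAL1-intronic-2 and ST6GAL1-TSS-1 inside ST6GAL1-TSS-2. Two regions with identical coordinates would be reported in both directions, which may or may not be what is wanted for the master list.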

rdmorin commented 1 year ago

There is one larger region for PAX5 that is redundant with two smaller ones and should be removed (indicated below).

chr_name hg19_start hg19_end gene  region            regulatory_comment              name          size
chr9       37023396 37027663 PAX5  intron-1          intronic                        PAX5-intron…  4267
chr9       37369209 37372160 PAX5  distal-enhancer-1 enhancer                        PAX5-distal…  2951
chr9       37382267 37385854 PAX5  distal-enhancer-2 enhancer                        PAX5-distal…  3587
chr9       37395932 37409239 PAX5  distal-enhancer-3 enhancer                        PAX5-distal… 13307 
chr9       37032576 37037704 PAX5  TSS-1             active_promoter                 PAX5-TSS-1    5128
chr9       37021323 37043181 PAX5  TSS-2             active_promoter-strong_enhancer PAX5-TSS-2   21858 #remove this
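The opposite view, finding a large region that spans several smaller ones, can be sketched the same way. This is an illustrative Python sketch under stated assumptions: `spanned_regions` is a hypothetical helper, and the labels come from the `region` column because several values in the `name` column are truncated in the listing above.

```python
# (region label, hg19_start, hg19_end) for the PAX5 rows above; all on chr9.
pax5 = [
    ("intron-1", 37023396, 37027663),
    ("distal-enhancer-1", 37369209, 37372160),
    ("distal-enhancer-2", 37382267, 37385854),
    ("distal-enhancer-3", 37395932, 37409239),
    ("TSS-1", 37032576, 37037704),
    ("TSS-2", 37021323, 37043181),
]


def spanned_regions(regions):
    """Map each region label to the labels of the other regions it fully contains."""
    result = {}
    for label, start, end in regions:
        inside = [
            other
            for other, o_start, o_end in regions
            if other != label and start <= o_start and o_end <= end
        ]
        if inside:
            result[label] = inside
    return result


spanning = spanned_regions(pax5)
for label, inside in spanning.items():
    print(f"{label} spans {len(inside)} smaller region(s): {', '.join(inside)}")
```

On these rows only TSS-2 is flagged, spanning intron-1 and TSS-1, which matches the row marked for removal.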

rdmorin commented 1 year ago

This is a trickier one to consolidate. Removing the larger region that encompasses the two others would mean no coverage for one place that is commonly mutated. Perhaps this large region should be split into three or four consecutive regions?

chr_name hg19_start hg19_end gene   region   regulatory_comment name             size
chr9       37192080 37207549 ZCCHC7 intron-3 intronic           ZCCHC7-intron-3 15469
chr9       37282092 37309161 ZCCHC7 intron-2 intronic           ZCCHC7-intron-2 27069
chr9       37323113 37340687 ZCCHC7 intron-1 intronic           ZCCHC7-intron-1 17574
chr9       37276495 37342191 ZCCHC7 intronic NA                 ZCCHC7-intronic 65696 # split up and remove redundant overlapping regions from above?
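To see what would lose coverage if ZCCHC7-intronic were simply dropped, one can subtract the smaller regions from it. The sketch below is a hypothetical Python helper (`uncovered` is not a GAMBLR function). It assumes that adjacent intervals may share a boundary coordinate, so it applies no off-by-one adjustment, which is consistent with the `size` column being `hg19_end - hg19_start`.

```python
large = (37276495, 37342191)   # ZCCHC7-intronic
smaller = [
    (37192080, 37207549),      # ZCCHC7-intron-3 (lies outside the large region)
    (37282092, 37309161),      # ZCCHC7-intron-2
    (37323113, 37340687),      # ZCCHC7-intron-1
]


def uncovered(outer, inners):
    """Return the stretches of `outer` that none of `inners` cover."""
    gaps = []
    cursor, outer_end = outer
    for start, end in sorted(inners):
        if end <= cursor or start >= outer_end:
            continue  # no overlap with the part of outer still to be scanned
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < outer_end:
        gaps.append((cursor, outer_end))
    return gaps


gaps = uncovered(large, smaller)
for start, end in gaps:
    print(f"chr9:{start}-{end} ({end - start} bp) would lose coverage")
```

For these rows the result is three stretches of 5597, 13952 and 1504 bp: one on each flank and one between intron-2 and intron-1. The thread does not say which of them is the commonly mutated one, so that still needs to be checked against mutation data before deciding how to split the region.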

rdmorin commented 1 year ago

LPP has a few massive regions. I think these should also be split up. Perhaps one representing the "hottest" area and a few spanning the rest?

chr_name hg19_start  hg19_end gene  region   regulatory_comment name           size
chr3      187771678 187982852 LPP   TSS-1    NA                 LPP-TSS-1    211174 # split into at least 3 regions
chr3      188377178 188491248 LPP   intronic NA                 LPP-intronic 114070 # split into at least 3 regions
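As a starting point for the split, here is a minimal Python sketch that cuts a region into consecutive pieces of near-equal length. It is a simplification of the proposal above: it does not know where the "hottest" area is, so the boundaries would still need to be moved by hand. `split_region` is a hypothetical helper, and adjacent pieces are assumed to share a boundary coordinate.

```python
def split_region(start, end, n):
    """Split (start, end) into n consecutive pieces of near-equal length."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size = end - start
    # Integer arithmetic keeps the first and last boundaries exactly at start and end.
    bounds = [start + (size * i) // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# LPP-TSS-1 from the table above, split into 3 pieces.
pieces = split_region(187771678, 187982852, 3)
for i, (s, e) in enumerate(pieces, start=1):
    print(f"piece {i}: chr3:{s}-{e} ({e - s} bp)")
```

The same call with `188377178, 188491248` would cover LPP-intronic. How the resulting pieces should be named in the master list is left open here.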

rdmorin commented 1 year ago

Another huge one that should probably be split up:

chr_name hg19_start hg19_end gene  region regulatory_comment name       size
chr5       88131209 88206620 MEF2C TSS    active_promoter    MEF2C-TSS 75411
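Oversized regions like this one could also be found by a simple size filter rather than by eye. The Python sketch below uses a handful of rows quoted in this thread. The 50 kb cutoff (`MAX_SIZE`) is an arbitrary assumption for illustration, not a threshold anyone in this issue has agreed on.

```python
# name -> (hg19_start, hg19_end), copied from rows quoted in this thread.
candidates = {
    "LPP-TSS-1": (187771678, 187982852),
    "LPP-intronic": (188377178, 188491248),
    "MEF2C-TSS": (88131209, 88206620),
    "ZCCHC7-intronic": (37276495, 37342191),
    "PAX5-TSS-1": (37032576, 37037704),
}

MAX_SIZE = 50_000  # hypothetical cutoff in bp; adjust as needed

oversized = {
    name: end - start
    for name, (start, end) in candidates.items()
    if end - start > MAX_SIZE
}

# Largest first, so the worst offenders are listed at the top.
for name, size in sorted(oversized.items(), key=lambda item: -item[1]):
    print(f"{name}: {size} bp")
```

With this cutoff, every region listed except PAX5-TSS-1 is flagged.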

Kdreval commented 1 year ago

This is now fixed in this update, but I will hold off on closing this issue until GAMBLR is set up to work with the designated data storage repo.

Kdreval commented 8 months ago

The new setup has been tested for a while without any reported issues. Closing the issue as completed.