stjude-biohackathon / CRCminer

MIT License

Identify TF cliques & output them. #9

Closed j-andrews7 closed 1 year ago

j-andrews7 commented 1 year ago

This is actually probably the part I am least sure about.

Probably worth reading the networkx docs on cliques.

Input will probably be a networkx graph object (this depends on the Stage 3 choice) composed solely of SE-associated TF genes, with no downstream targets.
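As a starting point, here is a rough sketch of how clique enumeration might look. It rests on assumptions that are not settled in this issue: that the Stage 3 object is a `networkx.DiGraph` with an edge from regulator to target, that a clique means a set of TFs that all regulate each other in both directions, and that only auto-regulated TFs (self-loops) are eligible. `nx.find_cliques` only works on undirected graphs, so the directed network is reduced to its reciprocal edges first. The function name and the `require_self_loop` and `min_size` parameters are hypothetical.

```python
import networkx as nx


def find_tf_cliques(tf_graph, require_self_loop=True, min_size=2):
    """Return maximal cliques of mutually regulating TFs as sorted lists.

    Assumes tf_graph is a DiGraph whose edges point from regulator to target.
    """
    if require_self_loop:
        # Assumption: only TFs that regulate themselves are eligible.
        keep = [n for n in tf_graph if tf_graph.has_edge(n, n)]
        tf_graph = tf_graph.subgraph(keep).copy()

    # Keep an undirected edge only where regulation runs in both directions.
    mutual = tf_graph.to_undirected(reciprocal=True)
    # Self-loops carry no information for clique membership, so drop them.
    mutual.remove_edges_from(list(nx.selfloop_edges(mutual)))

    cliques = (sorted(c) for c in nx.find_cliques(mutual))
    return [c for c in cliques if len(c) >= min_size]


# Toy network with made-up TF names, for illustration only.
example = nx.DiGraph()
example.add_edges_from(
    (a, b)
    for a in ("TF_A", "TF_B", "TF_C")
    for b in ("TF_A", "TF_B", "TF_C")  # includes self-loops
)
example.add_edges_from([("TF_D", "TF_D"), ("TF_A", "TF_D")])  # one-way edge only

print(find_tf_cliques(example))
```

TF_D is left out here because its only link to the others is one-directional. Whether that is the behavior we want is part of the open question.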

Output should be in a format like this (A549_CRCs_CLIQUE_SCORES_DEGREE.txt from CRC):

['SMAD3', 'JUNB', 'STAT3', 'FOSL2', 'ETS1', 'STAT1']	42.833333333333336
['SMAD3', 'JUNB', 'STAT3', 'EHF', 'PBX1', 'STAT1', 'TEAD1', 'ETS1']	41.75
['SMAD3', 'SOX12', 'STAT3', 'RXRA', 'DBP', 'FOSL2', 'ETS1', 'STAT4', 'ZNF143', 'STAT1']	41.7
['SMAD3', 'JUNB', 'STAT3', 'FOSL2', 'ETS1', 'GLI2']	41.666666666666664
['SMAD3', 'SOX12', 'STAT3', 'TCF7L2', 'DBP', 'FOSL2', 'ETS1', 'STAT4', 'STAT1', 'ZNF143']	41.4
['SMAD3', 'SOX12', 'STAT3', 'RXRA', 'DBP', 'FOSL2', 'ETS1', 'GLI2']	41.375
['SMAD3', 'SOX12', 'STAT3', 'RXRA', 'DBP', 'EHF', 'PBX1', 'ETS1', 'ZNF143', 'STAT4', 'STAT1']	41.36363636363637

The last column is the out-degree score.
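For the scoring and output step, a minimal sketch follows. It assumes the score is the mean out-degree of the clique members, which fits the fractional values above (41.75 for an 8-member clique, for example) but which I have not checked against the CRC source. It also assumes the two columns are tab-separated. The function names are hypothetical; with a networkx `DiGraph` called `G`, `dict(G.out_degree())` could supply the `out_degree` mapping.

```python
def score_cliques(cliques, out_degree):
    """Pair each clique with the mean out-degree of its members, best first.

    Assumption: the CRC-style score is a plain mean over clique members.
    """
    scored = [
        (list(clique), sum(out_degree[tf] for tf in clique) / len(clique))
        for clique in cliques
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def format_clique_lines(scored):
    """Render lines as "<python list repr><TAB><score>" like the CRC file."""
    return [f"{members}\t{score}" for members, score in scored]


# Toy values, for illustration only.
toy_out_degree = {"TF_A": 3, "TF_B": 1, "TF_C": 2}
toy_cliques = [["TF_A", "TF_B"], ["TF_A", "TF_C"]]

toy_scored = score_cliques(toy_cliques, toy_out_degree)
for line in format_clique_lines(toy_scored):
    print(line)
```

Writing the lines to a file is then just `"\n".join(...)` on the result.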

This brings up a decision point: CRC uses only the top 100 cliques by out-degree score to calculate the clique proportions. The cliques seem largely redundant, but I don't know how likely this cutoff is to miss CRC members that vary between samples or groups. It may be worth providing a parameter to adjust this threshold.
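A sketch of what an adjustable cutoff could look like. Here "clique proportion" is assumed to mean the fraction of retained cliques that contain a given TF, which may not match CRC exactly. The function name and the `top_n` parameter are hypothetical, and the default of 100 simply mirrors the CRC behavior described above.

```python
from collections import Counter


def clique_proportions(scored_cliques, top_n=100):
    """Fraction of the top-scoring cliques that each TF appears in.

    scored_cliques: iterable of (members, score) pairs.
    top_n: number of cliques to keep; None keeps all of them.
    """
    ranked = sorted(scored_cliques, key=lambda item: item[1], reverse=True)
    top = ranked if top_n is None else ranked[:top_n]
    if not top:
        return {}
    # Count each TF at most once per clique.
    counts = Counter(tf for members, _ in top for tf in set(members))
    return {tf: n / len(top) for tf, n in counts.items()}


# Toy values, for illustration only.
toy = [(["TF_A", "TF_B"], 3.0), (["TF_A", "TF_C"], 2.0), (["TF_B", "TF_C"], 1.0)]

print(clique_proportions(toy, top_n=2))
print(clique_proportions(toy, top_n=None))
```

Comparing the output at a few values of `top_n` on real samples would be one way to see whether the top-100 cutoff drops variable members.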