jvanheld commented 4 months ago

Some data types associate a score with each sequence. This is the case, for example, for PBM, CHS, ...

We could

1. sort the sequences by score,
2. run motif discovery (in particular oligo-analysis) successively on the top 100, 200, 300, 500, 1000, 2000, ... sequences,
3. identify the number of sequences that gives the highest over-representation (sig score),
4. use these sequences to build a motif (using peak-motifs).
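
The nested selection described above could be sketched as follows. This is only an illustration: `significance` is a hypothetical callable standing in for a real motif discovery run (for example, the best sig score reported by oligo-analysis on the subset), and the function names are not taken from the repository.

```python
def nested_top_sets(scored, sizes):
    """Rank (sequence, score) pairs by decreasing score and return the
    nested top-N subsets, keyed by N. Sizes larger than the dataset are skipped."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return {n: [seq for seq, _ in ranked[:n]] for n in sizes if n <= len(ranked)}


def best_subset_size(scored, sizes, significance):
    """Evaluate each nested subset with `significance` (hypothetical placeholder
    for a motif discovery run) and return the best N with all the scores."""
    subsets = nested_top_sets(scored, sizes)
    sig_by_size = {n: significance(seqs) for n, seqs in subsets.items()}
    best_n = max(sig_by_size, key=sig_by_size.get)
    return best_n, sig_by_size
```

Each subset contains the previous one, so the only quantity that changes between runs is how far down the ranking the cut-off is placed.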

PBM example

(base) [jvanhelden@core-login1 IBIS_2024]$ awk '{print $6"\t"$8}' data/leaderboard/train/PBM/LEF1/QNZS_PBM14333.tsv | sort -nr -k 2  | head -n 10
GGATCTCTTTGATGTCCATAACCCGGCGGCTCCAC     9.649701517094455
ACCCTCCTTTCATCTATAAAGTTGAGCAGTATTTA     9.103033329369557
CTCTTTTCCGAGTCTAACATCCGAGGAATACCTAG     9.007443079654322e-06
GTCGTTTGATCTTGCGTGACCAGTCGCTAGAACAA     8.535642168189854
CAAAATAGTCCTTTGAAGAGTCCGTGAGCACTGCG     7.787498419943953
TTCAGGAGAGGACCAAAACTCTTTGATGTGTACTC     7.687981697937926
ACTGCGTTTGAAGTTATTGCCCCTAGGCTGGGCCA     7.3459739552436
TAGGAGATGAAAGACTCGTTCCCCGCCCGGGCACA     7.145635715385344
TATTCCTCTAGACGCGCCGCTCTTTGATCTGCGCG     6.821724786793346
TCTTATGAGATGAAAGGAAGATGACCTATTATGCA     6.615726015624582
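
Note that GNU `sort -n` does not interpret exponent notation, so a score written as `9.007443079654322e-06` is ranked as if it were 9.007 (`sort -g` does parse exponents). A minimal Python sketch of the same extraction and ranking, assuming a tab-delimited file with the sequence in column 6 and the score in column 8 (the function name and defaults are illustrative):

```python
import csv


def rank_pbm_rows(lines, seq_col=5, score_col=7, n=10):
    """Return the n (sequence, score) pairs with the highest scores.

    Columns are zero-based (5 and 7 correspond to awk's $6 and $8).
    Scores are parsed with float(), which handles exponent notation.
    """
    pairs = []
    for fields in csv.reader(lines, delimiter="\t"):
        try:
            pairs.append((fields[seq_col], float(fields[score_col])))
        except (IndexError, ValueError):
            continue  # skip header lines and malformed rows
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]
```

It can be called on an open file handle, for example `rank_pbm_rows(open(path))`, since `csv.reader` accepts any iterable of lines.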

CHS example

(base) [jvanhelden@core-login1 IBIS_2024]$ sort -nr -k 8  data/leaderboard/train/CHS/PRDM5/THC_0307.Rep-DIANA_0293.peaks | head
chr19   36378509        36379846        36379302        271     492.4332        105.47755       482.94229       out_peak_22419  cpics,gem,sissrs
chr1    94925377        94928455        94927095        271     461.45779       89.0151 453.88071       out_peak_2580   gem,sissrs
chr1    224356204       224357890       224357274       277     460.64499       84.18271        453.10345       out_peak_4353   cpics,gem,sissrs
chr13   108217892       108219057       108218441       251     438.99512       92.80537        431.78751       out_peak_11991  cpics,gem,sissrs
chr19   44211825        44213232        44212576        280     428.41055       67.45536        421.3558        out_peak_22839  cpics,gem,sissrs
chr5    95858580        95859326        95858923        236     425.3161        97.55226        418.30948       out_peak_33392  cpics,gem,sissrs
chr19   37370498        37371568        37371075        263     407.05875       68.41267        400.30344       out_peak_22459  cpics,gem,sissrs
chr17   17749780        17750955        17750199        195     380.21347       106.12191       373.68494       out_peak_17450  cpics,gem,sissrs
chr7    151078531       151084353       151083507       195     354.78024       91.4361 348.38312       out_peak_39067  cpics,gem,sissrs
chr1    232629275       232631503       232630883       204     350.22318       81.57689        343.85416       out_peak_4608   cpics,gem,sissrs

jvanheld commented 4 months ago

Done

1. File preprocessing: TSV files are sorted by decreasing mean signal intensity and converted to FASTA sequences. In the FASTA file, the header line of each oligonucleotide contains detailed information (spot ID, position, intensity, rank) to ease post-processing.
2. Top sequence selection: the 250, 500 and 1000 top sequences are selected for each dataset.
3. Background sequence selection: the 35,000 bottom sequences are used to build a background model, against which we test for over-represented k-mers (oligos) and dyads.
4. Motif discovery with differential analysis: motif discovery is run with peak-motifs in differential mode (top sequences versus background sequences).
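
The preprocessing and selection steps could look roughly like the sketch below. It assumes each spot is a `(spot_id, sequence, intensity)` tuple; the FASTA header layout and the function names are hypothetical and are not the ones used in the repository.

```python
def split_top_and_background(spots, top_sizes=(250, 500, 1000), n_background=35000):
    """Rank spots by decreasing intensity, then return the top-N subsets
    and the bottom-ranking spots used as background.

    Note: if the dataset has fewer spots than n_background, the background
    overlaps with the top sets.
    """
    ranked = sorted(spots, key=lambda spot: spot[2], reverse=True)
    tops = {n: ranked[:n] for n in top_sizes}
    background = ranked[-n_background:] if n_background > 0 else []
    return tops, background


def to_fasta(ranked_spots, start_rank=1):
    """Format ranked spots as FASTA, storing rank and intensity in the header
    (hypothetical header layout)."""
    records = []
    for rank, (spot_id, seq, intensity) in enumerate(ranked_spots, start=start_rank):
        records.append(f">{spot_id} rank={rank} intensity={intensity}\n{seq}")
    return "\n".join(records) + "\n"
```

The top and background FASTA files produced this way would then be passed to peak-motifs as test and control sets.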

jvanheld commented 4 months ago

@brunocontrerasmoreira , for info

Impact of the choice of the N top sequences

So far we have tested 3 values for the number of top sequences to keep: 250, 500 and 1000. The top-scoring motifs discovered are very similar, and their significance (binomial significance of k-mer over-representation) increases from 250 to 500 and from 500 to 1000.
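
To illustrate why the significance can grow with the number of sequences, here is a minimal sketch of a binomial over-representation score. It assumes that the significance is -log10 of an E-value, itself the binomial upper-tail P-value multiplied by the number of tested k-mers. This definition and all names in the code are assumptions for illustration; the thread does not spell out the formula.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_upper_tail(k, n, p):
    """Natural log of P(X >= k), summed in log space to limit underflow."""
    if k > n:
        return float("-inf")  # impossible outcome
    terms = [log_binom_pmf(i, n, p) for i in range(max(k, 0), n + 1)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def significance(occurrences, n_positions, expected_freq, n_tested):
    """Assumed definition: sig = -log10(P-value * number of tested k-mers)."""
    log10_p = log_upper_tail(occurrences, n_positions, expected_freq) / math.log(10)
    return -(log10_p + math.log10(n_tested))
```

Under this assumed definition, a k-mer observed at twice its expected frequency scores higher when the same ratio is supported by more positions, which would be consistent with the increase seen from 250 to 1000 sequences.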

RORB with the 250 top-ranking PBM spots

RORB with the 500 top-ranking PBM spots

RORB with the 1000 top-ranking PBM spots

jvanheld commented 4 months ago

Choice of the normalisation method

PBM data are provided with 2 normalisation methods: SD and QNZS (z-scores). The signal intensities and the ranking of the spots differ substantially depending on the normalisation method. We tested motif discovery with both.

For NACC2, the motif discovery results are quite different between the two. With QNZS normalisation, the sequence logos show reasonably good motifs:

Motifs discovered in the 1000 top-ranking spots of NACC2 QNZS dataset

Motifs discovered in the 1000 top-ranking spots of NACC2 SD dataset

Although both datasets return significant motifs, with SD the logos show high error bars and very irregular successions of columns with high and low information content.

jvanheld commented 4 months ago

This effect seems to depend on the TF: it is not observed with RORB or TIGD3.

jvanheld / IBIS_2024

Matrioskas approach #4
