jmschrei / tfmodisco-lite

A lite implementation of tfmodisco, a motif discovery algorithm for genomics experiments.
MIT License

Add BED and FASTA output subcommands `seqlet-bed` and `seqlet-fasta` #29

Closed: bytewife closed this 1 year ago

bytewife commented 1 year ago

See the comment below for example output.

bytewife commented 1 year ago

Okay, the output for both `seqlet-bed` and `seqlet-fasta` should be correct this time.

I checked the generated FASTA against `bedtools getfasta`, run on the BED file produced by `seqlet-bed`:

$ bedtools getfasta -fi examples/ENCSR000EGM/data/hg38.fa -bed modisco_results.bed -fo test.fa.out

which outputs:

$ head test.fa.out
>chr8:106015452-106015481
ttcaagaatattaattagaatacaaatat
>chr8:28986534-28986563
AATTTGAAGGCTATCACCTATCTACAGAA
>chr8:46594384-46594413
AAAAACAAATAAACACATGAAAAACCTCt
>chr8:15668501-15668530
actagcacgtgagccctgcccacagggac
>chr8:33173027-33173056
TGGAAAGTTCTAACCCTTCCCATCATTCC

Apart from letter case and the extra header fields (`dir=` and the pattern name), this agrees with the generated `seqlet-fasta` output:

$ modisco seqlet-fasta -i samples/set/spi1_modisco_results.h5 -o modisco_results.fasta -s samples/set/spi1.ohe.npz -p samples/set/peaks.bed --windowsize 2114 -c chr8

$ head modisco_results.fasta
>chr8:106015452-106015481 dir=- pattern_0.0
TTCAAGAATATTAATTAGAATACAAATAT
>chr8:28986534-28986563 dir=- pattern_0.1
AATTTGAAGGCTATCACCTATCTACAGAA
>chr8:46594384-46594413 dir=- pattern_0.2
AAAAACAAATAAACACATGAAAAACCTCT
>chr8:15668501-15668530 dir=- pattern_0.3
ACTAGCACGTGAGCCCTGCCCACAGGGAC
>chr8:33173027-33173056 dir=- pattern_0.4
TGGAAAGTTCTAACCCTTCCCATCATTCC
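
The comparison above was done by eye on the first few records. A minimal sketch of how it could be automated follows. It assumes that only the first whitespace-delimited token of each header (the `chrom:start-end` coordinate) identifies a record and that letter case should be ignored. The helper names `read_fasta` and `fasta_mismatches` are hypothetical and are not part of tfmodisco-lite or bedtools.

```python
def read_fasta(text):
    """Parse FASTA text into {coordinate: uppercase sequence}."""
    records = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first token, e.g. "chr8:106015452-106015481",
            # so extra header fields such as "dir=-" are ignored.
            key = line[1:].split()[0]
            records[key] = ""
        elif key is not None:
            # Uppercase so lowercase bases from the reference compare equal.
            records[key] += line.upper()
    return records


def fasta_mismatches(a_text, b_text):
    """Return the sorted coordinates whose sequences differ or are missing."""
    a, b = read_fasta(a_text), read_fasta(b_text)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Two records copied from the outputs shown in this thread.
bedtools_out = """>chr8:106015452-106015481
ttcaagaatattaattagaatacaaatat
>chr8:46594384-46594413
AAAAACAAATAAACACATGAAAAACCTCt
"""
modisco_out = """>chr8:106015452-106015481 dir=- pattern_0.0
TTCAAGAATATTAATTAGAATACAAATAT
>chr8:46594384-46594413 dir=- pattern_0.2
AAAAACAAATAAACACATGAAAAACCTCT
"""

print(fasta_mismatches(bedtools_out, modisco_out))  # prints []
```

On real files, the same functions could be fed the contents of `test.fa.out` and `modisco_results.fasta`; an empty list would mean every record agrees.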

These coordinates match the file generated by `seqlet-bed`:

$ modisco seqlet-bed -i samples/set/spi1_modisco_results.h5 -o modisco_results.bed -p samples/set/peaks.bed --windowsize 2114 -c chr8

$ head modisco_results.bed
track name="pattern_0" description="TF-MoDISco pattern 'pattern_0' on the positive strand."
chr8    106015452       106015481       pattern_0.0     1000    -
chr8    28986534        28986563        pattern_0.1     1000    -
chr8    46594384        46594413        pattern_0.2     1000    -
chr8    15668501        15668530        pattern_0.3     753     -
chr8    33173027        33173056        pattern_0.4     1000    -
chr8    54507605        54507634        pattern_0.5     1000    -
chr8    128173422       128173451       pattern_0.6     1000    -
chr8    62795869        62795898        pattern_0.7     1000    -
chr8    62761840        62761869        pattern_0.8     1000    -
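
To make the correspondence between the BED rows and the FASTA headers explicit, here is a small sketch that parses six-column BED text and rebuilds the header format seen in the `seqlet-fasta` output. The header layout (`>chrom:start-end dir=strand name`) is inferred only from the examples in this thread, and `BedRecord`, `parse_bed`, and `fasta_header` are hypothetical names, not the implementation in this PR.

```python
from typing import Iterator, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str


def parse_bed(text: str) -> Iterator[BedRecord]:
    """Yield records from BED text, skipping track/browser/comment lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        # Split on any whitespace so both tabs and spaces are accepted.
        chrom, start, end, name, score, strand = line.split()[:6]
        yield BedRecord(chrom, int(start), int(end), name, int(score), strand)


def fasta_header(rec: BedRecord) -> str:
    """Format a record like the headers shown in the seqlet-fasta output."""
    return f">{rec.chrom}:{rec.start}-{rec.end} dir={rec.strand} {rec.name}"


# Rows copied from the seqlet-bed output in this thread.
sample = """track name="pattern_0" description="TF-MoDISco pattern 'pattern_0' on the positive strand."
chr8    106015452       106015481       pattern_0.0     1000    -
chr8    15668501        15668530        pattern_0.3     753     -
"""

records = list(parse_bed(sample))
for rec in records:
    print(fasta_header(rec), rec.end - rec.start)
```

Each printed header can then be compared with the corresponding line of `modisco_results.fasta`, and the interval length checked against the sequence length.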