ulelab / peka

Find motifs enriched around prominent crosslinks
GNU General Public License v3.0

Example input files? #16

Closed grexor closed 1 year ago

grexor commented 1 year ago

Great work!

I'm trying to run PEKA on a dataset we have here. Would it be possible to get an example dataset with small example files (see the requirements below)? It would greatly help to see exactly what input formats are accepted or required. Thanks! Gregor

required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file
                        format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for
                        read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount
                        segment" function

kkuret commented 1 year ago

Dear Gregor!

Example test files are available in the github repository, in the TestData/inputs folder. In the TestData folder is also a script peka_run.sh, which shows which file is used together with each flag. Let me know if you have any more questions and I'll do my best to clarify them!

Brief description of files:

Both INPUTPEAKS and INPUTXLSITES are BED files in BED6 format. INPUTXLSITES holds the crosslink positions obtained with CLIP, and its score column should contain the number of cDNA truncations mapped to each position. INPUTPEAKS holds the peaks called on the crosslink sites with the peak-calling tool of your choice; for peaks, the value in the score column doesn't matter.
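To make the BED6 layout concrete, here is a minimal sketch with made-up coordinates. The file names, the values, and the `check_bed6` helper are illustrative assumptions, not taken from TestData/inputs, and writing each crosslink as a single-nucleotide interval is also an assumption. The six tab-separated columns are chrom, start, end, name, score and strand.

```shell
# Hypothetical crosslink file: one position per line, score = cDNA count.
printf 'chr1\t1000\t1001\t.\t3\t+\nchr1\t1005\t1006\t.\t1\t+\n' > example_xlsites.bed
# Hypothetical peak file: one interval per line; the score value is a placeholder.
printf 'chr1\t995\t1010\t.\t0\t+\n' > example_peaks.bed

# Print the number of lines that do not look like BED6:
# six tab-separated fields, integer start < end, strand "+" or "-".
check_bed6() {
  awk -F '\t' '
    NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ ||
    $2 + 0 >= $3 + 0 || ($6 != "+" && $6 != "-") { bad++ }
    END { print bad + 0 }
  ' "$1"
}

echo "invalid lines in crosslinks: $(check_bed6 example_xlsites.bed)"
echo "invalid lines in peaks: $(check_bed6 example_peaks.bed)"
```

This check only looks at the column layout; it says nothing about whether the coordinates match the genome you pass to PEKA.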

GENOMEFASTA is a FASTA file for the genome assembly you're using. It can be obtained from GENCODE or Ensembl, for example "Genome sequence, primary assembly (GRCh38)" on the GENCODE website: https://www.gencodegenes.org/human/.

GENOMEINDEX is a FASTA index file (.fai) generated from GENOMEFASTA with samtools faidx.
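As a small illustration of what the index looks like, the sketch below builds a .fai for a tiny made-up FASTA (`tiny.fa` is a hypothetical file, used only so the example is quick to run). It assumes samtools is on the PATH and otherwise just prints a note.

```shell
# Tiny hypothetical FASTA used only to illustrate the index layout.
printf '>chrT\nACGT\nACGT\n' > tiny.fa

if command -v samtools >/dev/null 2>&1; then
  # Writes tiny.fa.fai next to the FASTA.
  samtools faidx tiny.fa
  # Columns: name, length, byte offset of the first base,
  # bases per line, bytes per line (including the newline).
  cat tiny.fa.fai
else
  echo "samtools not found; install it to build the .fai index" >&2
fi
```

For a real genome the same command is run on the full FASTA, and the resulting .fai file is what gets passed with -gi.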

REGIONS is a GTF file that stratifies the genome into transcript regions: CDS, intron, 3UTR, 5UTR, ncRNA, and intergenic. This file is generated with the iCount "segment" function from the annotation GTF corresponding to the reference genome (for example, "Comprehensive gene annotation (PRI)" GTF on the GENCODE website: https://www.gencodegenes.org/human/).

Easy access to genome files: the genome-related files (GENOMEFASTA, GENOMEINDEX and REGIONS) can be obtained for multiple organisms from the iMaps web server for analysis of CLIP data, https://imaps.goodwright.com/genomes/, which removes the need to run samtools faidx or iCount segment yourself.

Here are the relevant files for the hg38 reference genome, available from iMaps, that one can use with PEKA:
GENOMEFASTA - https://imaps.goodwright.com/data/763262578127/
GENOMEINDEX - https://imaps.goodwright.com/data/287791143752/
REGIONS - https://imaps.goodwright.com/data/779414602441/
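Putting the five inputs together, a run could be assembled roughly as below. The flags come from the help text quoted in this thread; the executable name `peka` and all file names are assumptions for illustration, and the peka_run.sh script in TestData remains the authoritative example.

```shell
# Hypothetical input paths; replace them with your own files.
PEAKS=example_peaks.bed
XLSITES=example_xlsites.bed
GENOME=genome.fa
GENOME_INDEX=genome.fa.fai
REGIONS=regions.gtf

# Report any input that is missing before starting a run.
missing=0
for f in "$PEAKS" "$XLSITES" "$GENOME" "$GENOME_INDEX" "$REGIONS"; do
  [ -f "$f" ] || { echo "missing input: $f" >&2; missing=$((missing + 1)); }
done

# Assumption: PEKA is installed and callable as `peka`.
PEKA_CMD="peka -i $PEAKS -x $XLSITES -g $GENOME -gi $GENOME_INDEX -r $REGIONS"
if [ "$missing" -eq 0 ] && command -v peka >/dev/null 2>&1; then
  $PEKA_CMD
else
  # Nothing is executed here; the command is only shown for inspection.
  echo "would run: $PEKA_CMD"
fi
```

Printing the command instead of running it when something is missing makes it easier to spot a wrong path before a long job starts.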


grexor commented 1 year ago

Dear Klara, thanks so much! This is super useful. I will test it out asap and report here; it looks like I have everything I need now. Cheers, Gregor

grexor commented 1 year ago

Thanks again, works! Cheers, Gregor

grexor commented 1 year ago

Pretty cool about the Git LFS solution to pull the GRCh38.p12.genome.masked.fa.gz file. I didn't know this one and needed some time to figure out that the following would download it for me (perhaps worth adding to the docs?):

git lfs install
git lfs pull

Then one needs to unzip GRCh38.p12.genome.masked.fa.gz and index the result with samtools faidx GRCh38.p12.genome.masked.fa.

This is needed because peka_run.sh uses the .fa file (-g 'inputs/GRCh38.p12.genome.masked.fa'), while the repo contains the .fa.gz file. Not so important, just letting you know in case you want to consolidate this. Nice! (Please feel free to close this issue anytime.)
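The steps in this comment can be collected into one short sketch. The helper name `prepare_genome` is hypothetical, and the TestData/inputs path is inferred from the folder and the -g argument mentioned in the thread, so adjust it to where the file actually sits in your clone.

```shell
# Decompress a gzipped FASTA next to the original (keeping the .gz)
# and index it when samtools is available.
prepare_genome() {
  gz="$1"
  fa="${gz%.gz}"
  gunzip -c "$gz" > "$fa" || return 1
  if command -v samtools >/dev/null 2>&1; then
    samtools faidx "$fa"   # writes "$fa.fai"
  else
    echo "samtools not found; skipping index for $fa" >&2
  fi
}

# Inside a clone of the repository (path assumed, see above):
#   git lfs install
#   git lfs pull
#   prepare_genome TestData/inputs/GRCh38.p12.genome.masked.fa.gz
```

Using `gunzip -c` rather than plain `gunzip` leaves the LFS-tracked .fa.gz untouched, so the working tree does not show it as deleted.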