PEKA on intronless genomes

sykorami commented 1 year ago

When I try to run PEKA I get this message:

"Getting thresholded crosslinks Thresholding intron Not able to find any thresholded sites in your sample (NoneType). Exiting."

I have tried this with iCount xlsites and both iCount peak and Clippy peaks as inputs. Is this because my GTF has no annotated introns? I have a custom GTF file with only gene on 3rd collumn.

sykorami commented 1 year ago

If the GTF contains at least "intron, intergenic and CDS" in the 3rd collumn, PEKA moves on. However, after detection of noxn and ntxn I am now stuck with another error message:

Traceback (most recent call last): File "peka.py", line 1461, in main() File "peka.py", line 1457, in main set_seeds File "peka.py", line 1125, in run reference, genome, genome_chr_sizes, window + kmer_length, window + kmer_length, merge_overlaps=False File "peka.py", line 543, in get_sequences return [line.split("\t")[1].strip() for line in open(seq_tab.seqfn)] File "peka.py", line 543, in return [line.split("\t")[1].strip() for line in open(seq_tab.seqfn)] IndexError: list index out of range

kkuret commented 1 year ago

Hi! Indeed having a GTF without all the specified regions, causes issues in PEKA, as it will try to generate results for composite regions, such as "whole_gene", which comprises of introns, CDS and UTRs. The second error is related to sequence extraction from the genome, but I don't know why it occurred - I suppose it occurred in the intronic region? Can you please explain how you added introns into the GTF and whether you also added crosslinks to intronic sites? Could you please paste the full output that was printed before the second error? If you share your input files with me (you can upload them to dropbox or google drive and send a link to klara.kuret@ki.si), I will try to reproduce the error and add a fix. Thanks for your report!

sykorami commented 1 year ago

Hi! I have changed the last intergenic region to intron region in the GTF just to have the intron requirement fullfilled. I want this to run on the whole genome (as UTRs are not annotated on the GTF). I have shared the input files with you in here.

The error message prior the traceback was: Getting thresholded crosslinks Thresholding intron lenght of df_reg for intron is: 96 Thresholding intron runtime: 0.00 min Thresholding intergenic lenght of df_reg for intergenic is: 121112 Thresholding intergenic runtime: 0.11 min Thresholding cds_utr_ncrna lenght of df_reg for cds_utr_ncrna is: 439304 Thresholding cds_utr_ncrna runtime: 0.32 min Thresholding runtime: 0.43 min for 125120 thresholded crosslinks 559762 total sites. All sites taging runtime: 0.42 min 125120 thresholded sites on genome 559762 all sites on genome noxn 411358 on genome ntxn 55340 on genome

kkuret commented 1 year ago

Thank you! I'll have a look

kkuret commented 1 year ago

HI! I found the cause for the second error - your fasta file was formatted as ASCII text, with CRLF line terminators. This means that the sequence was wrapped with nonstandard a line separator "^M$", like so:

SaciDSM639^M$ GGGGAGTTAAATGACTGTAATCCCTATTTCAAAATAGTTTGAGGTAACACCATTGGAAATTATTGTAAAC^M$ TAAAATAACGTATGAAACATATGTATTCATGTGGCAATTAGGGAAACGCTGAAAGGTGGCAAAGGGGAGG^M$

This formatting causes a faulty output in bedtools getfasta and this is why PEKA errors when it tries to process it. The error was resolved by changing your fasta file to UNIX format, by running dox2unix your-fasta-file

Please, try changing your fasta file to UNIX format, run PEKA with it, and let me know if it resolved your issue.

As for the first error you reported regarding the need to have all regions annotated in the GTF, I'll implement a fix in the future, so there won't be a need to make "hackish" adjustments to input files.

sykorami commented 1 year ago

Hi Klara! Seems like the formatting of the fasta file did the job. Thanx a lot!

ulelab / peka

PEKA on intronless genomes #17