ulelab / peka

Find motifs enriched around prominent crosslinks
GNU General Public License v3.0
5 stars 2 forks source link

test data works, but own data gives: ValueError: Overlapping IntervalIndex is not accepted. #23

Closed MWSprivate closed 7 months ago

MWSprivate commented 8 months ago

Hi there

Thanks for the program. I installed it and running the test data works fine. Moving to my own data (the first is what I understood from the documentation but all fail):

peka -i iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf

gives me an error:

Getting thresholded crosslinks Thresholding intron lenght of df_reg for intron is: 1038414 Traceback (most recent call last): File "/home/name/miniconda3/envs/peka/bin/peka", line 8, in sys.exit(main()) File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1462, in main set_seeds File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1051, in run df_txn = get_threshold_sites(sites_file, percentile=percentile) File "/home/name/miniconda3/envs/peka/bin/peka.py", line 497, in get_threshold_sites df_cut = cut_sites_with_region(df_reg, df_region) File "/home/name/miniconda3/envs/peka/bin/peka.py", line 422, in cut_sites_with_region df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m) File "/home/name/miniconda3/envs/peka/bin/peka.py", line 409, in cut_per_chrom df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p) File "/home/name/miniconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut raise ValueError('Overlapping IntervalIndex is not accepted.') ValueError: Overlapping IntervalIndex is not accepted.

Chromosome names and sorting seem to match and the files are iCount outputs (I removed some chromosomes and resorted while testing, but the original files also did not work). I uploaded my files:

https://e.pcloud.link/publink/show?code=XZCJGeZ4hXmsgv6hmLEf71rt8yzRhSgzqWk

What's wrong? :)

kkuret commented 8 months ago

Hi!

This occurs, because the features in your GTF regions file are overlapping - when you run iCount segment, on the annotation GTF, this produced the segmenation file that you have provided in your PEKA run, but it also produces another file called "regions.gtf.gz". This is the file that should be used for PEKA.

In regions.gtf the overlapping features such that a genomic interval is only assigned to one of these features: ncRNA, intergenic, CDS, UTR3, UTR5, intron. Note that there are no gene and transcript features anymore.

You can see how the file looks like in test data inputs: https://github.com/ulelab/peka/blob/main/TestData/inputs/sorted.regions.gtf.gz

Please let me know if this resolves your issue.

Cheers!

MWSprivate commented 8 months ago

Hi

Thanks for the help. I had a quick look, might be that my iCount version (2.0.0) doesn't match. It only produces one file (help message below). I'll check another version once I have some spare time.

Best regards

usage: iCount segment [-h] [-prog] [-S] [-F] [-P] [-M] annotation segmentation fai

Parse annotation file into internal iCount structure - segmentation.

Currently, only annotations from ENSEMBl and GENCODE are supported. http://www.gencodegenes.org/ http://www.ensembl.org

Segmentation is used in almost all further analyses.

In segmentation, each transcript is partitioned into so called regions/intervals. Such regions must span the whole transcript, but should not intersect with each other. However, higher hierarchy levels: transcripts and genes can of course intersect each other.

Example of possible segmentation::

Genome level: |---------------------------------------------------|

Gene level:    |--------------gene1--------------|   |-intergenic-|
                             |---------gene2--------|

Transcript l.: |----------transcript1---------|
                       |-------transcript2-------|
                             |------transcript3-----|

Region level:  |-CDS-||-intron-||-CDS-||-UTR3-|

For simplicity, only the partition of transcript1 is presented.

positional arguments: annotation Path to input GTF file segmentation Path to output GTF file fai Path to input genome_file (.fai or similar)

options: -h, --help show this help message and exit -prog, --report_progress Switch to show progress (default: False) -S , --stdout_log Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF. -F , --file_log Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF. -P , --file_logpath Path to log file. -M , --results_file File into which to store Metrics.

kkuret commented 8 months ago

Indeed the version discrepancy is likely the reason. I am running iCount-Mini v 3.0.1. It can be installed with pip or with conda. This should fix the issue :)