ulelab / peka

Find motifs enriched around prominent crosslinks
GNU General Public License v3.0
5 stars 2 forks source link

Working with cross-link sites identified using iCount with overlapping indicies #22

Closed a-thind closed 6 months ago

a-thind commented 9 months ago

I have been running iCount xlsites to identify cross-link sites using read quantification in my eCLIP sample replicates for PEKA. I selected read quantification as the input BAM files have been UMI-pruned, and the bam files PCR-deduplicated. I have noticed several overlapping indices between identified cross-link sites in each replicate. I have considered using other cross-link site detection such as htseq-clip. However, there isn't a column for cDNA numbers. How would you recommend I deal with the overlapping sites?

kkuret commented 8 months ago

Dear @a-thind!

I am very sorry for the late reply to your issue. Do I understand correctly that some sites from your replicates are overlapping? For PEKA, we recommend grouping the overlapping sites into a single BED entry and summing up their cDNA scores from both replicates.

Here is a useful bash one-liner for such file merging, which utilizes the groupby methond in bedtools - the inputs are gzipped BED files from replicates. You need bedtools installed for this, and don't forget to replace the filenames in the brackets with your input files and output file name :)

zcat <input.bed1> <input.bed2> | sort -k 1,1 -k2,2n -k3,3n -k6,6 | bedtools groupby -i stdin -g 1,2,3,6 -c 5 -o sum | awk '{OFS="\t"}{print $1,$2,$3,".",$5,$4}' | gzip > <output.combinedbed>

You can also use any other mode of table manipulation to combine your bed files in this way - it can be done for example in python, using pandas groupby

a-thind commented 8 months ago

Thank you for your reply. Yes, I am trying to merge multiple replicates together and then run PEKA, but I have noticed within a given replicate bed file of significant cross-link sites, there are overlapping sites, e.g.:

chr1 47389708 47389709 gene1-gene2 4 - chr1 47389709 47389710 gene1-gene2 1 -

I opted to use the read quantification mode instead of the cDNA mode as I encountered an error message stating that the randomer barcode was not found, perhaps due to UMI pruning? I would like to know if the cDNA mode is essential for cross-link site quantification or if the --quant readmode is also compatible with PEKA?

kkuret commented 8 months ago

Hi! PEKA can work with any of these quantification modes, as they essentially return the same format. Did you deduplicate your reads prior to pruning the UMIs? If so, then using --quant readsshould be fine.

If you did not deduplicate your reads, I would strongly suggest that you do - otherwise sites defined by reads with most PCR duplications will have highest count, instead of being proportionate to the amount of crosslinking in a given region.

Concerning PEKA, the format you posted is absolutely fine! These two entries do not overlap - one entry is ar position 47389708 and the other at position 47389709. These should not be merged, as PEKA works with single-nucleotide positions (i.e. crosslinks).

I'd be happy to answer any more questions. If PEKA gives you an error you can share the files you've used and the error, so I will be better able to help.

a-thind commented 8 months ago

Thank you for your advice. I have run PEKA with --quant reads, as the data providers have shared that the BAM files are PCR de-duplicated and used the merge replicate snippet to group the cross-link sites across replicates. I have shared a subset of the input files.

I think the overlapping error I received was due to an incompatible segment regions file, as now I can obtain output files with PEKA with the merged replicate bed files:

Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 14038129
Thresholding intron runtime: 23.54 min
Thresholding intergenic
lenght of df_reg for intergenic is: 2111382
Thresholding intergenic runtime: 7.66 min
Thresholding cds_utr_ncrna
lenght of df_reg for cds_utr_ncrna is: 4755368
Thresholding cds_utr_ncrna runtime: 7.04 min
Thresholding runtime: 38.59 min for 1718111 thresholded crosslinks
20897850 total sites. All sites taging runtime: 36.88 min
1499168 thresholded sites on whole_gene
17669323 all sites on whole_gene
Subsampling whole_gene crosslinks, 3000000 randomly selected crosslinks used.
noxn 2977305 on whole_gene
ntxn 59185 on whole_gene
Kmer positional counting runtime: 0.10 min
Reference positional counts runtime: 1.01 min
Number of possible k-mers: 4096
Probability (%) that k-mer was found at position in sampled roxn (rounded to 2 decimal points): 100.0
strict prtxn threshold: 90
relaxed prtxn threshold: 80
minimal percentage of 0-values across positions: 2.150634765625
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.0, 8.275862068965518
Analysing whole_gene runtime: 5.56
Analysing whole_gene in seconds per thresholded_crosslink: 0.005634747540879272
935887 thresholded sites on intron
14031719 all sites on intron
Subsampling intron crosslinks, 3000000 randomly selected crosslinks used.
noxn 2984009 on intron
ntxn 33601 on intron
Kmer positional counting runtime: 0.05 min
Reference positional counts runtime: 0.92 min
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.3448275862068966, 6.551724137931035
Analysing intron runtime: 4.60
Analysing intron in seconds per thresholded_crosslink: 0.008212164411132972
296788 thresholded sites on UTR3
1945638 all sites on UTR3
noxn 1887377 on UTR3
ntxn 24938 on UTR3
Kmer positional counting runtime: 0.05 min
Reference positional counts runtime: 0.53 min
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.0, 5.517241379310345
Analysing UTR3 runtime: 3.21
Analysing UTR3 in seconds per thresholded_crosslink: 0.0077143439180400305
266493 thresholded sites on other_exon
1691966 all sites on other_exon
noxn 1690366 on other_exon
ntxn 646 on other_exon
Kmer positional counting runtime: 0.00 min
Reference positional counts runtime: 0.48 min
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.0, 3.4482758620689657
Analysing other_exon runtime: 2.53
Analysing other_exon in seconds per thresholded_crosslink: 0.23496258332633382
124508 thresholded sites on ncRNA
1117764 all sites on ncRNA
noxn 1113367 on ncRNA
ntxn 2202 on ncRNA
Kmer positional counting runtime: 0.00 min
Reference positional counts runtime: 0.30 min
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.0, 0.6896551724137931
Analysing ncRNA runtime: 1.93
Analysing ncRNA in seconds per thresholded_crosslink: 0.05259881768845949
88038 thresholded sites on intergenic
2110763 all sites on intergenic
No thresholded sites found after intersecting with the peak file. Skipping region.
1711714 thresholded sites on genome
20897850 all sites on genome
Subsampling genome crosslinks, 3000000 randomly selected crosslinks used.
noxn 2979971 on genome
ntxn 61387 on genome
Kmer positional counting runtime: 0.11 min
Reference positional counts runtime: 0.94 min
prtxn confidence: 80
Found 5 clusters, optimized parameters: 0.0, 6.206896551724139
Analysing genome runtime: 5.51
Analysing genome in seconds per thresholded_crosslink: 0.005384728883177206
Analysis total runtime 98.87

However, I notice I have a few runtime warnings in my PEKA log file:

/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/data/home/.conda/envs/peka/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
kkuret commented 7 months ago

Dear Anisha, this is great news that you managed to get PEKA to run smoothly now. The runtime warnings are nothing to worry about - they happen because the code was written for an older version of numpy. I've resolved these warnings in the code, that is currently on the dev branch, but haven't yet merged the update to the main branch, because I also need to update the setup.py file to generate the bioconda package correctly. However, if you'd like to use the updated version of the code, you can simply install the code from the dev branch, and the errors you saw should go away.

a-thind commented 7 months ago

That's reassuring; thank you so much for your help!