suhrig / pingpongpro

Find ping-pong signatures in piRNA-Seq data like a pro

FDR groupings? #4

Open OwenWato opened 3 years ago

OwenWato commented 3 years ago

I would like to know why the FDR values output in both the ping_pong_signatures and transposons files are grouped rather than unique for each individual signature. For example, one results file has 240 identified ping-pong signatures, but each of the 240 has an FDR of either 0.7, 0.14, 0.46 or 0.64.

testes_ping-pong_signatures.txt

suhrig commented 3 years ago

The FDR is calculated empirically from background noise. PingPongPro assigns background signatures to one of 4000 buckets according to their properties, so the finest possible resolution of the FDR is a step size of 1/4000 = 0.00025. That is only the theoretical limit: in practice, some buckets may remain empty, which makes the steps coarser. The number of signatures in your sample is low, and the same is probably true of the background signatures. Without seeing the data, it is hard to determine why. Is it a sample with shallow sequencing depth, by any chance? In any case, a low number of background signatures explains the step-function nature of the FDR values.
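To illustrate the effect, here is a minimal sketch. It is not PingPongPro's actual formula: the scoring scale, the binning rule `bucket_of`, the function `background_tail_fraction`, and the toy data are all assumptions made up for this example. The only point is that any estimate looked up from a sparsely filled background histogram can take at most as many distinct values as there are occupied buckets, plus one.

```python
from bisect import bisect_left

N_BUCKETS = 4000  # bucket count mentioned in the comment above


def bucket_of(score, n_buckets=N_BUCKETS):
    """Map a score in [0, 1] to a bucket index (hypothetical binning rule)."""
    return min(int(score * n_buckets), n_buckets - 1)


def background_tail_fraction(observed_scores, background_scores, n_buckets=N_BUCKETS):
    """For each observed score, return the fraction of background signatures
    that fall into the same or a higher bucket (a stand-in for the FDR)."""
    bg = sorted(bucket_of(s, n_buckets) for s in background_scores)
    if not bg:
        raise ValueError("at least one background signature is required")
    values = []
    for score in observed_scores:
        b = bucket_of(score, n_buckets)
        # Number of background signatures in bucket b or above.
        n_bg_at_or_above = len(bg) - bisect_left(bg, b)
        values.append(n_bg_at_or_above / len(bg))
    return values


# Made-up toy data: 240 observed signatures, but only 4 background
# signatures, which occupy 3 of the 4000 buckets.
observed = [i / 240 for i in range(240)]
background = [0.1, 0.1, 0.5, 0.9]

estimates = background_tail_fraction(observed, background)
print(sorted(set(estimates)))  # [0.0, 0.25, 0.5, 1.0]
```

With only three occupied background buckets, the 240 observed signatures share just four distinct values, even though the 4000 buckets would in principle allow a much finer resolution.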

OwenWato commented 3 years ago

I am running it on merged BAM files, where I merge samples by tissue type. Reads are mapped to the honeybee genome, for which there is evidence of ping-pong biogenesis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5452642/. Perhaps I will need to run the tool on individual samples and merge its output, rather than run it on the merged BAM files, which may increase overall background signatures. The sequencing depth is quite good for each individual sample.

suhrig commented 3 years ago

Hm, merging the BAM files should increase the sequencing depth and therefore increase the background signatures. So I would not expect that running the samples separately would make things better, but it's worth a try.

Another explanation could be that the libraries have a high duplication rate, resulting in poor coverage of the piRNA clusters.