sr320 / ceabigr

Workshop on genomic data integration with an emphasis on epigenetic data (FHL 2022)

Decide on minimum read count for exon expression #88

Closed kubu4 closed 9 months ago

sr320 commented 10 months ago

20x

kubu4 commented 10 months ago

Is this per exon, or the mean across all exons per gene?

sr320 commented 10 months ago

I think we want to keep exons with low or no expression. Let's set the threshold on the sum of read counts across the first 6 exons (since that is what we are looking at): 1000.

kubu4 commented 10 months ago

Okay, when I do this (sum read coverage across first 6 exons per gene), I end up with only 2,497 genes having a sum of >= 1000.

< 1000      >= 1000
  35763        2501

kubu4 commented 10 months ago

Can check out code here:

https://github.com/sr320/ceabigr/blob/a6a5f241329e6a0fd9a647ece32930832a5ccb91/code/65-exon-coverage.qmd#L357-L375
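For readers who don't want to open the link, here is a rough Python sketch of the idea discussed above: sum read counts across the first 6 exons of each gene in one sample, then tabulate how many genes fall below or at/above a threshold. This is not the code in `65-exon-coverage.qmd`; the data layout (a dict of gene ID to per-exon counts), the function names, and the toy values are all assumptions made for illustration.

```python
# Illustrative sketch only; not the code from 65-exon-coverage.qmd.
# Assumed (hypothetical) layout: gene ID -> per-exon read counts, in exon order.

def first_exons_sum(exon_counts, n_exons=6):
    """Sum read counts over the first n_exons exons of one gene."""
    # Slicing tolerates genes with fewer than n_exons exons.
    return sum(exon_counts[:n_exons])


def tabulate_threshold(sample, threshold, n_exons=6):
    """Count genes whose first-exon sum is below vs. at or above threshold."""
    below = 0
    at_or_above = 0
    for counts in sample.values():
        if first_exons_sum(counts, n_exons) >= threshold:
            at_or_above += 1
        else:
            below += 1
    return {"below": below, "at_or_above": at_or_above}


# Toy data, invented for this example.
toy_sample = {
    "geneA": [400, 300, 200, 100, 50, 25, 9000],  # first six sum to 1075
    "geneB": [10, 0, 0, 5, 0, 0],                 # sums to 15
    "geneC": [60, 40, 0, 0],                      # only four exons, sums to 100
}

for threshold in (100, 500, 1000):
    print(threshold, tabulate_threshold(toy_sample, threshold))
```

Note that the seventh exon of `geneA` is ignored, matching the "first 6 exons" rule, and that the comparison is `>=`, matching the table headers in this thread.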

sr320 commented 10 months ago

What would the gene count be if we reduced the sum to 100?

kubu4 commented 10 months ago

< 100    >= 100
  6510    31754

sr320 commented 10 months ago

Greater than 500?

kubu4 commented 10 months ago

< 500     >= 500
31627       6637

sr320 commented 10 months ago

Let's go forward with > 100.

kubu4 commented 10 months ago

Alrighty, we may need to make further adjustments. Those numbers above were just from a single sample that I was using for code testing.

I've managed to write code to look at all the files and do the threshold filtering for all samples on a per-gene basis. I.e., a gene is retained only if its exon coverage sum meets the threshold n in every sample.

Threshold  Genes
       10  23101
       25  18119
       50  13827
       75  10357
      100   7485
      500    385

kubu4 commented 10 months ago

Relevant code section:

https://github.com/sr320/ceabigr/blob/6b4bf89801a50180993331a927830d820b42e5e3/code/65-exon-coverage.qmd#L437-L486
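As a rough illustration of the all-samples filter described above (again, not the linked code), the sketch below keeps a gene only if its first-6-exon read sum meets the threshold in every sample. The nested-dict layout, the function name, and the toy numbers are assumptions; treating a gene that is missing from any sample as failing is also an assumption, not something stated in this thread.

```python
# Illustrative sketch only; not the code from 65-exon-coverage.qmd.
# Assumed (hypothetical) layout: sample name -> {gene ID -> per-exon read counts}.

def genes_passing_all_samples(samples, threshold, n_exons=6):
    """Return the gene IDs whose first-n_exons sum is >= threshold in every sample."""
    if not samples:
        return set()
    # Assumption: a gene absent from any sample cannot pass, so only
    # genes shared by all samples are considered.
    shared = set.intersection(*(set(genes) for genes in samples.values()))
    return {
        gene
        for gene in shared
        if all(sum(genes[gene][:n_exons]) >= threshold for genes in samples.values())
    }


# Toy data, invented for this example.
toy_samples = {
    "sample1": {
        "geneA": [50, 30, 20, 0, 0, 0],   # sums to 100
        "geneB": [5, 5, 0, 0, 0, 0],      # sums to 10
        "geneC": [300, 200, 0, 0, 0, 0],  # sums to 500
    },
    "sample2": {
        "geneA": [10, 0, 0, 0, 0, 0],     # sums to 10
        "geneB": [20, 0, 0, 0, 0, 0],     # sums to 20
        "geneC": [400, 200, 0, 0, 0, 0],  # sums to 600
    },
}

for threshold in (10, 25, 100, 600):
    kept = genes_passing_all_samples(toy_samples, threshold)
    print(threshold, len(kept))
```

Because one low sample is enough to drop a gene, the retained count under this rule can fall much faster than in a single-sample tabulation as the threshold rises.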

sr320 commented 10 months ago

Let's go with a threshold of 10.