rkweku / miRador

Plant miRNA identification tool that utilizes a variety of filters to validate predicted miRNAs
GNU General Public License v3.0

RPM normalization may penalize low accumulated smallRNAs #12

Open thalescherubino opened 5 months ago

thalescherubino commented 5 months ago

Hi Reza,

I've noticed that in this version of miRador, RPM normalization is applied by default. While this normalization can be useful, it may inadvertently remove lowly expressed fragments that hold meaningful biological information. For instance, imagine an input library of approximately 20 million reads being scaled down to 1 million during normalization.

During this process, a sequence with a low raw count, such as 5, would become 0.25 after normalization (5 × 1,000,000 / 20,000,000). I'm concerned that the Python implementation may round this value to 0, resulting in the loss of information. This could pose a problem if users intend to follow Blake and Axtell's criteria for miRNA annotation, as the miRNA* is typically lowly expressed and its presence is required by certain analysis packages.
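To make the concern concrete, here is a minimal sketch of the arithmetic. It assumes normalization is a simple `count * 1,000,000 / library_size` followed by conversion to an integer; the function name `rpm` is hypothetical and I have not checked how miRador actually implements this step.

```python
def rpm(raw_count, library_size, scale=1_000_000):
    """Scale a raw read count to reads per `scale` reads (hypothetical helper)."""
    return raw_count * scale / library_size


library_size = 20_000_000          # roughly 20 million reads in the library
normalized = rpm(5, library_size)  # 5 raw reads -> 0.25 RPM

# Either common way of converting to an integer drops the sequence entirely.
rounded = round(normalized)        # 0
truncated = int(normalized)        # 0
```

If the normalized value is kept as a float everywhere downstream, nothing is lost; the problem only appears once the value is converted to an integer.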

I have two suggestions to address this issue:

(i) Consider rounding any non-zero value (e.g., 0.25) up to 1 to prevent data loss.

(ii) Allow users to define a normalization factor, such as RP30M, while preventing them from inflating or deflating it excessively based on actual library sizes. Additionally, miRador could potentially select the most appropriate normalization factor automatically.
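Both suggestions could look roughly like the sketch below. This is only an illustration under my own assumptions: `normalize`, `choose_scale`, and the `max_fold` bound are hypothetical names, and using the median library size as the automatic choice is just one possible heuristic, not something miRador does today.

```python
import statistics


def normalize(raw_count, library_size, scale=1_000_000):
    """Scale a raw count to reads per `scale`, keeping observed reads above zero."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    rounded = round(raw_count * scale / library_size)
    # Suggestion (i): a sequence that was observed at least once never drops to 0.
    if raw_count > 0 and rounded == 0:
        return 1
    return rounded


def choose_scale(library_sizes, requested=None, max_fold=10):
    """Suggestion (ii): pick a normalization factor.

    With no request, fall back to the median library size (an assumed heuristic).
    A user-supplied factor is accepted only if it lies within `max_fold` of that
    median; `max_fold` is an arbitrary illustrative bound.
    """
    median = statistics.median(library_sizes)
    if requested is None:
        return int(median)
    if not (median / max_fold <= requested <= median * max_fold):
        raise ValueError("requested factor is too far from the actual library sizes")
    return requested


sizes = [20_000_000, 18_000_000, 25_000_000]
scale = choose_scale(sizes, requested=30_000_000)  # RP30M, accepted
kept = normalize(5, 20_000_000)                    # floored at 1 instead of 0
rescaled = normalize(5, 20_000_000, scale=scale)   # 7.5 before rounding
```

With the floor in place, a raw count of 5 survives even under plain RPM, and with a factor closer to the real library size the floor is rarely needed at all.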

Best regards, Thales

rkweku commented 5 months ago

Hello Thales,

Thank you so much for writing, and thank you for these suggestions. I'll look into applying the suggested changes in the near future and will reply once I am able to implement them.

Best, Reza