michaelgruenstaeudl / PACVr

Plastome Assembly Coverage Visualization in R

Generic `printCovStats` #29

Closed alephnull7 closed 5 months ago

alephnull7 commented 5 months ago

The functions verboseInformation and writeTables have been renamed printCovStats and printCovValsAsTable, respectively.

Because of earlier work of mine, specifically creating and using the source sequence as quadripRegions when no region partitioning is done, extending printCovValsAsTable to cover such cases fit the way data was already being passed around. To better understand what the function does and to ease future maintenance, I refactored printCovValsAsTable into smaller helper functions and, in the process, moved all of these components of printCovStats into a new file, verboseInformation.R. For now, printCovStats is the only function called from PACVr.verboseInformation besides checkIREquality (currently located in IROperations.R), but given the purpose and amount of code involved, a separate file seemed necessary.

On a similar note, the long overdue migration of the code that transforms read.gb data into the forms used by PACVr has been done; that code now lives in read.gb2PACVr.R.

In addition to the refactoring of printCovValsAsTable, I made minor changes so that the column names of the output coverage data reflect which "regions" are being analyzed. Currently this covers two cases: quadripartite regions, where the term "Chromosome" is used, and the entire source, where the term "Source" is used.

When testing these changes to printCovValsAsTable in both cases, the only difference I could see, apart from the changes in column names and the names used for the sequences, was in the lowCoverage column for ir_regions. This matches my understanding of the coverage analysis: except for the assignment of ir_regions$lowCoverage using cov_regions, the region data appears to be used only for referencing, naming, and grouping. That is, this assignment seems to be the only place where the subset a sequence belongs to is, or could be, taken into account. As a result, the threshold used for ir_regions$lowCoverage when the subset is the entire source differs from the threshold used when the subset is the quadripartite region the sequence is part of.
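
To illustrate why the choice of subset changes the lowCoverage flags, here is a minimal sketch in Python rather than PACVr's R code. It assumes, purely for illustration, that a window is flagged as low coverage when its value falls below the mean minus one standard deviation of the subset it is grouped into. The function low_coverage_flags, the threshold rule, and the coverage values are all hypothetical and are not taken from PACVr.

```python
from statistics import mean, pstdev


def low_coverage_flags(coverage, groups):
    """Flag each window whose coverage is below (mean - sd) of its own group.

    The (mean - sd) rule is an assumption made for this sketch only.
    """
    thresholds = {}
    for group in set(groups):
        values = [c for c, g in zip(coverage, groups) if g == group]
        thresholds[group] = mean(values) - pstdev(values)
    return [c < thresholds[g] for c, g in zip(coverage, groups)]


# Hypothetical per-window coverage values.
coverage = [100, 102, 98, 60, 20, 22, 21, 12]

# Case 1: windows grouped by region (labels are illustrative).
by_region = ["LSC"] * 4 + ["SSC"] * 4
# Case 2: no partitioning, so every window belongs to the whole source.
whole_source = ["Source"] * 8

flags_by_region = low_coverage_flags(coverage, by_region)
flags_whole_source = low_coverage_flags(coverage, whole_source)

for cov, region_flag, source_flag in zip(coverage, flags_by_region, flags_whole_source):
    print(cov, region_flag, source_flag)
```

In this toy data, the window with coverage 60 is flagged when compared against its own region but not when compared against the whole source, which is the kind of difference described above.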

Edit: Due to factoring the creation of the verbose files directory into a separate function getVerbosePath, a distinction between printCovStats and printCovValsAsTable no longer made sense, so the helper function printCovValsAsTable no longer exists.

michaelgruenstaeudl commented 5 months ago

Hi Greg, your recent edits are excellent! Yes, breaking printCovValsAsTable into smaller, more readable functions is a great improvement, which will make debugging (should it ever become necessary) much easier. Also, your interpretation of the region information (i.e., the quadripartite structure info) is correct: it is only used for referencing, naming, and especially grouping the coverage values.