microbiome / miaViz

Microbiome Analysis Plotting and Visualization
https://microbiome.github.io/miaViz
Artistic License 2.0
9 stars 10 forks source link

Adding Neatsort Function to miaViz #129

Open SHillman836 opened 3 weeks ago

SHillman836 commented 3 weeks ago

So the neatsort function acts on ordinated data and sorts before it's plotted in a heatmap, so hierarchical clustering isn't used which can often distort relationships. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-45 talks about this in great detail.

https://github.com/microbiome/miaViz/pull/128 is my pull request where you can see my initial implementation of this method. It uses radial theta angle sorting which I took from https://github.com/microbiome/microbiome/blob/master/R/neatsort.R

However, there are a few things to discuss - 1) Right now, this neatsort method acts on a raw matrix and returns the a vector containing the correct sorted order that the user can apply to their original data. Should it act on the TreeSE container aswell, and sort the original assay directly?

Right now the function could be called with neatsort(reducedDim(tse, "MDS")), and it returns a vector of row names you can apply to your original data with original_data[rowname_vector, ] - please see the example in the ./man/neatsort.rd file on the PR for more info.

But do we want to apply the function directly to a TreeSE object, or other object for that matter? And then apply the sorting directly to the assay of choice? It's around 2 lines extra for the user if they have to input a raw matrix.

2) In the testing, I used a slightly different structure than some of the other test files. I encapsulated the tests for every method I tested in it's own test_that block. I think the testing could be made more modular across some of the other methods. Thoughts on this?

@antagomir @TuomasBorman

TuomasBorman commented 3 weeks ago

Thanks!

  1. This should take TreeSE (or SingleCellExperiment) as an input,
  2. Unit tests are imporant and could be improved. However, I think the units tests should have similar structure in all methods. I am ok with this more structured approach, but then I think other tests should also be updated (that is bigger thing and requires different PR)
  3. As this is not plotting function, this should be implemented in mia.
  4. I am wondering what is the best approach. The function could also add order to rowData() in addition of sorting it.
  5. As this is new method, it should strictly follow Bioconductor guidelines of best practices. To check the code, run BiocCheck::BiocCheck(). More from here: https://contributions.bioconductor.org/ . Moreover, it should also follow the same coding style that is used in mia packages

These are my first thoughts, I can add couple of comments to PR even though mia is the right place

The usage of the function could look like this

tse <- scater::runPCA(tse)
tse <- neatsort(tse, reduceddim = "PCA")
SHillman836 commented 3 weeks ago

OK great, yes I've had a look at your comments and I'll implement them. Ah, ok, Leo said it was for miaViz but I can change it to mia. There is actually a neatsort method in mia, but it's very old and kind of bloated and tries to apply ordination depending on the object (https://github.com/microbiome/microbiome/blob/master/R/neatsort.R). So maybe this should just be changed? Or should I create a new method that does what was in this PR?

Ok yeah, let's think about unittest changes after this is implemented. Yeah I ran all those community guideline commands, but I'll go over coding style in mia and make changes to the style.

Yes I'll make it on a TreeSE object. Should I drop the matrix implementation altogether then?

TuomasBorman commented 3 weeks ago

That link is to microbiome package which is different from mia. The method is best to structure so that the internal function uses matrix. Then the matrix support for user could be easily added if needed.

Let's wait Leo's comments on the package (mia or miaViz)

SHillman836 commented 3 weeks ago

Ah my mistake, sounds good.

antagomir commented 3 weeks ago

Some additional points to consider. These touch key design decisions.

Neatsort is a way to sort samples sequentially by their similarity. Its basic form does 2D ordination for the data, then uses the theta angle method to sort samples based on this 2D ordination. However, this sample sorting is then (in typical use cases) used to sort samples in the original abundance matrix in heatmap visualizations. Hence ordination is just an intermediate step and in most cases one does not focus on sorting the ordination but the original data before the ordination (it could be interesting to visualize the sorted ordination on heatmap, too, but that's another thing..).

Ideally, then, we have a function (neatsort) that provides sorting for samples. This sorting can then be used to support visualization. Something like:

# Sort 
o <- neatsort(tse, "relabundance", ...)
# Plot sorted data
heatmap(tse[, o])

One might like to sort samples based on some other data than assays. Hence in my opinion the neatsort should also support matrices. It can support different structures (matrix, vector, (Tree)SE), like some other mia methods do as well. They often check the type of the object and if that is Tree/SE, they extract the assay and then pass that on to the core function which is often implemented for plain numeric matrices. Check some functions from mia for examples.

Neatsort does two things: a) data ordination; b) sorting of the ordinated data. There are many different ordination methods. This makes it more difficult to combine these into a single function. We probably want more generic approach, like

# Get the 2D ordinated data matrix with e.g. PCA but also MDS etc are possible
x <- (tse, ...)
# Sort samples based on that 2D ordination using theta angle method
s <- neatsort(x)
# Apply sorting on original data to create a nice heatmap
heatmap(tse[, s])

I am not sure if it makes sense to combine steps 1-2 for the above mentioned reasons. We would like to use neatsort as a generic method that can operate with any ordination matrix (or other 2D or even ND object where N is 1 or higher)

If the main purpose of neatsort is to support visualization then it might logically go to miaViz as well. Might help to reduce dependencies in mia (I think this had some new deps again..).

Unit tests & formatting issues can follow once these design details are fixed and the implementation ok.

TuomasBorman commented 3 weeks ago

Thanks for the clarification, now I also read the publication on those "neat maps", and the plan seems clearer.

  1. As this is only used in visualization, miaViz is fine.

  2. Is the "neatsort" established name? In R, sort() sorts the data, but order() gives the rank indices. Also, we have agreed with the common naming system: get outputs calculated values and add adds the values in TreeSE. This function could be getNeatorder as this is not sorting the data in a way that it means in base R.

  3. I would not apply ordination to the function itself. That complicates the function.

  4. I am fine also with the matrix function. For instance, unifrac method in mia is implemented in a way that supports both matrix and TreeSE.

These are not generic functions that should be used but shows the idea


neatsort <- function(matrix, ){}

neatsort <- function(tse, reduceddim){
    # Get the reducedDim
    matrix <- reducedDim(tse, reduceddim)
    # Get the order
    res <- neatsort(matrix)
}
antagomir commented 3 weeks ago
  1. "neatsort" is not fixed name, it was just used in some earlier packages and can be changed; could be getNeatOrder
  2. yes, agreed
  3. we can support TreeSE (and reducedDim) here but it is also very easy to just write getNeatOrder(reducedDim(tse, "MDS")), it might be good to think what is the actual added value supporting TreeSE for this kind of auxiliary function as it will also complicate the function and its maintenance. Having support for TreeSE is ok but we are likely to bump into the same question later again with some other auxiliary functions as well. It would be good to be clear on the added value/need.
SHillman836 commented 3 weeks ago

Thanks both for the comments and discussion. So am I correct in saying that the next steps are -

Then the only other thing which I'm not fully sure on is whether I should make a TreeSE implementation for this? We're definitely agreed on the way to call both the matrix implementation and TreeSE implementation if we add it, but yeah should I add it?

I think it adds 2 lines of code if you just have the matrix implementation. You have to call the function with getNeatOrder(reducedDim(tse, "MDS")), and then you have to apply the sorting order to the original assay data.

antagomir commented 3 weeks ago

TreeSE is just a few lines to add if the implementation is otherwise ready, so you could proceed already without it and then we can easily add it later as we can continue to discuss that. Would that work?

I would like to be more clear on justification for the TreeSE support before we implement it.

SHillman836 commented 3 weeks ago

Yes perfect, just pushed with new formatting changes - https://github.com/microbiome/miaViz/pull/128/ - so there may be one or two more changes Tuomas picks up but the implementation is pretty much done I think