Metadata Shuffle-Groups Mapping to Original Group name

cherman2 commented 1 year ago

Improvement Description: QIIME Metadata Shuffle should provide some sort of mapping that shows which original group label each fake group label represents.

Current Behavior: Currently, Metadata shuffle creates fake groupings that it labels Fake.group.n. However, it is difficult to know which original group each fake group represents.

Example use case: ANCOM-BC identified many differentially abundant features in a metadata group that has a very small sample size. I am unsure whether this is because of the small sample size or because of a meaningful signal. I am using Metadata shuffle to test this, but I need to know which fake group corresponds to my original group with the small sample size.

Proposed Behavior: Metadata shuffle should have an optional flag that appends the original metadata group names. Ex.: instead of Fake.group.n, the label would be Fake.group.original-name.

ebolyen commented 1 year ago

I may be misremembering, but a random group doesn't relate to any original group at all: all samples are shuffled into arbitrarily many groups with no consideration of the original label. It's basically a single bootstrap permutation.

cherman2 commented 1 year ago

I was under the impression that it kept the same group sizes, though? I am using it to see if the group with a small sample size will get a similar signal when the samples are shuffled, but I need to be able to see which samples now belong to that small group.

ebolyen commented 1 year ago

Ah, you are right, the group sizes are preserved. I was thinking about this the wrong way around: the samples are arbitrarily assigned to the groups, but the group levels are mapped 1-1 against the originals, such that a mapping could exist.
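To make the size-preserving behavior concrete, here is a minimal sketch in plain Python. It is not the q2-metadata implementation: `shuffle_groups`, its `seed` argument, and the choice to number the levels in sorted order are all assumptions made for illustration.

```python
import random
from collections import Counter


def shuffle_groups(sample_to_group, seed=None):
    """Hypothetical helper (not the q2-metadata code).

    Replaces each original level with an anonymous label, then permutes
    the labels across samples so group sizes stay the same.
    """
    rng = random.Random(seed)
    # Assumption: levels are numbered in sorted order, one fake label each.
    levels = sorted(set(sample_to_group.values()))
    fake = {level: f"fake.group.{i}" for i, level in enumerate(levels, start=1)}
    # Permuting the label vector preserves how many samples carry each label
    # but makes the membership of each group random.
    labels = [fake[group] for group in sample_to_group.values()]
    rng.shuffle(labels)
    return dict(zip(sample_to_group.keys(), labels))


original = {"s1": "control", "s2": "control", "s3": "control", "s4": "treated"}
shuffled = shuffle_groups(original, seed=0)

# Group sizes before and after the shuffle match (here: one group of 3, one of 1).
original_sizes = sorted(Counter(original.values()).values())
shuffled_sizes = sorted(Counter(shuffled.values()).values())
```

Because only the assignment of samples is randomized, each fake level keeps the size of exactly one original level, which is why a 1-1 mapping could exist.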

I think I'd prefer that we avoid anything suggesting the "realness" of this data. Would it be sufficient to encode the size of the group for your purpose? I think that should be effectively the same thing, since the samples are random within a group anyway.

ebolyen commented 1 year ago

Or some mechanism to show the distribution of a given category, which could be useful outside of the permutation stuff?

Some generic MetadataColumn -> Barplot viz?

cherman2 commented 1 year ago

This was talked about in the dev meeting. We are using this as a good first issue.

Just to re-explain this issue: Metadata shuffle-groups shuffles your metadata column values randomly so you can test whether your results are related to your metadata values or are just random noise. Currently, once the values are shuffled, there is no way to know which "random group" relates to which "true metadata value". This is deliberate, so that users do not accidentally use their shuffled metadata as real metadata later down the line. However, I had a use case where I needed to know which shuffled group corresponds to the real metadata group, so that I could test whether a metadata group with a small n was seeing real results or random noise.

I think this could be a common use case of metadata shuffle-groups.

Current behavior:

Groups are named "fake.group.[1-n]", where n is the number of metadata values.

Proposed behavior:

I think we should add a boolean flag like --p-encode-sample-size; if it is passed, the random group names also include the sample size.
Ex: fake.group.1.n=12

This would allow users to know which "fake group" corresponds to their small-sample-size group. It also still prevents users from accidentally using the shuffled metadata as real metadata.
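A rough sketch of what the size-encoded labels could look like, in plain Python. The helper name `encode_sample_size` and the input dictionary are hypothetical, and only the `fake.group.1.n=12` label format comes from the proposal above; how the flag would actually be wired into the plugin is left open.

```python
from collections import Counter


def encode_sample_size(shuffled):
    """Hypothetical helper: append '.n=<group size>' to each fake label."""
    counts = Counter(shuffled.values())
    return {
        sample: f"{label}.n={counts[label]}"
        for sample, label in shuffled.items()
    }


# Illustrative shuffled column: three samples in one fake group, one in another.
shuffled = {
    "s1": "fake.group.1",
    "s2": "fake.group.2",
    "s3": "fake.group.1",
    "s4": "fake.group.1",
}
labelled = encode_sample_size(shuffled)
# labelled["s2"] is "fake.group.2.n=1", so the small group is easy to spot.
```

One caveat with this design: if two original groups have the same size, the size alone does not tell them apart, though in that case the two fake groups are statistically interchangeable anyway.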

qiime2 / q2-metadata

Metadata Shuffle-Groups Mapping to Original Group name #57