microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
agglomerate by rank duplicated taxa names #617

Closed li-fangyeo closed 4 months ago

li-fangyeo commented 4 months ago

Hi again.

May I know what I am doing wrong here? I am trying to agglomerate by rank, but whether I do it per taxonomic level or with agglomerateByRanks, I get duplicates of the species (see rownames). How do I get rid of them?

package.version('mia')
[1] "1.13.34"

read in abundance table from Metaphlan

tse <- importMetaPhlAn("direct_merged.txt",remove.suffix = TRUE, assay.type = "relabundance")

attach metadata to tse

colData(tse) <- DataFrame(meta)

agglomerate by rank

tse <- agglomerateByRanks(tse)
altExp(tse, "species")

class: TreeSummarizedExperiment
dim: 6079 42
metadata(1): agglomerated_by_rank
assays(1): relabundance
rownames(6079): s__Abiotrophia_defectiva s__Abiotrophia_defectiva_1 ... s__Zoogloea_SGB41465 s__Zoogloea_SGB41465_1
rowData names(9): kingdom phylum ... strain clade_name
colnames(42): J02 J04 ... TRk125M TRk136M
colData names(14): Code Group ... Age AgeGroup
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

Best, Li-Fang

antagomir commented 4 months ago

@Daenarys8 do you see a solution here?

Daenarys8 commented 4 months ago

Hi @li-fangyeo , I would like to reproduce this problem you are facing. Can you share the dataset file? direct_merged.txt

antagomir commented 4 months ago

If the data is sensitive you can replace it with similar toy data that generates the same error, and share by email.

li-fangyeo commented 4 months ago

Is there an email address I can send it to? I received a message saying the file size was too large.

On Mon, Jul 29, 2024 at 9:39 AM Leo Lahti @.***> wrote:

If the data is sensitive you can replace it with similar toy data that generates the same error, and share by email.

antagomir commented 4 months ago

This kind of thing should be shared as a text/csv file instead of copypasted code.

antagomir commented 4 months ago

For large files you can use https://filesender.funet.fi/

li-fangyeo commented 4 months ago

Oh dear, I didn't realise it looks different on GitHub, because I attached it as a text file in the email.

direct_merged-toy.txt

read in abundance table from Metaphlan

tse <- importMetaPhlAn("direct_merged.txt",remove.suffix = TRUE, assay.type = "relabundance")

agglomerate by rank

tse <- agglomerateByRanks(tse)
altExp(tse, "species")

Daenarys8 commented 4 months ago

I am unable to reproduce the duplicates you mentioned above.

> anyDuplicated(rownames(tse))
[1] 0

Perhaps I misunderstood the problem.

li-fangyeo commented 4 months ago

Hi!

The rownames are not duplicated; instead, they are named, for example, "s__Abiotrophia_defectiva" and "s__Abiotrophia_defectiva_1", and so on.

antagomir commented 4 months ago

Some observations.

1) It seems to me that you have this issue with the rownames already in the original data, not just after agglomeration:

tse0 <- importMetaPhlAn("direct_merged-toy.txt",remove.suffix = TRUE, assay.type = "relabundance")
rownames(rowData(tse0)[grepl("Abiotrophia_defectiva", rowData(tse0)$Species),])

[1] "s__Abiotrophia_defectiva" "s__Abiotrophia_defectiva_1"

2) Agglomeration could not collapse these into a single category because the two rows differ in their Kingdom information even though Species is the same. Merging two Species that differ in higher-level categories is not good default behavior. I would primarily seek to solve this upstream, during data generation. There are also ways to solve this in mia, but the details depend on how you would like to solve it.

rowData(tse)[grepl("Abiotrophia_defectiva", rowData(tse)$Species), c("Kingdom", "Species")]

DataFrame with 2 rows and 2 columns
                               Kingdom                Species
s__Abiotrophia_defectiva   k__Bacteria s__Abiotrophia_defec..
s__Abiotrophia_defectiva_1          NA s__Abiotrophia_defec..

3) Your example code didn't work; for instance, "Species" was written as "species". It is good to test the code in advance.

4) Note that you don't need a separate step to attach metadata to tse. Instead, you can use the col.data argument in the importer:

tse0 <- importMetaPhlAn("direct_merged-toy.txt", remove.suffix = TRUE, assay.type = "relabundance", col.data=meta)
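
The check in observation 2 can be generalized to find every Species label whose rows disagree at a higher rank. The following is a minimal sketch in base R, not part of mia; the toy table `tax` and its values are hypothetical and only mimic the rowData shown above.

```r
# Toy taxonomy table (hypothetical values mimicking the rowData above)
tax <- data.frame(
  Kingdom = c("k__Bacteria", NA, "k__Bacteria"),
  Species = c("s__Abiotrophia_defectiva",
              "s__Abiotrophia_defectiva",
              "s__Zoogloea_SGB41465"),
  stringsAsFactors = FALSE
)

# Treat NA as a value of its own so that a missing rank counts as a difference
kingdom_key <- ifelse(is.na(tax$Kingdom), "<NA>", tax$Kingdom)

# Count the distinct Kingdom values observed for each Species label
n_lineages <- tapply(kingdom_key, tax$Species, function(x) length(unique(x)))

# Species labels whose rows disagree at the Kingdom level
conflicting <- names(n_lineages)[n_lineages > 1]
print(conflicting)
```

With real data, the same logic could presumably be applied to `as.data.frame(rowData(tse0))`, extending the key to all ranks above Species.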

Daenarys8 commented 4 months ago

Thanks for the clarification. The suffix _* appended to rownames(tse) is added when importing the MetaPhlAn results. This is because the experiment result (annotation data) is ambiguous; to avoid that ambiguity during analysis, the taxonomic labels are made unique. This is essential for data integrity and downstream analyses. I guess it also avoids unnecessary complexity in the data analysis and simplifies data interpretation.
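
The kind of suffixing described here can be illustrated with base R's `make.unique`. This is only a sketch of the naming pattern; whether the importer uses exactly this call internally is an assumption, and the label vector is a hypothetical toy example.

```r
# Toy labels: the same species label occurs twice (hypothetical example)
labels <- c("s__Abiotrophia_defectiva",
            "s__Abiotrophia_defectiva",
            "s__Zoogloea_SGB41465")

# Make the labels unique by appending "_<n>" to repeated entries
unique_labels <- make.unique(labels, sep = "_")
print(unique_labels)

# No duplicates remain, which is why anyDuplicated(rownames(tse)) returns 0
stopifnot(anyDuplicated(unique_labels) == 0)
```

The first occurrence keeps its original label and later occurrences receive a numeric suffix, which matches the rownames reported in this issue.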

Daenarys8 commented 4 months ago

As @antagomir rightly puts it, the ambiguity originates from the MetaPhlAn results.

antagomir commented 4 months ago

It is possible to collapse all Species into one category with specific commands, but I think that is a bit risky, and one should first investigate whether it is safe to do so.
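
To show what collapsing by Species alone would mean, here is a minimal base R sketch that sums rows sharing a Species label, regardless of their higher ranks. It does not use mia's own agglomeration functions. The matrix `abund`, the vector `species`, and all values are hypothetical toy data.

```r
# Toy relative abundances: 3 features x 2 samples (hypothetical values)
abund <- matrix(
  c(0.2, 0.1, 0.7,   # sample S1
    0.3, 0.0, 0.7),  # sample S2
  nrow = 3,
  dimnames = list(
    c("s__Abiotrophia_defectiva", "s__Abiotrophia_defectiva_1", "s__Zoogloea_SGB41465"),
    c("S1", "S2")
  )
)

# Species label of each row, ignoring Kingdom and other higher ranks
species <- c("s__Abiotrophia_defectiva",
             "s__Abiotrophia_defectiva",
             "s__Zoogloea_SGB41465")

# Sum abundances over rows that share a Species label
collapsed <- rowsum(abund, group = species)
print(collapsed)
```

This merges rows even when their higher-level annotation differs or is missing, which is exactly the risk mentioned above, so the conflicting rows should be inspected first.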