Closed li-fangyeo closed 4 months ago
@Daenarys8 do you see a solution here?
Hi @li-fangyeo, I would like to reproduce the problem you are facing. Can you share the dataset file, direct_merged.txt?
If the data is sensitive you can replace it with similar toy data that generates the same error, and share by email.
Is there an email address I can send it to? I received a message saying the file size was too large.
On Mon, Jul 29, 2024 at 9:39 AM Leo Lahti wrote:
If the data is sensitive you can replace it with similar toy data that generates the same error, and share by email.
This kind of thing should be shared as a text/csv file instead of copy-pasted code.
For large files you can use https://filesender.funet.fi/
Oh dear, I didn't realise it looks different on GitHub, because I attached it as a text file in the email.
tse <- importMetaPhlAn("direct_merged.txt", remove.suffix = TRUE, assay.type = "relabundance")
tse <- agglomerateByRanks(tse)
altExp(tse, "species")
I am unable to reproduce the duplicates you mentioned above.
> anyDuplicated(rownames(tse))
[1] 0
Perhaps I didn't understand the problem.
Hi!
The rownames are not duplicated; instead they are named, for example, "s__Abiotrophia_defectiva" and "s__Abiotrophia_defectiva_1", and so on.
Some observations.
1) It seems to me that the rowname issue is already present in the original data, not just after agglomeration.
tse0 <- importMetaPhlAn("direct_merged-toy.txt",remove.suffix = TRUE, assay.type = "relabundance")
rownames(rowData(tse0)[grepl("Abiotrophia_defectiva", rowData(tse0)$Species),])
[1] "s__Abiotrophia_defectiva" "s__Abiotrophia_defectiva_1"
2) Agglomeration could not collapse these into a single category because the two rows differ in their Kingdom information even though Species is the same. Merging two Species that differ in higher-level categories is not good default behavior. I would primarily try to solve this upstream, during data generation. There are also ways to solve this in mia, but the details depend on how you want to handle it.
rowData(tse)[grepl("Abiotrophia_defectiva", rowData(tse)$Species), c("Kingdom", "Species")]
DataFrame with 2 rows and 2 columns
                               Kingdom                Species
s__Abiotrophia_defectiva   k__Bacteria s__Abiotrophia_defec..
s__Abiotrophia_defectiva_1          NA s__Abiotrophia_defec..
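To illustrate the point about differing higher-level categories, here is a minimal base R sketch. It does not use mia and is not how mia implements agglomeration; the toy table and the names `tax`, `key_full` and `key_species` are made up for illustration. It only shows that a grouping key built from both Kingdom and Species keeps the two rows apart, while a key built from Species alone would merge them.

```r
# Toy taxonomy table (hypothetical, mirroring the two rows shown above)
tax <- data.frame(
  Kingdom = c("k__Bacteria", NA),
  Species = c("s__Abiotrophia_defectiva", "s__Abiotrophia_defectiva"),
  row.names = c("s__Abiotrophia_defectiva", "s__Abiotrophia_defectiva_1")
)

# Key that includes the higher rank: the NA Kingdom makes the two keys differ
key_full <- paste(tax$Kingdom, tax$Species, sep = ";")

# Key that ignores higher ranks: both rows share the same value
key_species <- tax$Species

n_groups_full <- length(unique(key_full))
n_groups_species <- length(unique(key_species))

print(n_groups_full)     # number of groups when Kingdom is part of the key
print(n_groups_species)  # number of groups when only Species is used
```

With the full key the two rows stay in separate groups, which matches the behavior described above.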
3) Your example code didn't work: for instance, "Species" was written as "species". It is good to test the code in advance.
4) Note that you don't need a separate step to attach metadata to tse. Instead, you can use the col.data argument in the importer.
tse0 <- importMetaPhlAn("direct_merged-toy.txt", remove.suffix = TRUE, assay.type = "relabundance", col.data=meta)
Thanks for the clarification. The suffix _* appended to rownames(tse) is added when importing the MetaPhlAn results. The taxonomic annotation in the results is ambiguous, so the taxonomic labels are made unique to avoid that ambiguity during analysis. This is essential for data integrity and downstream analyses. I guess it also avoids unnecessary complexity in the data analysis and simplifies interpretation.
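For intuition, the same naming pattern can be reproduced in base R with `make.unique()` and `sep = "_"`. This is only a sketch: whether the importer calls `make.unique()` internally is an assumption, and the label vector below is toy data.

```r
# Toy labels with one repeated species name (hypothetical data)
labels <- c("s__Abiotrophia_defectiva",
            "s__Abiotrophia_defectiva",
            "s__Other_species")

# Append "_1", "_2", ... to repeated entries so that every label is unique
unique_labels <- make.unique(labels, sep = "_")

print(unique_labels)
print(anyDuplicated(unique_labels))  # 0 means no duplicates remain
```

This is also why `anyDuplicated(rownames(tse))` returns 0: the labels differ by suffix even though the underlying species name is the same.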
As @antagomir rightly puts it, the ambiguity originates from the MetaPhlAn results.
It is possible to collapse all rows of a Species into one category with specific commands, but I think that is a bit risky; you should first investigate whether it is safe to do so.
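As a rough idea of what such a collapse does, here is a base R sketch on a toy matrix. It uses `rowsum()` rather than any mia function, and the abundance values (in percent), sample names and the `species` vector are invented for illustration. The check on Kingdom is one possible way to "investigate first", not a recommended rule.

```r
# Toy abundance table in percent (hypothetical values)
abund <- matrix(
  c(10, 20,
     5,  0,
    85, 80),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("s__Abiotrophia_defectiva", "s__Abiotrophia_defectiva_1", "s__Other_species"),
    c("sample1", "sample2")
  )
)

# Species label and Kingdom for each row (the second row lacks a Kingdom)
species <- c("s__Abiotrophia_defectiva", "s__Abiotrophia_defectiva", "s__Other_species")
kingdom <- c("k__Bacteria", NA, "k__Bacteria")

# Check first: which species have rows that disagree on (or lack) Kingdom?
conflicting <- names(which(tapply(kingdom, species,
                                  function(k) length(unique(k)) > 1)))
print(conflicting)

# Collapse: sum the abundances of all rows that share a Species label
collapsed <- rowsum(abund, group = species)
print(collapsed)
```

Here the check flags s__Abiotrophia_defectiva, so the summed row should only be trusted after confirming that the row with missing Kingdom really belongs to the same organism.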
Hi again.
May I know what I am doing wrong here? I am trying to agglomerate by rank, but no matter whether I do it per taxonomic level or with agglomerateByRanks, I get duplicates of the species (see rownames). How do I get rid of them?
> package.version('mia')
[1] "1.13.34"
Best, Li-Fang