microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

Taxonomy parsing #303

Closed antagomir closed 1 year ago

antagomir commented 2 years ago

Consider using these parsing functions in examples, or having analogous functions to

Currently OMA examples handle this with:

# Goes through every column of the rowData DataFrame and removes the pattern
# '.*[kpcofg]__' from each string, where [kpcofg] is any one of the listed
# characters and .* is any run of characters preceding it.
rowdata_modified <- BiocParallel::bplapply(rowData(se), 
                                           FUN = stringr::str_remove, 
                                           pattern = '.*[kpcofg]__')

Which is handy but wrappers might help less experienced users.
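
A minimal sketch of what such a wrapper could look like, using base R only. The function name `strip_taxa_prefixes` and the toy table are hypothetical and not part of mia; a plain data.frame stands in for rowData(se).

```r
# Hypothetical helper (not part of mia): strip rank prefixes such as "k__" or
# "g__" from every column of a taxonomy table.
strip_taxa_prefixes <- function(tax, pattern = ".*[kpcofg]__") {
    # sub() removes the first (greedy) match in each string; columns are
    # coerced to character so that factor columns are handled as well.
    tax[] <- lapply(tax, function(x) sub(pattern, "", as.character(x)))
    tax
}

# Toy taxonomy table standing in for rowData(se)
tax <- data.frame(
    Kingdom = c("k__Bacteria", "k__Archaea"),
    Genus   = c("g__Bacteroides", "g__Methanobrevibacter")
)

cleaned <- strip_taxa_prefixes(tax)
print(cleaned)
```

The pattern is the same one used in the OMA snippet above; exposing it as an argument keeps the helper usable for files with other prefix conventions.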

antagomir commented 1 year ago

Would this be handy with e.g. FR02 work?

ChouaibB commented 1 year ago

The current experience with the FR02 work relies on mia::loadFromQIIME2 for QIIME2 files, on mia::makeTreeSEFromPhyloseq when the phyloseq object is pre-made, and on mia::loadFromBiom when the object is a biom file; in the biom case I would then need to clean the rowData in a similar fashion. While working on FR02 with mia_1.7.7, I unfortunately did not notice that mia::makeTreeSEFromBiom was available, which takes care of the taxonomy name clean-up. I will update this in FR02, thanks.

antagomir commented 1 year ago

Does this mean that mia already has the necessary functionality, compared to:

If so, we can close this issue?

ChouaibB commented 1 year ago

Sorry, I misunderstood the question. So far mia has functions for Humann, Metaphlan, Mothur, QIIME2, biom, DADA2 and Phyloseq. The mia taxonomy.R file seems to have a lot of utilities for handling and wrangling taxonomy data, which could perhaps be seen as counterparts of phyloseq::parse_taxonomy_default. I will take a deeper look at the phyloseq methods above, since at first I was more concerned with the type of file output produced by the pipeline than with the database used during metagenomic inference. I will check.

antagomir commented 1 year ago

Ok if there is no immediate need, we could also close this and just keep it in mind.

However, if mia/OMA users currently need to do some notable manual parsing after importing data (check mia/OMA examples?), then it could be worth fixing that immediately, and perhaps this gives some idea of how to do it.

ChouaibB commented 1 year ago

I have just been checking the source code related to the methods above. For example, phyloseq::parse_taxonomy_greengenes, phyloseq::parse_taxonomy_qiime and phyloseq::parse_taxonomy_default call each other, and what I think is good about them is that their inputs and outputs are character vectors (which taxonomy names mostly are). I wonder whether it would be ok to use these methods, for example in the mia taxonomy.R, and further parse their character vector output into a table that fits the rowData slot of a TreeSE; and/or to create wrappers in mia (corresponding to the versions above) that rely on the phyloseq taxonomy parsing and eventually output TreeSEs. Do you think this would be practical?

antagomir commented 1 year ago

Is some relevant functionality currently missing from mia that this would solve?

We would not import phyloseq functions but we can see if there are useful additional ideas that help to simplify TreeSE-based workflows further.

ChouaibB commented 1 year ago

Actually, after checking further :) the mia:::.parse_taxonomy function is very practical and flexible, and I guess it would also work for the earlier version of QIIME (not QIIME2). I couldn't find an example of raw QIIME taxonomy output online, but the examples shown here reveal that the taxonomy table output is no different from that of QIIME2 (example output). The sep and column_name arguments of mia:::.parse_taxonomy are therefore nicely generic across multiple outputs, and would also work for QIIME2 Greengenes (example taxa table at figure). So I think that, after some testing, the documentation could mention that the mia importer functions in question could work for an earlier version of QIIME2 and would work for Greengenes (perhaps Greengenes2 as well). Regarding the OMA example mentioned above, if it corresponds to this, I guess mia:::.parse_taxonomy would work in one go; the same utility function could perhaps be extended with a pattern parameter for cases where the user knows that the raw file has some anomalies compared to the expected normal raw file. I could give it a try/test in OMA if needed?
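
To illustrate the general idea of separator-based taxonomy parsing discussed here, a minimal base R sketch follows. It is not the code of mia:::.parse_taxonomy; the function name `parse_taxonomy_string`, its arguments and the example lineages are assumptions made for illustration.

```r
# Hypothetical illustration only; this is NOT the code of mia:::.parse_taxonomy.
parse_taxonomy_string <- function(x, sep = ";\\s*",
                                  ranks = c("Kingdom", "Phylum", "Class", "Order",
                                            "Family", "Genus", "Species")) {
    parts <- strsplit(x, sep)
    # Pad shorter lineages with NA so that every row has one entry per rank
    mat <- t(vapply(parts, function(p) {
        length(p) <- length(ranks)
        p
    }, character(length(ranks))))
    # Drop leading rank prefixes such as "k__" or "p__"
    mat[] <- sub("^[a-z]+__", "", mat)
    colnames(mat) <- ranks
    as.data.frame(mat, stringsAsFactors = FALSE)
}

# Made-up lineage strings in a Greengenes-like style
lineages <- c("k__Bacteria; p__Firmicutes; c__Clostridia",
              "k__Archaea; p__Euryarchaeota")
tax <- parse_taxonomy_string(lineages)
print(tax)
```

Keeping the separator as an argument is what makes one function reusable across outputs that differ only in how the lineage string is delimited.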

antagomir commented 1 year ago

Great.

Ideally, all this is already dealt with in the import function so that the user can be saved from the manual modifications of the taxonomy table.

In the linked OMA example, manual modifications seem to be necessary - is there a way to improve the importer so that it could readily handle this automatically, or semi-automatically for the user? Then the OMA example could be simplified.

Exporting mia:::.parse_taxonomy could be an option if it seems that the importers cannot readily handle all/most cases. Otherwise I would not see a need for exporting this function.

ChouaibB commented 1 year ago

After looking up the mia version used when rendering the OMA book, I realized it was mia_1.9.9 (example source: the sessionInfo in the run log). That mia version includes the makeTreeSEFromBiom function, which could simplify the OMA example in question. I have attached a PDF-rendered version as an example of using the makeTreeSEFromBiom function for that same OMA example. Yet, due to the original data file used, the Kingdom and Genus ranks still need to be cleaned from the extra \" characters appearing in their names. 04_containers_3_4_5_1_Biom_import_TEST.pdf

antagomir commented 1 year ago

1) The rank names can be inferred from the prefixes that are being removed as part of the makeTreeSEFromBiom call. In my view it would be enough to show just that example (from side note), with a small explanation what happens there. Two examples for this kind of thing is a bit much.

2) Is there a chance to improve the importer so that it could automatically detect and remove such special characters as part of prefix removal? It somehow feels that OMA should not necessarily allocate too much space for teaching basic string manipulation during data import. Moreover, if this is something that could be safely automated it can help many users.

3) The importer could also have an option for adding the rownames as a new rowData field, with a specified name. Sometimes the rownames correspond to OTUs/ASVs/OGUs or other such entities, and the user may like to have these also as a field in rowData, for plotting and other purposes. See tibble::rownames_to_column.

4) For setting sample names in the metadata, how about using tibble::column_to_rownames() instead of rownames(sample_meta) <- sample_meta[,1]; sample_meta[,1] <- NULL?

5) Why not read the metadata file with DataFrame(read.csv(sample_meta_file_path, sep = ",", header = TRUE)), as this will read the headers automatically (no need to add them afterwards; although it must be made sure that the examples will work with the original names)?

6) Reading and adding the tree file could perhaps go on a single line? And the "check" part after adding the tree can be removed to keep the text clearer.
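
Points 3) to 5) can be sketched as follows. The CSV content and the column names are made up for illustration, and the `row.names = 1` variant is a base R alternative that is not mentioned in the list; the tibble calls are only run if tibble is installed.

```r
# Toy metadata file content; the column names are made up for illustration
csv_text <- "SampleID,Group\nS1,A\nS2,B\nS3,A"

# header = TRUE is the default of read.csv(), so column names are read automatically
sample_meta <- read.csv(text = csv_text, sep = ",", header = TRUE)

# Manual approach: move the first column into the rownames
manual <- sample_meta
rownames(manual) <- manual[, 1]
manual[, 1] <- NULL

# Base R alternative: let read.csv() use the first column as rownames directly
direct <- read.csv(text = csv_text, row.names = 1)

# tibble helpers named in the list, used only if tibble is installed
if (requireNamespace("tibble", quietly = TRUE)) {
    with_tibble <- tibble::column_to_rownames(sample_meta, var = "SampleID")
    # Reverse direction: copy the rownames into a new, user-named column
    restored <- tibble::rownames_to_column(with_tibble, var = "SampleID")
    print(restored)
}

print(manual)
```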

antagomir commented 1 year ago

Overall, simplifying the examples is desirable, and the importer can be improved. However, the improvements to the importer should remain relatively generic: we do not want to handle every special case separately, and ideally we support the predefined standards.

ChouaibB commented 1 year ago

Regarding the test presented in that PDF example, which uses both removeTaxaPrefixes=TRUE and rankFromPrefix=TRUE in mia::makeTreeSEFromBiom: I think line 124 of the source code could be changed to check for and remove possible artifacts, e.g. colnames(feature_data) <- colnames %>% stringr::str_remove(pattern = '\"'), along with other patterns that could occur. The taxonomy rank names would then be parsed safely when rankFromPrefix=TRUE is used in mia::makeTreeSEFromBiom.

Regarding cleaning the Kingdom and Genus ranks (in the same example) when using removeTaxaPrefixes=TRUE in mia::makeTreeSEFromBiom: updating the patterns searched for at line 142 to, for example, patterns <- "sk__|\"|([dkpcofgs]+)__", plus other possible artifacts that could occur, would parse and clean the rowData safely (example in the attached image).
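
A small sketch of how the proposed pattern behaves on strings that carry both prefixes and stray quote characters. The input values are made up, and base R gsub() is used here as an assumption about how the pattern would be applied; the actual importer code may differ.

```r
# Pattern proposed in the comment above
patterns <- "sk__|\"|([dkpcofgs]+)__"

# Made-up taxonomy entries with prefixes and stray quote characters
taxa <- c("\"k__Bacteria", "g__\"Bacteroides\"", "sk__Archaea", "s__")

# gsub() removes every match in each string, whereas sub() or
# stringr::str_remove() would stop after the first match and could
# leave a second quote character behind.
cleaned <- gsub(patterns, "", taxa)
print(cleaned)
```

Entries that consist of a prefix only, such as "s__", end up as empty strings, so a follow-up step may be needed if those should become NA instead.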

antagomir commented 1 year ago

Sounds good.

antagomir commented 1 year ago

Btw. the OMA related parts would rather form an issue in OMA but ok for now like this if we could troubleshoot it rather soon.

ChouaibB commented 1 year ago

Ok, I just focused on the rowData cleaning parts, as that was the main topic of this discussion :) and left the rest of the OMA example as is (the sample_meta and so on). I agree that the rest of the example after the rowData cleaning could be improved as well. Yes, the makeTreeSEFromBiom function needs to be improved to keep the tutorials and examples concise and informative.

ChouaibB commented 1 year ago

Yes, I was thinking: should the makeTreeSEFromBiom function be updated here first, and then the example updated in OMA?

antagomir commented 1 year ago

Yes!

ChouaibB commented 1 year ago

ok, thanks. I will give it a try :)