microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

rowData clean-up from character artifacts #406

Closed ChouaibB closed 1 year ago

ChouaibB commented 1 year ago

Hi,

This is related to the #303 discussion. It is a solution draft for cleaning up feature_data/rowData loaded from biom files that might contain character artifacts (e.g. `"`). The attached PDF shows tests of the function in question, makeTreeSEFromBiom (also based on the example test from #303).

If this seems fine, I could update the documentation examples to use the function's new argument, add unit tests, and bump the version.

Thanks.

makeTreeSEFromBiom_local_test.pdf
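
(For illustration, a minimal base-R sketch of the kind of clean-up being proposed; the data below is made up and this is not the actual mia implementation.)

```r
# Sketch only: taxonomy strings parsed from a biom file can keep literal
# quote characters around each rank, e.g. "\"k__Bacteria\"" instead of
# "k__Bacteria".
feature_data <- data.frame(
  Kingdom = c("\"k__Bacteria\"", "\"k__Bacteria\""),
  Phylum  = c("\"p__Bacteroidetes\"", "\"p__Firmicutes\""),
  stringsAsFactors = FALSE
)

# Strip the quote artifacts column by column before storing as rowData
cleaned <- as.data.frame(
  lapply(feature_data, function(x) gsub("\"", "", x, fixed = TRUE)),
  stringsAsFactors = FALSE
)
cleaned
#       Kingdom           Phylum
# 1 k__Bacteria p__Bacteroidetes
# 2 k__Bacteria    p__Firmicutes
```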

antagomir commented 1 year ago

checks to fix

ChouaibB commented 1 year ago

checks to fix

True; when I checked the job run log, the failures seemed to be related to other functions, and there does not seem to be a test script for makeTreeSEFromBiom. That is why I re-ran it. I will stop the run now and do the updates first.

antagomir commented 1 year ago

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

ChouaibB commented 1 year ago

It is possible to place testdata in the package in inst/extdata/ but not sure if this is necessary for now.

Sure, I can give it a try and add the tests for makeTreeSEFromBiom, if the changes to the function are otherwise OK?

antagomir commented 1 year ago

Hmm - one more change: it would be good to explicitly detect whether there are such non-standard characters to remove, and then throw a warning if such characters are being removed. This would make the process safer. By default there should not be such extra characters, I think, and if there are, it may indicate problems that users may wish to be aware of.

That new file can be called with: system.file("extdata/myfilename.end", package="mia")
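
(For example, a hypothetical unit test could read such a bundled file like this; the file name below is a placeholder.)

```r
# Hypothetical usage of a test file shipped under inst/extdata/ in mia;
# "myfilename.biom" is a placeholder, not a real file in the package.
biom_path <- system.file("extdata", "myfilename.biom", package = "mia")
biom_obj  <- biomformat::read_biom(biom_path)
# The object could then be passed to the importer under test, e.g.
# makeTreeSEFromBiom(biom_obj), assuming it accepts a biomformat object.
```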

ChouaibB commented 1 year ago

Hmm - one more change: it would be good to explicitly detect whether there are such non-standard characters to remove, and then throw a warning if such characters are being removed. This would make the process safer. By default there should not be such extra characters, I think, and if there are, it may indicate problems that users may wish to be aware of.

That new file can be called with: system.file("extdata/myfilename.end", package="mia")

That's actually true, thanks. Is it OK to keep the default "artifact" pattern to detect and remove as `\"`, and generalize it over time? It is a bit hard for me to know what kinds of unwanted (artifact) characters might occur in the taxonomy data within a biom file, unless of course the user already knows them and can pass them via the cleanTaxaPattern argument; a warning about the removal would then be thrown.

I will update the system.file call too, thanks.

ChouaibB commented 1 year ago

Sorry for the many replies :) In my humble experience with taxonomy data, it always contains alphanumeric characters (upper- and lower-case letters plus digits) together with underscores and/or dashes; anything else should be an artifact to detect and remove, except the characters used to parse the data into a table (e.g. sep=";" or sep="|"). Do you think a regex built on that assumption would be safe and general enough for detection and removal (while of course throwing the warning about it)?
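
(As a rough illustration of that idea, a sketch assuming the allowed set is alphanumerics, "-", "_", common separators, and whitespace; this is not code from the PR.)

```r
library(magrittr)  # for %>%

# Characters considered "standard" in taxonomy strings (an assumption):
# alphanumerics, "-", "_", common separators, and whitespace.
allowed <- "[[:alnum:]]|[-_,;|]|[[:space:]]"

# Return the unique characters in x that fall outside the allowed set
find_artifacts <- function(x) {
  x %>%
    stringr::str_split("") %>%
    unlist() %>%
    grep(allowed, ., invert = TRUE, value = TRUE) %>%
    unique()
}

find_artifacts("{\"taxonomy\": [\"k__Bacteria\", \"p__Bacteroidetes\"]}")
# e.g. "{" "\"" ":" "[" "]" "}"
```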

antagomir commented 1 year ago

Ok, that could be the default for now - but it should be an argument that the user can tweak if necessary. We can later remove this if it turns out to be problematic. The real solution would comply with the biom standard; I am not sure what that says about the extra characters. There are pros and cons, but in a way we also want to make this fluent for users so that they can focus on analysis of the data.

ChouaibB commented 1 year ago

Ok, that could be the default for now - but it should be an argument that the user can tweak if necessary. We can later remove this if it turns out to be problematic. The real solution would comply with the biom standard; I am not sure what that says about the extra characters. There are pros and cons, but in a way we also want to make this fluent for users so that they can focus on analysis of the data.

Before implementing anything, I was still curious about a way to detect non-standard character artifacts in taxonomy data, so I did some small testing based on an example file (attached here):

Example Data

The data was copied from Aggregated_humanization2.biom (viewed manually with the bash command less), taking the section corresponding to the taxonomy data, and then saved to a text file, test_text.txt.

Reading data

Just as an example for testing the search for artifacts, the text was split on "id" and each resulting list element was collapsed back into a single character string:

test_text <- readr::read_lines("test_text.txt") %>% stringr::str_split("id") %>% lapply(., function(l) paste(unlist(l), collapse = " "))

Testing the regular expression on one example of the test_text

Retrieving the taxonomy data, including the separators that occur in it, which should be kept so that the data can be parsed into a table afterwards.

Testing with one example, retrieving the taxonomy information usually wanted:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]", test_text[[1]] %>% stringr::str_split("") %>% unlist(), invert = FALSE, value = TRUE) %>% paste0(collapse = "")

result -> " 1726470, metadata taxonomy k__Bacteria, p__Bactero etes, c__Bactero ia, o__Bactero ales, f__Bactero aceae, g__Bactero es, 1726471, metadata "

Testing the negative (or invert) of the regular expression to retrieve the artifacts:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]", test_text[[1]] %>% stringr::str_split("") %>% unlist(), invert = TRUE, value = TRUE) %>% paste0(collapse = "") result -> "[{\"\":\"\"\"\":{\"\":[\"\\"\"\"\"\"\"\"\"\"\"\"\\"\"]}}{\"\":\"\"\"\":{\""

To collect the unique list of artifacts to be cleaned later:

grep("[[:alnum:]]|[-_,;|]|[[:space:]]", test_text[[1]] %>% stringr::str_split("") %>% unlist(), invert = TRUE, value = TRUE) %>% unique() result -> "[" "{" "\"" ":" "\" "]" "}"

Testing the regular expression on the whole test_text

Collecting the final unique list of character artifacts to be cleaned throughout the whole Taxonomy data:

lapply(test_text, function(x) { grep("[[:alnum:]]|[-_,;|]|[[:space:]]", x %>% stringr::str_split("") %>% unlist(), invert = TRUE, value = TRUE) %>% unique() }) %>% unlist() %>% unique()

result -> "[" "{" "\"" ":" "\" "]" "}"

Consequently, this could perhaps be an automated way of detecting artifacts and forming the patterns to be cleaned in makeTreeSEFromBiom (which in this case would automatically become cleanTaxaPattern="[{\":\]}"), for example when the user provides cleanTaxaPattern=NULL or something similar (maybe even offering the value 'auto'), so that the taxonomy data, and thus the rank names, can hopefully be parsed safely later in the function. test_text.txt

Of course, with the current implementation biomformat::observation_metadata already does some parsing and cleaning of some of the artifacts seen above in the raw data file. The idea, if it seems reasonable, is that I could implement further utility/helper function(s) that would double-check by detecting, cleaning, and throwing the warning. P.S. If this seems interesting, some hints on how to improve the regex used in the example would be appreciated, thanks :)
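
(If helpful, a condensed sketch of how such a double-check could look. The function below is illustrative only and is not one of the actual helpers added in this PR; the 'auto' behaviour and the allowed-character class are assumptions.)

```r
# Illustrative sketch, not the actual mia helpers from this PR.
clean_taxa_artifacts <- function(taxa, pattern = "auto",
                                 allowed = "[[:alnum:]]|[-_,;|]|[[:space:]]") {
  if (identical(pattern, "auto")) {
    # Detect unexpected characters from the data itself
    chars <- unique(unlist(strsplit(taxa, "")))
    bad   <- grep(allowed, chars, invert = TRUE, value = TRUE)
    if (length(bad) == 0) {
      return(taxa)  # nothing to clean
    }
    warning("Non-standard characters detected and removed from taxonomy data: ",
            paste(bad, collapse = " "))
    # Remove each detected character literally (avoids regex-escaping issues)
    for (ch in bad) {
      taxa <- gsub(ch, "", taxa, fixed = TRUE)
    }
    return(taxa)
  }
  # User-supplied pattern: warn only if something would actually be removed
  if (any(grepl(pattern, taxa))) {
    warning("Characters matching '", pattern, "' were removed from taxonomy data")
    taxa <- gsub(pattern, "", taxa)
  }
  taxa
}

clean_taxa_artifacts(c("\"k__Bacteria\"", "\"p__Bacteroidetes\""))
# warns about the removed quote character, then returns:
# [1] "k__Bacteria"      "p__Bacteroidetes"
```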

antagomir commented 1 year ago

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

ChouaibB commented 1 year ago

Can you summarize in a much shorter way what is the problem and what is the proposed solution?

I will commit the implementation I had in mind; then I guess it will be clearer.

antagomir commented 1 year ago

@ChouaibB note that @RiboRings is now making some changes in OMA https://github.com/microbiome/OMA/pull/316

ChouaibB commented 1 year ago

One question/suggestion @antagomir @TuomasBorman: if this implementation seems fine, should .detect_taxa_artifacts_and_clean, .detect_taxa_artifacts, and .clean_from_artifacts be moved to R/taxonomy.R or R/utils.R, where they could potentially help check the taxonomy parsing of other types of data (e.g. makeTreeSummarizedExperimentFromDADA2.R or makeTreeSummarizedExperimentFromPhyloseq.R)?

antagomir commented 1 year ago

You can move if it seems useful but I think it should be possible to use these regardless of the file where they are located.

ChouaibB commented 1 year ago

Ahh ok, I will keep them where they are for now. Regarding the failed checks (link): it seems they are not related to the tests of makeTreeSummarizedExperimentFromBiom.R (actually I guess the run fails before even reaching the tests in test-10makeTreeSEFromBiom.R; locally these tests all pass with no problem). Is somebody working on debugging the failing checks, or shall I also look at them, so this can potentially be merged if everything else is OK?

antagomir commented 1 year ago

Strange that we have such errors suddenly appearing. It would be great if you could fix them in the same go?

ChouaibB commented 1 year ago

Strange that we have such errors suddenly appearing. It would be great if you could fix them in the same go?

I will give it a try.

ChouaibB commented 1 year ago

About the failed checks: when I ran the checks locally, I took some notes on the reason for each error and the fix:

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-0diversity.R:27:5'): diversity estimates ─────────────────────
`lambda` not equal to .simpson_lambda(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:28:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to colData(tse_idx)$gini_simpson.
names for current but not for target
── Failure ('test-0diversity.R:29:5'): diversity estimates ─────────────────────
`ginisimpson` not equal to .calc_gini_simpson(assays(tse_idx)$relabundance).
names for current but not for target
── Failure ('test-0diversity.R:31:5'): diversity estimates ─────────────────────
`invsimpson` not equal to colData(tse_idx)$inverse_simpson.
names for current but not for target
── Failure ('test-0diversity.R:32:5'): diversity estimates ─────────────────────
`invsimpson` not equal to .calc_inverse_simpson(assays(tse_idx)$relabundance).
names for current but not for target
## ==> solution was to unname the named returned vector of each right-hand 
## side of the test 

── Failure ('test-0diversity.R:66:5'): diversity estimates ─────────────────────
`test1` not equal to `test2`.
Names: 3 string mismatches
## ==> solution was to unname both since they have same values but different 
## naming of the vectors at each side of the test

── Failure ('test-0diversity.R:67:5'): diversity estimates ─────────────────────
round(test1, 6) not equal to c(7.256706, 6.098354, 7.278894).
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:35:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled, "subsampled")) not equal to `expColSums`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal

── Failure ('test-8subsample.R:51:5'): subsampleCounts ─────────────────────────
colSums2(assay(tse.subsampled.rp, "subsampled")) not equal to `expColSumsRp`.
names for target but not for current
## ==> solution was to unname left-hand side of test otherwise values were equal  
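
(A condensed illustration of the unname() fix described in the notes above; simplified, not the exact test code.)

```r
library(testthat)

observed <- c(A = 7.256706, B = 6.098354, C = 7.278894)  # named result vector
expected <- c(7.256706, 6.098354, 7.278894)              # unnamed reference

# expect_equal(observed, expected) would fail with a names mismatch,
# so the names are dropped on the named side:
expect_equal(unname(observed), expected)
```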

Hopefully it now passes the checks on GitHub :)

ChouaibB commented 1 year ago

merge when ready unless there are further comments

I will merge it for now. Perhaps when the utility/helper functions within this one turn out to be useful for other importers, they could be moved to taxonomy.R. I was just thinking that, for future development, any taxonomy-related helper functions could then be found in taxonomy.R, so that the utilities and their purposes are easy to find and read under R/; developers would not need to go through each function searching for a helper that could be re-used. Grouping utility functions by type under R/ could make the code easier to navigate. But this is just an opinion, of course :)