msberends / AMR

Functions to simplify and standardise antimicrobial resistance (AMR) data analysis and to work with microbial and antimicrobial properties by using evidence-based methods, as described in https://doi.org/10.18637/jss.v104.i03.
https://msberends.github.io/AMR/

Attributes lost after exporting/importing data sets? #22

Closed jukkiebah closed 3 years ago

jukkiebah commented 3 years ago

I have a question regarding the use of AMR and the special classes that can be made with AMR.

I often export data files and save them. After I import them back into RStudio, I lose the attributes that were obtained using the functions in AMR, for instance the special classes like mo, rsi and mic. Of course I can restore them by using the original as.mo and as.rsi functions. However, these are the ones that take the most time to run: as.mo and as.disk in particular take >10 minutes on my data set. This is very inconvenient, so I'm looking for a solution. I've tried a number of options, such as readr, feather and rds, but none of these provide a solution. How do you go about this problem?

thanks!

msberends commented 3 years ago

Good question. The classes that the AMR package provides are, just like all custom R classes in any package, only supported by R. This means that exporting the data to a non-R format (a CSV file, for example) will lose the attributes. Fortunately, this is not a big problem. Since the data were converted by the AMR package in the first place, rerunning functions like as.mo() and as.disk() will only take milliseconds. The functions are built so that they first check whether all values to be transformed already look like valid values.

As an example:

microbenchmark::microbenchmark(x <- as.mo(c("E. coli test",
                                            "Staph aureus kind of",
                                            "Some Str. pyogenes")),
                               times = 5)
#> Unit: seconds
#> expr       min       lq     mean   median       uq      max neval
#>    x  7.516604 7.569028 7.584643 7.576983 7.579677 7.680924     5

# `x` now consists of valid MO values:
x
#> Class <mo>
#> [1] B_ESCHR_COLI      B_STPHY_AURS_ANRB B_STRPT_PYGN     

# transform to character, as it would be in e.g. a CSV
x <- as.character(x)
x
#> [1] "B_ESCHR_COLI"      "B_STPHY_AURS_ANRB" "B_STRPT_PYGN"     

# now run again:
microbenchmark::microbenchmark(x <- as.mo(x), times = 5, unit = "s")
#> Unit: seconds
#>           expr         min          lq        mean     median          uq         max neval
#>  x <- as.mo(x) 0.004543719 0.004590081 0.004932438 0.00467622 0.004956594 0.005895575     5

The median time for this small test went from 7.6 seconds to 0.005 seconds. The same holds for as.rsi(), as.disk() and as.mic(): if you used any of those functions before, running them again on data where the attributes were lost will be very, very fast.
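To give a rough idea of why a second run can be so much faster, here is a toy sketch of the "check first whether values already look valid" idea in base R. It is not the AMR package's actual implementation: `valid_codes`, `slow_lookup()` and `as_demo_code()` are hypothetical names made up for this illustration, and the matching rule is deliberately simplistic.

```r
# Hypothetical table of already-valid codes (not AMR's real reference data)
valid_codes <- c("B_ESCHR_COLI", "B_STPHY_AURS", "B_STRPT_PYGN")

# Hypothetical stand-in for an expensive free-text matching step
slow_lookup <- function(x) {
  vapply(x, function(value) {
    if (isTRUE(grepl("coli", value, ignore.case = TRUE))) {
      "B_ESCHR_COLI"
    } else {
      NA_character_
    }
  }, character(1), USE.NAMES = FALSE)
}

# Toy converter: values that are already valid codes skip the slow path
as_demo_code <- function(x) {
  x <- as.character(x)
  out <- x
  needs_lookup <- !(x %in% valid_codes)
  if (any(needs_lookup)) {
    out[needs_lookup] <- slow_lookup(x[needs_lookup])
  }
  structure(out, class = c("demo_code", "character"))
}

first_run <- as_demo_code(c("E. coli test", "B_STRPT_PYGN"))
# After a CSV round trip the class is gone, but the values are still valid codes
second_run <- as_demo_code(as.character(first_run))
```

In this sketch the second call never reaches `slow_lookup()`, because every value is already found in `valid_codes`.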

Of course, it's better to avoid having to transform variables again, which is only possible with RDS. You already say that you use this, so I don't understand why attributes are lost on your system. As an example test (example_isolates is an example data set from the AMR package that contains an <mo> column and numerous <rsi> columns):

saveRDS(example_isolates, "test.rds")
test <- readRDS("test.rds")
identical(example_isolates, test)
#> [1] TRUE

Since identical() also checks attributes, this means that the newly imported test data set is identical to the example data set, including all attributes. That is exactly what one would expect from exporting to RDS and importing again. So I'm not sure what happened in your script?
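For comparison, here is a small base R sketch that round-trips the same data frame through RDS and through CSV. It does not use the AMR package: the `demo_mo` class is a hypothetical stand-in for a class such as <mo>, and the file paths are temporary files.

```r
# Data frame with a custom-classed column (hypothetical stand-in for <mo>)
df <- data.frame(id = 1:3)
df$code <- structure(c("B_ESCHR_COLI", "B_STPHY_AURS", "B_STRPT_PYGN"),
                     class = c("demo_mo", "character"))

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

# RDS is R's own serialisation format, so classes and attributes are stored
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

# CSV is plain text, so only the values are written out
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

class(from_rds$code)  # still carries "demo_mo"
class(from_csv$code)  # plain "character"
```

The values survive both round trips; only the RDS copy keeps the class, which is why the CSV copy would need the conversion function to be run again.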

jukkiebah commented 3 years ago

Thanks, I am still learning. I had switched to feather because of its increased speed with the large files that I am handling (>2 GB). I hadn't realised that RDS is native to R, and I thought I would lose the attributes with it as well. This is not the case. I have switched back to RDS now. Thanks!

msberends commented 3 years ago

No problem at all! We’re all still learning, that’s what’s great about new methods 😄