henningte commented 1 year ago

Great package that you have developed!

I am interested in linking 'opusreader2' to my package 'ir' by writing a function to cast the data list returned by read_opus() to ir objects. This would be much easier if the data list returned by read_opus() had its own dedicated S3 class.

Do you plan to implement such a class in the future?

And one additional question: Do you plan a CRAN release in the future?

philipp-baumann commented 1 year ago

Hi Henning,

happy to hear you like the package and find it useful.

I like the the idea to attach some S3 class to the final output list of read_opus(). Could be something like list_read_opus. Any other propositions?

Along what you propose, I think could be quite nice to have a nice print generic for this class. While a lot of people certainly use the list viewer in RStudio local mode, I often accidentally print list remote exploration (ssh) in e.g. VSCode and my console gets flushed. Such a pretty print method would be a nice little side effect.

Also along with you casting endeavor into the custom IR data class --- from a quick look it's sort of a list column data.frame format with long format spectra data.frames items, right? --- my friend @ThomasKnecht has started drafting a metadata extractor to post-process results from read_opus() in a helper. The proposal is mentioned in #66 and the current state is here.

https://github.com/spectral-cockpit/opusreader2/tree/feat-44-extract-metadata

I'm planning to work further on it while Thomas is travelling (we can for sure ping him once there is a proposition). It would be nice if you stayed in the loop for feedback and suggestions or even PRs :-)

Does this sound like a plan?

Right now we're at version v0.1.0 and just made a release on GH. I consider after adding a class and completing some basic vignette that the main interface is now stable. So to speak this definitely calls for a CRAN release in 0.2.0 very soon :-)

Cheers, Philipp

philipp-baumann commented 1 year ago

66

philipp-baumann commented 1 year ago

Why I'm suggesting to put the metadata extractor in the loop is because there is quite a bit of rectangular linking of OPUS data that we can chip in to simplify. It's really a lot of data to digest ;-) Also read a bit more about IR. Quite like it. It has a bit more processing features than simplerspec,that relies a lot on prospectr which is a really nice package. Regarding data structures generally I'm ATM opting for some python integration like zarr (array-like) on disk storage and implement some high-level wrapper into R. I really like R but I think it is missing some proper array features. Maybe a bit too far from this issue but just some general thoughts.

henningte commented 1 year ago

Hi Philipp,

thanks for your quick and positive reply! Your suggestions all sound good :)

I am happy to provide feedback, but I won't have the time to write a PR.

Here are some ideas to the points you've mentioned:

Class name: Another name for the class could be opusreader, if this will be the only top-level class of the package. afaik, this is the case?
ir structure:

Also along with you casting endeavor into the custom IR data class --- from a quick look it's sort of a list column data.frame format with long format spectra data.frames items, right?

Yes, that's right. I think in comparison to existing packages, the advantages are a full support of tidyverse methods and many methods to manipulate spectral data as a long-format data frame (e.g. subsetting of columns).

metadata extractor:

my friend ThomasKnecht has started drafting a metadata extractor to post-process results from read_opus() in a helper.

I haven't had a deep look into the structure of the list returned by read_opus() and its elements yet. From what I've seen in extract_metadata.R, this will already create data frames with metadata with a row for each spectrum, right? Yes, this should make casting easier :)

Will the final class be a list or some sort of data frame?

I also like 'simplerspec'. I already would have implemented a casting method if 'simplerspec' had an S3 class (I think it has not, right? Maybe an idea for the future :) ). Unfortunately, many spectral R packages do not have classes for their data structures which makes it a bit difficult to link packages to use different packages for what they are good at. 'ChemoSpec' and 'hyperSpec' ,which also are great packages I borrow functions from, have own classes, but 'prospectr' has for example no own class.

Best wishes,

Henning

philipp-baumann commented 1 year ago

1. **Class name:**
   Another name for the class could be `opusreader`, if this will be the only top-level class of the package. afaik, this is the case?

Thanks for the suggestion. I would prefer calling it "opusreader2" to not confuse it with the old opusreader package. I am now somewhere between "list_opusreader2" and "list_read_opus"

2. **`ir` structure:**
Also along with you casting endeavor into the custom IR data class --- from a quick look it's sort of a list column data.frame format with long format spectra data.frames items, right?

Yes, that's right. I think in comparison to existing packages, the advantages are a full support of tidyverse methods and many methods to manipulate spectral data as a long-format data frame (e.g. subsetting of columns).

Got you. Yes for plotting the long form for sure comes handy. On the other hand many statistical functions build on top of long data frames or matrices.

3. **metadata extractor:**
my friend ThomasKnecht has started drafting a metadata extractor to post-process results from read_opus() in a helper.

I haven't had a deep look into the structure of the list returned by read_opus() and its elements yet. From what I've seen in extract_metadata.R, this will already create data frames with metadata with a row for each spectrum, right? Yes, this should make casting easier :)

Yes exactly :-)

Will the final class be a list or some sort of data frame?

4. I also like 'simplerspec'. I already would have implemented a casting method if 'simplerspec' had an S3 class (I think it has not, right? Maybe an idea for the future :) ). Unfortunately, many spectral R packages do not have classes for their data structures which makes it a bit difficult to link packages to use different packages for what they are good at. 'ChemoSpec' and 'hyperSpec' ,which also are great packages I borrow functions from, have own classes, but 'prospectr' has for example no own class.

It's gonna be like two or three data frames of metadata with shared keys.

Good point. Leo's thought behind prospectr was really to deal wherever possible with matrix. Although this could for sure something useful to attach attributes including a class one. Certainly, we need some tooling to annotate stuff as precision wavenumber attributes etc.

Cheers

philipp-baumann commented 1 year ago

I am happy to provide feedback, but I won't have the time to write a PR.

Great, I can implement it and you can test it/give inputs/feedback.

philipp-baumann commented 1 year ago

Short note, short relatively informal definitions:

list2_opusreader2: "lower-level" list returned from read_opus() when nfiles == 1
list_opusreader2: "high-level" list returned from read_opus() when nfiles > 1

It is rare to have just one file, but still we need this to differentiate.

henningte commented 1 year ago

Can you please briefly explain why there is a need to have an own class if there's only one spectrum in the list? In what cases will functions handle the output of read_opus() differently?

Edit:

To make my point more concrete, a suggestion could be to:

Make parse_opus() an internal function (per the documentation, it is intended as internal function) by removing the @export tag and including the @keywords internal tag. This function can return a list.
Make read_opus() return always a list of spectra and create an own class only for this list (this corresponds to list_opusreader2). Since the output always has the same structure, this makes postprocessing much easier and you won't have to write all methods for two classes.

philipp-baumann commented 1 year ago

Can you please briefly explain why there is a need to have an own class if there's only one spectrum in the list? In what cases will functions handle the output of read_opus() differently?

Just to make sure I understand you correctly. Does it not matter if a length 1 list is flattened? The structure is different in this case.

Edit:

To make my point more concrete, a suggestion could be to:

1. Make `parse_opus()` an internal function (per the documentation, it is intended as internal function) by removing the `@export` tag and including the `@keywords internal` tag. This function can return a `list`.

The question is if there is a use case a user might take control and call this? E.g. for own APIs of an organization.

2. Make `read_opus()` return always a list of spectra and create an own class only for this list (this corresponds to `list_opusreader2`). Since the output always has the same structure, this makes postprocessing much easier and you won't have to write all methods for two classes.

I am not sure, maybe I can add a short example.

philipp-baumann commented 1 year ago

The advantage of restricting exports is to have the exposed set of things more clean. Downside is if one ever makes down-stream stuff like packages that builds on low-level stuff of opusreader2.

philipp-baumann commented 1 year ago

devtools::load_all()
#> ℹ Loading opusreader2

file <- opus_file()
files <- rep(file, 10)

data_1 <- read_opus(dsn = file)
data <- read_opus(dsn = files)

identical(data_1, data[[1]])
#> [1] TRUE

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Ubuntu 22.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Zurich
#>  date     2023-02-14
#>  pandoc   2.9.2.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  !  package     * version date (UTC) lib source
#>     backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>     brio          1.1.3   2021-11-30 [1] CRAN (R 4.1.2)
#>     cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.1)
#>     callr         3.7.3   2022-11-02 [1] CRAN (R 4.1.3)
#>     cli           3.6.0   2023-01-09 [1] CRAN (R 4.1.3)
#>     crayon        1.5.2   2022-09-29 [1] CRAN (R 4.1.3)
#>     desc          1.4.2   2022-09-08 [1] CRAN (R 4.1.3)
#>     devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.3)
#>     digest        0.6.31  2022-12-11 [1] CRAN (R 4.1.3)
#>     ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>     evaluate      0.20    2023-01-17 [1] CRAN (R 4.1.3)
#>     fansi         1.0.4   2023-01-22 [1] CRAN (R 4.1.3)
#>     fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
#>     fs            1.6.1   2023-02-06 [1] CRAN (R 4.1.3)
#>     glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>     htmltools     0.5.4   2022-12-07 [1] RSPM (R 4.1.0)
#>     htmlwidgets   1.6.1   2023-01-07 [1] CRAN (R 4.1.3)
#>     httpuv        1.6.8   2023-01-12 [1] CRAN (R 4.1.3)
#>     knitr         1.42    2023-01-25 [1] CRAN (R 4.1.3)
#>     later         1.3.0   2021-08-18 [1] CRAN (R 4.1.1)
#>     lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.1.3)
#>     magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>     memoise       2.0.1   2021-11-26 [1] CRAN (R 4.1.2)
#>     mime          0.12    2021-09-28 [1] CRAN (R 4.1.2)
#>     miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
#>  VP opusreader2 * 0.1.0   2022-12-23 [?] Github (spectral-cockpit/opusreader2@102bde3) (on disk 0.0.0.9000)
#>     pillar        1.8.1   2022-08-19 [1] CRAN (R 4.1.3)
#>     pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.3)
#>     pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>     pkgload       1.3.2   2022-11-16 [1] CRAN (R 4.1.3)
#>     prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.2)
#>     processx      3.8.0   2022-10-26 [1] CRAN (R 4.1.3)
#>     profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.3)
#>     promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
#>     ps            1.7.2   2022-10-26 [1] CRAN (R 4.1.3)
#>     purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.3)
#>     R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>     Rcpp          1.0.10  2023-01-22 [1] CRAN (R 4.1.3)
#>     remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.3)
#>     reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>     rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.3)
#>     rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.1.3)
#>     rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.1.3)
#>     rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.1.3)
#>     sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>     shiny         1.7.4   2022-12-15 [1] RSPM (R 4.1.0)
#>     stringi       1.7.12  2023-01-11 [1] CRAN (R 4.1.3)
#>     stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.3)
#>     styler        1.5.1   2021-07-13 [1] CRAN (R 4.1.1)
#>     testthat    * 3.1.6   2022-12-09 [1] CRAN (R 4.1.3)
#>     tibble        3.1.8   2022-07-22 [1] CRAN (R 4.1.3)
#>     urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.3)
#>     usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.3)
#>     utf8          1.2.3   2023-01-31 [1] CRAN (R 4.1.3)
#>     vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.3)
#>     withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>     xfun          0.37    2023-01-31 [1] CRAN (R 4.1.3)
#>     xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
#>     yaml          2.3.7   2023-01-23 [1] CRAN (R 4.1.3)
#> 
#>  [1] /home/philipp/R/x86_64-pc-linux-gnu-library/4.1
#>  [2] /opt/R/4.1.3/lib/R/library
#> 
#>  V ── Loaded and on-disk version mismatch.
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

^{Created on 2023-02-14 by the reprex package (v2.0.1)}

would be nice if reprex would not mess with all these other packages I don't use/need ;-)

philipp-baumann commented 1 year ago

@henningte an option would also be that you wrap one level up again if you just find one spectrum.

philipp-baumann commented 1 year ago

by flattening I mean that `identical(data_1, data[[1]])

> [1] TRUE` which is a side effect.

henningte commented 1 year ago

But don't you prefer a consistent function --- a function which returns the same object type (a list of spectra) for the same input type (a character vector of file names)?

The question is if there is a use case a user might take control and call this? E.g. for own APIs of an organization.

Of course, this is a design question which is up to you :)

If read_opus() provides a consistent format which extracts all metadata from OPUS files and is not slow, what speaks against exposing only this function? Users who want their own API could also build their API on top of the output of read_opus(), couldn't they?

In any case, the license would allow to copy code. If someone goes that far to write their own API, then I think this shouldn't be too difficult for them ;)

Exposing only one function would have the advantages that the package becomes easier to maintain, to create, and also easier to understand for those with the most common use cases.

You could still create an additional class for single spectra objects, but I think it would be preferable if read_opus() had a consistent output and this additional class would have a different function in the package (not just a different length). For example, it could make perfectly sense if printing is different for a list of spectra and a single element of such a list. Then, it wold make sense to have two classes.

by flattening I mean that identical(data_1, data[[1]]) #> [1] TRUE which is a side effect.

Supposing that read_opus() would always return a list, you can still compare with identical(data_1, data[1]), couldn't you?

Maybe I don't see your point here :)

philipp-baumann commented 1 year ago

wait a min.

philipp-baumann commented 1 year ago

oke short rewrap, had a brief talk with @ThomasKnecht . Summarizing briefly here:

We like your suggestion of attaching a class to the output.
The attachment of the class should happen inside parse_opus()
Whether it is a list of opusreader2 object or a single opusreader2 instance does not matter.

philipp-baumann commented 1 year ago

Supposing that read_opus() would always return a list, you can still compare with identical(data_1, data[1]), couldn't you?

Maybe I don't see your point here :)

not to be pointless, guess have been doing too much cloud storage stuff, hence my view might be a bit convoluted. But wait, I got an idea. What is simpler? Undoing the unnesting or really having two classes?

philipp-baumann commented 1 year ago

You are right about speed and about the keeping core stuff in modules. Speed will never be an issue, rather memory at the moment because we unfold the wavenumber vector from seq from to by to a quite long vector for every spectrum.

philipp-baumann commented 1 year ago

@henningte you got us with type consistency. ;-) We want to have typed behavior.

philipp-baumann commented 1 year ago

ATM read_opus behaves a bit like mapply rather that Map as an analogy, so to speak we do like as simplify = TRUE if it was a functional.

ThomasKnecht commented 1 year ago

i see the point here. the problem is, that we allow the input of multiple files. the question is: should the fuction take care of it or the user. in my point of view, the class "opusreader2" should be assigned to the output of a single file. it has its special structure which is defined by the class. the multiple file output is basically just a list of these output objects. so the main structural question that i raise is: should we only export the function for one file and let the user loop over the files to get rid of the confusion?

philipp-baumann commented 1 year ago

i see the point here. the problem is, that we allow the input of multiple files. the question is should the fuction take care of it or the user. in my point of view, the class "opusreader2" should be assigned to the output of a single file. it has its special structure which is defined by the class. the multiple file output is basically just a list of these output objects. so the main structural question that i raise is: should we only export the function for one file and then the user loops over the files to get rid of the confusion?

Yes, agree. My point is why we did all this is to make it as easy as possible to the user. The dsn idea is quite powerful, and indeed on the downside introduces a side-effect, we cannot predict whether one or several OPUS binaries are in the folder. My question is, what if we do one regression sort of, and we provide always the same stable output, which is a list of readopus2 objects. i.e. we revert the unnesting of the list in the case of one file.

philipp-baumann commented 1 year ago

All interesting @ThomasKnecht and @henningte now we just have do make sure the snake does not bite its tail ;-) What speaks against exporting the high-level wrapper read_opus() and read_opus_impl(), which is the version for one file. ?

ThomasKnecht commented 1 year ago

we could still keep this wrapper function and publish it as well but we need change the name of the function. and then the output class could be something like list_opusreader2 containing the opusreader2 objects. but i am not a fan of assigning the opusreader2 class to the output list of all files since as i mentioned above it should be the class of a single output object! this has in the end the specific structure... its like having a list of tibbles or dataframes and in our case its a list of opusreader2 objects...

sorry for my bad and probably not easy to understand english!

philipp-baumann commented 1 year ago

we could still keep this wrapper function and publish it as well but we need change the name of the function. and then the output class could be something like list_opusreader2 containing the opusreader2 objects. but i am not a fan of assigning the opusreader2 class to the output list of all files since as i mentioned above it should be the class of a single output object! this has in the end the specific structure... its like having a list of tibbles or dataframes and in our case its a list of opusreader2 objects...

We could really think about renaming that read_opus_impl(), yes. And yes we should reserve this main class for the individual entries.

philipp-baumann commented 1 year ago

=> Let's make our life really simple, rename read_opus_impl() to read_opus_single(). then we export both read_opus(), the polyglot one and read_opus_single() the solitary.

read_opus_single() => class "readopus2"
read_opus() => classes "readopus2_list" and "readopus2"

philipp-baumann commented 1 year ago

Like this we have proper inheritance and type stability.

philipp-baumann commented 1 year ago

@ThomasKnecht and @henningte the PR in #72 would be ready to be merged. If you both agree, I would pull the merge button after a short confirmation of good to go by Thomas. Hope it addresses everything here :-) should

philipp-baumann commented 1 year ago

73 is the next PR to follow that tackles only exporting what is really needed/useful for both users and developers.

philipp-baumann commented 1 year ago

devtools::load_all()
#> ℹ Loading opusreader2

file <- opus_file()
files <- rep(file, 10)

data_1 <- read_opus_single(dsn = file)
data <- read_opus(dsn = files)

class(data_1)
#> [1] "opusreader2" "list"
class(data)
#> [1] "list_opusreader2" "list"
class(data[[1]])
#> [1] "opusreader2" "list"

^{Created on 2023-02-15 by the reprex package (v2.0.1)}

philipp-baumann commented 1 year ago

@henningte it is now merged and I just bumped v0.2.0 . If you are happy with that, feel free to close this ticket. Thanks again and cheers, Philipp

henningte commented 1 year ago

Thank you for the quick solution!

spectral-cockpit / opusreader2

Dedicated S3 class for the returned value of `read_opus()` planned? #68

66

> [1] TRUE` which is a side effect.

73 is the next PR to follow that tackles only exporting what is really needed/useful for both users and developers.