spectral-cockpit / opusreader2

Read OPUS binary files from Fourier-Transform Infrared (FT-IR) spectrometers of the company Bruker Optics GmbH & Co. in base R
https://spectral-cockpit.github.io/opusreader2/
MIT License
18 stars 2 forks source link

Adding a function to read raw string (without having to download the binary file) #25

Closed pierreroudier closed 2 years ago

pierreroudier commented 2 years ago

This PR reproduces the function opus_read_raw from the original opusreader.

It splits the read_opus_file function and extracts its core so that a raw stream can be passed to the function.

I'm not gonna lie, that's a functionality that is relevant for my personal use case -- where we "stream" directly the content of OPUS file over the web using web services!

It does not really affect the overall structure of the package/the code, though, and could even be hidden from the namespace should you choose to.

philipp-baumann commented 2 years ago

Hi Pierre, thanks for proposing :-) I think the streaming as you describe is an interesting use case. I think if you rely on streaming we for sure can support it in the package. @ThomasKnecht what is your opionion? We then have a separate function read_opus_file() that down in the stack calls read_opus_raw()?

philipp-baumann commented 2 years ago

Just out of interest, are you using SFTP protocol to achieve this? At the moment what I did at work is to configure the Windows measurment computer as Win OpenSSH server, so in a decent net I get 50MB/s read speed which is fine.

ThomasKnecht commented 2 years ago

yea, i have a look at it. do you have an example stream connection?

ThomasKnecht commented 2 years ago

maybe we create a method that checks if it is a file path or a url and dependent on that, it open a different connection

ThomasKnecht commented 2 years ago

i mesn the only thing that is different is the readBin. so we could make a read_opus function instead of _file or _raw and depending on the input it streams directly from the web or it uses the file

philipp-baumann commented 2 years ago

maybe we create a method that checks if it is a file path or a url and dependent on that, it open a different connection

agree, so this would then point back to the idea of implementing different drivers, like stars sf terra etc. do :-)

philipp-baumann commented 2 years ago

If you all agree, I would go like Pierre suggested and Thomas added to it:

  1. keep the first draft of read_opus_raw()
  2. implement read_opus(), which has a mechanism to differentiate disk file or web streaming
ThomasKnecht commented 2 years ago

i can make a draft later on

pierreroudier commented 2 years ago

Just out of interest, are you using SFTP protocol to achieve this? At the moment what I did at work is to configure the Windows measurment computer as Win OpenSSH server, so in a decent net I get 50MB/s read speed which is fine.

No, we have set up an API sitting on top of our soil observation database, where spectra (and other binary files) are stored as BLOB. We just query the spectra URIs using HTTP GET to download the BLOBs (which are raw).

pierreroudier commented 2 years ago

If you all agree, I would go like Pierre suggested and Thomas added to it:

1. keep the first draft of `read_opus_raw()`

2. implement `read_opus()`, which has a mechanism to differentiate disk file or web streaming

Yes, I think this is a good idea. read_opus being the one-stop shop, with slightly lower-level/more detailed methods available should the case arise (read_opus_file, read_opus_raw).

pierreroudier commented 2 years ago

maybe we create a method that checks if it is a file path or a url and dependent on that, it open a different connection

It's worth noting that the PR doesn't do that, it allows to read raw strings directly.

In my case, you still have to HTTP GET the BLOB and read it (I'm using rawConnection).

philipp-baumann commented 2 years ago

maybe we create a method that checks if it is a file path or a url and dependent on that, it open a different connection

It's worth noting that the PR doesn't do that, it allows to read raw strings directly.

In my case, you still have to HTTP GET the BLOB and read it (I'm using rawConnection).

@ThomasKnecht you mentioned we could as well implement an {rcurl} method for it. - @pierreroudier I guess your setup would apply for it right? To avoid a hard dependency, we might work with if (!require("rcurl")) and trigger an informative error. Same applies for the output arg, i.e. if we want to go with {jsonlite} etc.

ThomasKnecht commented 2 years ago

28

pierreroudier commented 2 years ago

@ThomasKnecht you mentioned we could as well implement an {rcurl} method for it. - @pierreroudier I guess your setup would apply for it right? To avoid a hard dependency, we might work with if (!require("rcurl")) and trigger an informative error. Same applies for the output arg, i.e. if we want to go with {jsonlite} etc.

That's a good question.

I'm tempted to say that in this case, I would handle this /outside/ the {opusreader2} package -- currently I have a dedicated package that works with our Science API.

This way it would keep the package simple, scoped, and nimble.

philipp-baumann commented 2 years ago

@ThomasKnecht you mentioned we could as well implement an {rcurl} method for it. - @pierreroudier I guess your setup would apply for it right? To avoid a hard dependency, we might work with if (!require("rcurl")) and trigger an informative error. Same applies for the output arg, i.e. if we want to go with {jsonlite} etc.

That's a good question.

I'm tempted to say that in this case, I would handle this /outside/ the {opusreader2} package -- currently I have a dedicated package that works with our Science API.

This way it would keep the package simple, scoped, and nimble.

I think the implementation Thomas drafted now is quite modular and simple, therefore I would like to include the new functionality 😎 I am thinking of spectral platforms, where we would like to be as flexible as possible. For example, if we put the binaries on an s3 storage, on a deducated instance, stream via rcurl and read directly is rather efficient.

philipp-baumann commented 2 years ago

@ThomasKnecht you mentioned we could as well implement an {rcurl} method for it. - @pierreroudier I guess your setup would apply for it right? To avoid a hard dependency, we might work with if (!require("rcurl")) and trigger an informative error. Same applies for the output arg, i.e. if we want to go with {jsonlite} etc.

That's a good question.

I'm tempted to say that in this case, I would handle this /outside/ the {opusreader2} package -- currently I have a dedicated package that works with our Science API.

This way it would keep the package simple, scoped, and nimble.

I think the implementation Thomas drafted now is quite modular and simple, therefore I would like to include the new functionality 😎 I am thinking of spectral platforms, where we would like to be as flexible as possible. For example, if we put the binaries on an s3 storage, on a deducated instance, stream via rcurl and read directly is rather efficient.a

philipp-baumann commented 2 years ago

@ThomasKnecht you mentioned we could as well implement an {rcurl} method for it. - @pierreroudier I guess your setup would apply for it right? To avoid a hard dependency, we might work with if (!require("rcurl")) and trigger an informative error. Same applies for the output arg, i.e. if we want to go with {jsonlite} etc.

That's a good question. I'm tempted to say that in this case, I would handle this /outside/ the {opusreader2} package -- currently I have a dedicated package that works with our Science API. This way it would keep the package simple, scoped, and nimble.

I think the implementation Thomas drafted now is quite modular and simple, therefore I would like to include the new functionality sunglasses I am thinking of spectral platforms, where we would like to be as flexible as possible. For example, if we put the binaries on an s3 storage, on a deducated instance, stream via rcurl and read directly is rather efficient.

Because we avoid dependencies, there is no extra cost. plus rcurl is really stable.

philipp-baumann commented 2 years ago

@pierreroudier we solved it with https://github.com/spectral-cockpit/opusreader2/pull/28

so now you can input raw

philipp-baumann commented 2 years ago

@pierreroudier we solved it with https://github.com/spectral-cockpit/opusreader2/pull/28

so now you can input raw

pierreroudier commented 2 years ago

Excellent -- so should I cancel/close this PR then?

philipp-baumann commented 2 years ago

Excellent -- so should I cancel/close this PR then?

we could leave it open, to remind that we can implement a URLs in the dsn (data source name) of read_opus