Not sure why I should be the author of that issue. Maybe it is because I made the issue out of this draft...
The main idea of having the filestorage here is that we don't lose any of the information:
- We don't want to lose output files, so that they can be re-used for later processing. So we need to keep those in the file storage.
- We don't want to lose information about how the outputs are generated. While we save literal inputs directly in the database, we save only links for complex inputs - and the data behind those links can be removed or replaced. So we need to keep those inputs in the file storage as well.
So, for the complex output references it is quite easy: we can just save them in the file storage & replace the original url with the one from the file storage.
I'm not that sure about the complex inputs at the moment. We must store them as well - just to make sure that we have all the information needed for the whole process chain.
But if we stored them in the same way as the outputs (just taking the url, fetching the file, storing it in the file storage & replacing the url in the further processing), we would get new urls all the time for the very same input data.
Explanation:
- Process 1: http://localhost/files/exposure/lima_large.json → http://filestorage/bucket/random_number/lima_large.json
- Process 2: http://localhost/files/exposure/lima_large.json → http://filestorage/bucket/different_number/lima_large.json
We then would not be able to see that both processes worked with the very same input. So we really need some kind of management to make sure that we don't process those files multiple times.
One easy strategy would be to just save the original urls & the uploaded ones in a table. If we then re-processed a url, we would see that we already uploaded the file & could just re-use that url.
However, this has the problem that the data behind a url could change over time. Think of a WFS or WCS as the best example (the layers stay the same, but the underlying data changes).
So we could then also add checksums (sha1sums, as git does). That way we also make sure that we keep track of those changes.
However, this has the consequence that we have to download the file in any case, which is a little bit annoying to do with every processing step. Maybe we could use the Content-Length header instead - and just rely on the fact that we either get a different url for changed outputs, or at least a different size from a HEAD request.
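A small sketch of that HEAD-based shortcut (the stored size is assumed to have been recorded at upload time; this is just an illustration of the idea, not the actual implementation):

```python
import requests

def probably_unchanged(url, stored_content_length):
    """Cheap check via a HEAD request: if the reported Content-Length matches the
    size we recorded when we uploaded the file, assume the content did not change."""
    response = requests.head(url, allow_redirects=True)
    content_length = response.headers.get("Content-Length")
    if content_length is None:
        # The server does not report a size, so this shortcut cannot be used.
        return False
    return int(content_length) == stored_content_length
```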
With my current understanding I suggest the following structure:
| original_url | sha1sum | file_storage_url |
| --- | --- | --- |
| http://public/url/file.ext | a3452dwd | http://filestorage/riesgosfiles/a3452dwd |
| http://temporary1.ext | aaaaaaaa | http://filestorage/riesgosfiles/aaaaaaaa |
| http://temporary1.ext | bbbbbbbb | http://filestorage/riesgosfiles/bbbbbbbb |
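For illustration, a minimal sketch of how the two lookups used by the function below could be built from that table (the row class and how it is loaded are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileStorageEntry:
    """One row of the proposed table."""
    original_url: str
    sha1sum: str
    file_storage_url: str

def build_lookups(entries):
    """Build the lookups used by replace_url_if_needed below:
    a set of all file storage urls and a dict mapping sha1sum -> file_storage_url."""
    file_storage_urls = {entry.file_storage_url for entry in entries}
    sha1sums = {entry.sha1sum: entry.file_storage_url for entry in entries}
    return file_storage_urls, sha1sums
```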
What the wrappers must do when they get a complex input reference (to make it easy, in python):
```python
import hashlib
import requests

def replace_url_if_needed(url):
    # file_storage_urls is a set built from the table with just the `file_storage_url` entries
    # sha1sums is a dict built from the table that maps the `sha1sum` to the `file_storage_url` entries
    if url in file_storage_urls:
        # The url already points into the file storage, nothing to do.
        return url
    content = requests.get(url).content
    sha1sum = hashlib.sha1(content).hexdigest()
    if sha1sum in sha1sums:
        # We already uploaded a file with exactly this content, so reuse it.
        return sha1sums[sha1sum]
    result_url = filestorage.upload(content)
    store_in_db(url, sha1sum, result_url)
    return result_url
```
For the complex output references we could just reuse the exact same method (even if it would not be strictly necessary, as we would normally create completely new output files with different checksums).
I'm not that sure right now how to handle the case that the filestorage is not completely readable from the outside.
Maybe it makes sense to then add endpoints for downloading the inputs & outputs from the outside: /complex-inputs/<id>/file and /complex-outputs/<id>/file.
Benefits:
- easy to implement (loading the entry with the id, fetching the url & returning the content - see the sketch after this list)
- dumping a collection would not need any more db lookups (just have the template for the url to the file endpoint)
- easy to restrict permissions if needed (based on the job & order ids)
Cons:
- two endpoints needed for basically the very same purpose (downloading files) & almost the same implementation
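A rough sketch of how such a download endpoint could look - Flask and the load_complex_input helper are just assumptions for illustration, not the actual backend implementation:

```python
import requests
from flask import Flask, Response, abort

app = Flask(__name__)

@app.route("/complex-inputs/<int:complex_input_id>/file")
def download_complex_input(complex_input_id):
    # load_complex_input is a hypothetical helper that loads the db entry by id.
    entry = load_complex_input(complex_input_id)
    if entry is None:
        abort(404)
    # Fetch the file from the stored link (pointing to the file storage)
    # and return the content to the caller.
    upstream = requests.get(entry.link)
    return Response(upstream.content, content_type=upstream.headers.get("Content-Type"))
```

The /complex-outputs/<id>/file endpoint would look almost identical - which is exactly the duplication listed under Cons.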
I also think that, in case we need this mechanism to download via the backend, it then also makes sense to extend the method that replaces the urls, so that we can also handle links to the /complex-inputs/<id>/file and /complex-outputs/<id>/file endpoints - those should then point to the real url that they have in the db (the one pointing to the file storage).
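That extension could look roughly like this - the pattern and the lookup_stored_link helper are hypothetical; it only illustrates mapping a backend link back to the stored file storage url:

```python
import re

# Hypothetical pattern for links that point to our own download endpoints.
BACKEND_LINK_PATTERN = re.compile(r"/complex-(inputs|outputs)/(?P<id>\d+)/file$")

def resolve_backend_link(url):
    """Map /complex-inputs/<id>/file or /complex-outputs/<id>/file links back to the
    real url stored in the db (the one pointing to the file storage); leave other urls untouched."""
    match = BACKEND_LINK_PATTERN.search(url)
    if match is None:
        return url
    # lookup_stored_link is a hypothetical db helper returning the stored link column.
    return lookup_stored_link(kind=match.group(1), entry_id=int(match.group("id")))
```

replace_url_if_needed could call this first, before any download or checksum handling.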
Especially for the complex inputs I wonder if it makes more sense to also store the original url in the table - as it has important information about where this input comes from. Having only a url to the file storage (or our backend) doesn't give any idea about what kind of data it could be & where it is from.
(Things like https://rz-vm140.gfz-potsdam.de/wps/RetriveResult?id=abcd... still don't say a lot, but at least we can see it is the output of a wps server @ GFZ - and not from AWI, DLR, or some other institution.)
Nevertheless it would be possible to extract that with the data structure that we have, plus that one single new table to map the urls.
Mh, but the overall point of allowing downloads only via the backend also makes the interaction with the WPS itself more difficult - as it must be able to fetch the data in any case (and without sending further headers).
Those links just have to work from the outside - otherwise the WPS can't process the data at all.
Considering this, I see a larger problem with downloads via those /complex-inputs/<id>/file and /complex-outputs/<id>/file endpoints. It doesn't make any sense from the WPS server's perspective.
Regarding this, I also don't really see benefits in having a /download/<filename_in_filestorage> endpoint.
The file storage itself will always be faster than the additional detour via the backend. And as long as the bucket can be accessed for downloads without additional credentials (no idea what the GFZ RZ policies will be here in the future), it just doesn't make any sense for permission management either.
(Sorry for spamming you here, but this writing helps me with thinking & brainstorming...)
If we want to make the bucket accessible, we would need a reverse proxy (nginx) that includes the minio endpoint with our bucket (but we already did that in the past for the sensor management system).
Maybe some thoughts about the handling of the complex input urls when we send them to the wps server.
There are actually (at least) 2 ways:
Regarding the download via the backend:
I think with nginx we could still use the minio client with a backend-like url (like /api/v1/downloads/ then).
It would not be handled by the backend (but by minio directly), but it would look like it is done by a backend endpoint - and could be replaced if needed.
(Still, I don't really see a point in this idea anymore - as the WPS must be able to access those data in any case, without further authentication mechanisms.)
I think with the reverse proxy it should also work if we used an external file storage service (say s3.gfz-potsdam.de).
Coming back to the replace_url_if_needed function from above:
Even if the probability of file collisions with sha1sum is really low, I guess I should still consider the original url before I reuse a file storage url.
That has the downside that I may store the very same file another time in the file storage, but as it then comes from a different origin, it could be meaningful to do so nevertheless - just to be able to point to the different original url.
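A minimal sketch of that variant - it only changes the lookup key to include the original url (the known_files dict replaces the sha1sums dict from above and is hypothetical):

```python
import hashlib
import requests

def replace_url_considering_origin(url):
    # known_files maps (original_url, sha1sum) -> file_storage_url, so a matching
    # checksum is only reused if it comes from the very same original url.
    if url in file_storage_urls:
        return url
    content = requests.get(url).content
    sha1sum = hashlib.sha1(content).hexdigest()
    key = (url, sha1sum)
    if key in known_files:
        return known_files[key]
    result_url = filestorage.upload(content)
    store_in_db(url, sha1sum, result_url)
    known_files[key] = result_url
    return result_url
```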
The first try is implemented in https://github.com/riesgos/async/pull/33
(but honestly I expect that it still has quite a lot of bugs & situations in which it doesn't work properly).
Closing as minio is now nicely integrated. Feel free to re-open if discussion is still required.
Make sure that data-model is compatible with file-storage
Note: the wrapper must write files into the file storage. That means the wrapper must know about the file storage's existence. (But the wrapper also already must know about the database's existence.)
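For illustration, a minimal sketch of how a wrapper could write an output file into the file storage with the minio client - the endpoint, the credentials and the returned url format are assumptions here:

```python
import io
from minio import Minio

# Endpoint & credentials are placeholders - they would come from the wrapper configuration.
client = Minio("filestorage:9000", access_key="minio-access-key", secret_key="minio-secret-key", secure=False)

def upload_output(content: bytes, object_name: str) -> str:
    """Upload raw output bytes into the riesgosfiles bucket and return a download url."""
    client.put_object("riesgosfiles", object_name, io.BytesIO(content), length=len(content))
    # Assumes the bucket is readable without credentials, so a plain http url works.
    return f"http://filestorage:9000/riesgosfiles/{object_name}"
```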