Not sure why I should be the author of that issue. Maybe it is because I made the issue out of this draft...
The main idea of having the filestorage here is that we don't lose any of the information:
- We don't want to lose output files, so that they can be re-used for later processing. So we need to keep those in the file storage.
- We don't want to lose information about how the outputs are generated. While we save literal inputs directly in the database, we save only links for complex inputs - and the data behind those links can be removed or replaced. So we need to keep those inputs in the file storage as well.
So, for the complex output references it is quite easy: we can just save them in the file storage & replace the original url with the one from the file storage.
I'm not that sure about the complex inputs at the moment. We must store them as well - just to make sure that we have all the information needed for the whole process chain.
But if we stored them in the same way as the outputs (just taking the url, fetching the file, storing it in the file storage & replacing the url in the further processing), we would get new urls all the time for the very same input data.
Explanation:
- Process 1: http://localhost/files/exposure/lima_large.json → http://filestorage/bucket/random_number/lima_large.json
- Process 2: http://localhost/files/exposure/lima_large.json → http://filestorage/bucket/different_number/lima_large.json
We then would not be able to see that both processes worked with the very same input. So we really need some kind of management to make sure that we don't process those files multiple times.
One easy strategy would be to just save the original urls & the uploaded ones in a table. If we then re-processed a url, we would see that we already uploaded the file & could just re-use that url.
However, this has the problem that the data behind a url could change over time. Think of a WFS or WCS as the best example (the layers stay the same, but the underlying data changes).
So we could then also add checksums (sha1sums, as git does). That way we also make sure that we keep track of those changes.
However, this has the consequence that we have to download the file in any case, which is a little bit annoying to do with every processing step. Maybe we could use the Content-Length header instead - and just rely on the fact that we either get a different url for changed outputs, or at least a different size from a HEAD request.
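A small sketch of that HEAD-based shortcut (the stored size is assumed to have been recorded at upload time; this is just an illustration of the idea, not the actual implementation):

```python
import requests

def probably_unchanged(url, stored_content_length):
    """Cheap check via a HEAD request: if the reported Content-Length matches the
    size we recorded when we uploaded the file, assume the content did not change."""
    response = requests.head(url, allow_redirects=True)
    content_length = response.headers.get("Content-Length")
    if content_length is None:
        # The server does not report a size, so this shortcut cannot be used.
        return False
    return int(content_length) == stored_content_length
```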
With my current understanding I suggest the following structure:
| original_url | sha1sum | file_storage_url |
| --- | --- | --- |
| http://public/url/file.ext | a3452dwd | http://filestorage/riesgosfiles/a3452dwd |
| http://temporary1.ext | aaaaaaaa | http://filestorage/riesgosfiles/aaaaaaaa |
| http://temporary1.ext | bbbbbbbb | http://filestorage/riesgosfiles/bbbbbbbb |
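For illustration, a minimal sketch of how the two lookups used by the function below could be built from that table (the row class and how it is loaded are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileStorageEntry:
    """One row of the proposed table."""
    original_url: str
    sha1sum: str
    file_storage_url: str

def build_lookups(entries):
    """Build the lookups used by replace_url_if_needed below:
    a set of all file storage urls and a dict mapping sha1sum -> file_storage_url."""
    file_storage_urls = {entry.file_storage_url for entry in entries}
    sha1sums = {entry.sha1sum: entry.file_storage_url for entry in entries}
    return file_storage_urls, sha1sums
```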
What the wrappers must do when they get a complex input reference (to make it easy, in python):
```python
import hashlib
import requests

def replace_url_if_needed(url):
    # file_storage_urls is a set built from the table with just the `file_storage_url` entries
    # sha1sums is a dict built from the table that maps the `sha1sum` to the `file_storage_url` entries
    if url in file_storage_urls:
        # The url already points into the file storage, nothing to do.
        return url
    content = requests.get(url).content
    sha1sum = hashlib.sha1(content).hexdigest()
    if sha1sum in sha1sums:
        # We already uploaded a file with exactly this content, so reuse it.
        return sha1sums[sha1sum]
    result_url = filestorage.upload(content)
    store_in_db(url, sha1sum, result_url)
    return result_url
```
For the complex output references we could just reuse the exact same method (even if it would not be strictly necessary, as we would normally create completely new output files with different checksums).
I'm not that sure right now how to handle the case that the filestorage is not completely readable from the outside.
Maybe it makes sense to then add endpoints for downloading the inputs & outputs from the outside: /complex-inputs/<id>/file and /complex-outputs/<id>/file.
Benefits:
- easy to implement (loading the entry with the id, fetching the url & returning the content - see the sketch after this list)
- dumping a collection would not need any more db lookups (just have the template for the url to the file endpoint)
- easy to restrict permissions if needed (based on the job & order ids)
Cons:
- two endpoints needed for basically the very same purpose (downloading files) & almost the same implementation
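A rough sketch of how such a download endpoint could look - Flask and the load_complex_input helper are just assumptions for illustration, not the actual backend implementation:

```python
import requests
from flask import Flask, Response, abort

app = Flask(__name__)

@app.route("/complex-inputs/<int:complex_input_id>/file")
def download_complex_input(complex_input_id):
    # load_complex_input is a hypothetical helper that loads the db entry by id.
    entry = load_complex_input(complex_input_id)
    if entry is None:
        abort(404)
    # Fetch the file from the stored link (pointing to the file storage)
    # and return the content to the caller.
    upstream = requests.get(entry.link)
    return Response(upstream.content, content_type=upstream.headers.get("Content-Type"))
```

The /complex-outputs/<id>/file endpoint would look almost identical - which is exactly the duplication listed under Cons.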
I also think that, in case we need this mechanism to download via the backend, it then also makes sense to extend the method that replaces the urls, so that we can also handle links to the /complex-inputs/<id>/file and /complex-outputs/<id>/file endpoints - those should then point to the real url that they have in the db (the one pointing to the file storage).
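That extension could look roughly like this - the pattern and the lookup_stored_link helper are hypothetical; it only illustrates mapping a backend link back to the stored file storage url:

```python
import re

# Hypothetical pattern for links that point to our own download endpoints.
BACKEND_LINK_PATTERN = re.compile(r"/complex-(inputs|outputs)/(?P<id>\d+)/file$")

def resolve_backend_link(url):
    """Map /complex-inputs/<id>/file or /complex-outputs/<id>/file links back to the
    real url stored in the db (the one pointing to the file storage); leave other urls untouched."""
    match = BACKEND_LINK_PATTERN.search(url)
    if match is None:
        return url
    # lookup_stored_link is a hypothetical db helper returning the stored link column.
    return lookup_stored_link(kind=match.group(1), entry_id=int(match.group("id")))
```

replace_url_if_needed could call this first, before any download or checksum handling.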
Especially for the complex inputs I wonder if it makes more sense to also store the original url in the table - as it has important information about where this input comes from. Having only a url to the file storage (or our backend) doesn't give any idea about what kind of data it could be & where it is from.
(Things like https://rz-vm140.gfz-potsdam.de/wps/RetriveResult?id=abcd... still don't say a lot, but at least we can see it is the output of a wps server @ GFZ - and not from AWI, DLR, or some other institution.)
Nevertheless it would be possible to extract that with the data structure that we have, plus that one single new table to map the urls.
Mh, but the overall point of allowing downloads only via the backend also makes the interaction with the WPS itself more difficult - as it must be able to fetch the data in any case (and without sending further headers).
Those links just have to work from the outside - otherwise the WPS can't process the data at all.
Considering this, I see a larger problem with downloads via those /complex-inputs/<id>/file and /complex-outputs/<id>/file endpoints. It doesn't make any sense from the WPS server's perspective.
Regarding this, I also don't really see benefits in having a /download/<filename_in_filestorage> endpoint.
The file storage itself will always be faster than the additional detour via the backend. And as long as the bucket can be accessed for downloads without additional credentials (no idea what the GFZ RZ policies will be here in the future), it just doesn't make any sense for permission management either.
(Sorry for spamming you here, but this writing helps me with thinking & brainstorming...)
If we want to make the bucket accessible, we would need a reverse proxy (nginx) that includes the minio endpoint with our bucket (but we already did that in the past for the sensor management system).
Maybe some thoughts about the handling of the complex input urls when we send them to the wps server.
There are actually (at least) 2 ways:
Regarding the download via the backend:
I think with nginx we could still use the minio client with a backend-like url (like /api/v1/downloads/ then).
It would not be handled by the backend (but by minio directly), but it would look like it is done by a backend endpoint - and could be replaced if needed.
(Still, I don't really see a point in this idea anymore - as the WPS must be able to access those data in any case, without further authentication mechanisms.)
I think with the reverse proxy it should also work if we used an external file storage service (say s3.gfz-potsdam.de).
Coming back to the replace_url_if_needed function from above:
Even if the probability of file collisions with sha1sum is really low, I guess I should still consider the original url before I reuse a file storage url.
That has the downside that I may store the very same file another time in the file storage, but as it then comes from a different origin, it could be meaningful to do so nevertheless - just to be able to point to the different original url.
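A minimal sketch of that variant - it only changes the lookup key to include the original url (the known_files dict replaces the sha1sums dict from above and is hypothetical):

```python
import hashlib
import requests

def replace_url_considering_origin(url):
    # known_files maps (original_url, sha1sum) -> file_storage_url, so a matching
    # checksum is only reused if it comes from the very same original url.
    if url in file_storage_urls:
        return url
    content = requests.get(url).content
    sha1sum = hashlib.sha1(content).hexdigest()
    key = (url, sha1sum)
    if key in known_files:
        return known_files[key]
    result_url = filestorage.upload(content)
    store_in_db(url, sha1sum, result_url)
    known_files[key] = result_url
    return result_url
```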
The first try is implemented in https://github.com/riesgos/async/pull/33
(but honestly I expect that it still has quite a lot of bugs & situations in which it doesn't work properly).
Closing as minio is now nicely integrated. Feel free to re-open if discussion is still required.
Make sure that data-model is compatible with file-storage
Note: the wrapper must write files into the file storage. That means the wrapper must know about the file storage's existence. (But the wrapper also already must know about the database's existence.)
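For illustration, a minimal sketch of how a wrapper could write an output file into the file storage with the minio client - the endpoint, the credentials and the returned url format are assumptions here:

```python
import io
from minio import Minio

# Endpoint & credentials are placeholders - they would come from the wrapper configuration.
client = Minio("filestorage:9000", access_key="minio-access-key", secret_key="minio-secret-key", secure=False)

def upload_output(content: bytes, object_name: str) -> str:
    """Upload raw output bytes into the riesgosfiles bucket and return a download url."""
    client.put_object("riesgosfiles", object_name, io.BytesIO(content), length=len(content))
    # Assumes the bucket is readable without credentials, so a plain http url works.
    return f"http://filestorage:9000/riesgosfiles/{object_name}"
```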