kids-first / kf-api-dataservice

:file_cabinet: Primary API for interacting with the Kids First data
http://kf-api-dataservice.kidsfirstdrc.org
Apache License 2.0
5 stars 2 forks source link

I believe we're using gen3/indexd wrong #487

Open fiendish opened 5 years ago

fiendish commented 5 years ago

Indexd is a permalink index. It is supposed to keep a persistent reference to a file wherever the file happens to be. It does this by associating the [hash & file size] of the registered file with a new unique identifier.

Expected usage of indexd comes from the description for Sheepdog and a gen3 forum response on the subject of deleting IDs:

Sheepdog checks if the file being submitted has a hash & file size that match anything currently in indexd and if so uses the returned document GUID as the object ID reference. If no match is found in Indexd then a new record is created and stored in Indexd.

I think in general we assume Data GUIDs are never deleted. If the underlying data gets retracted the Data GUID would simply remove the associated urls to the actual files but remain as a valid Data GUID. (https://forums.gen3.org/t/deleting-data-guids/34/2)

If the same file is ever uploaded again, it is supposed to be able to be linked to the same ID as before. Indexd should reliably map GUID->BLOB_DATA forever, and if we can't (or don't want to) do that then there's no point in using indexd in the first place.

This means that entries should never be deleted from indexd for any reason, because if an entry is deleted then subsequent introduction of the file would not reconnect to the previously established ID. Redaction and other forms of access control need to happen at a layer other than ID existence.

So we're currently making two rather large errors:

1) We don't generate stable content-based hashes of the files. S3 ETags are unique instance identifiers on S3, but notably are not stable content-based hashes, because they are generated differently depending on how the file is uploaded and which S3 encryption option is used. Checking the index for something that isn't a stable content-based hash violates most (all?) of the purpose of Gen3, because introduction of the file on another platform or even with just different S3 settings will not transfer the original ID unless we also forever transport the original non-standard Etags around with the files, which would be a bad design principle requiring forever preserving a significant bit of arcane knowledge about historical shortcuts.

2) We delete IDs from the index (upon removal of an attached entity from the dataservice) when we're not supposed to do that. Lack of presence in the dataservice, momentary or otherwise, should not disrupt stably mapping GUID->BLOB_DATA. Currently subsequent reintroduction of a file after removal of its associated GF entity from the dataservice will not get assigned the original ID, resulting in broken references.

Fixing (2) will also make it sane to delete and reload studies.

fiendish commented 5 years ago

Another consequence of this is that gen3 can't keep up with parallel edits and starts throwing 500 errors.

See the apps-prd/dataservice-api cloudwatch logs from 2019-03-15

dankolbman commented 5 years ago

@fiendish ☹️

This is going to be a problem no matter, probably more so if we pull out all the data into our database and only load it in batch upon release.

fiendish commented 5 years ago

I think we need to remove all interaction between the dataservice GF and gen3. Make the dataservice GF POST/PATCH only take the file's gen3 uid which must already exist and do not return size/hash/name or any other forwarded data from gen3 on a GF GET request to the dataservice. Do not delete from gen3 when deleting a GF from the dataservice. We can keep putting the file path into external_id like we mostly currently do already which is needed for local testing anyway because local dataservice instances don't use gen3 at all.

If portal ETL needs file sizes etc, they can ask gen3 directly.

fiendish commented 5 years ago

Compare also

(genomic-files endpoint)

time curl https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?study_id=SD_JWS3V24D real 0m29.380s

vs

(diagnoses endpoint)

time curl https://kf-api-dataservice.kidsfirstdrc.org/diagnoses?study_id=SD_JWS3V24D real 0m0.209s

or

(sequencing-experiments endpoint)

time curl https://kf-api-dataservice.kidsfirstdrc.org/sequencing-experiments?study_id=SD_JWS3V24D real 0m1.234s

dankolbman commented 5 years ago

Were you hitting all of these from the VPN?

fiendish commented 5 years ago

Were you hitting all of these from the VPN?

yes

dankolbman commented 5 years ago

I see, it wouldn't have made a difference anyway, I thought you were also testing indexd directly. This is already known as an issue and is made somewhat better by #448

fiendish commented 5 years ago

Listing may be made somewhat better by PR #448, but the significant delay on doing anything to any GF won't be improved. I sent PATCH to the GF endpoint over 30,000 times today, and I had to do it sequentially because of https://github.com/kids-first/kf-api-dataservice/issues/487#issuecomment-473942739. It took hours to complete instead of minutes.

And PR #448 doesn't address the primary problem https://github.com/kids-first/kf-api-dataservice/issues/487#issue-391185180 that externally-facing file download links will get broken all the time when they're never supposed to break. That's a big deal.

I mean...I can tell my code to just keep retrying when the server errors, even though I think it shouldn't error, and I can tell my code to only ever send one request at a time to reduce the incidence of said errors, and then I can start it running and walk away for a few hours, or a few days, or a few weeks. It's annoying, and I shouldn't have to do that because I'm not actually desiring to interact with gen3 in any way at all when I hit the GF endpoint, but at least it doesn't break the outside world. Failing to put the "perma" in "permalink", however, has potentially terrible consequences for the project's reputation.

I also think that fixing this issue as suggested in https://github.com/kids-first/kf-api-dataservice/issues/487#issuecomment-473947983 (by completely decoupling the dataservice from gen3 and no longer proxy forwarding file size/hashes metadata) means that PR #448 becomes obsolete.

dankolbman commented 5 years ago

In order to move forward with this, we need a plan for the following: 1) Add fields that are in indexd to dataservice and remove indexd integration 2) Migrate existing data from indexd onto the new fields in dataservice 3) Write ETL to sync dataservice to indexd. I would suggest doing this during a release via a task service with the caveat that we would have to introduce a dependency chain into the coordinator.

I presume the ETL process will update dataservice files with some indexd uuid for the file, if not, we may have to make changes to the Portal ETL to extract from indexd.

fiendish commented 5 years ago

Add fields that are in indexd to dataservice and remove indexd integration Migrate existing data from indexd onto the new fields in dataservice

Or just not. There's no great reason for the dataservice to reflect what indexd stores.

We should:

1) register files with gen3 once, unrelated to GF entry changes. I hear this can be done automatically by S3.

2) During POST/PATCH to the dataservice genomic-files endpoint, say which gen3 UID to use for the given GF entity. The dataservice should store only the gen3 UID, and anyone who needs more data about the file (definitely not us) should go to gen3. For debugging only, we can continue to put the file path into external_id.

fiendish commented 5 years ago

Possibly adding more weight to getting this done and over with, apparently we don't handle gen3 errors humanely in the dataservice either.

Note: I do not want us to fix this error handling. I want us to do https://github.com/kids-first/kf-api-dataservice/issues/487#issuecomment-489681015 . This is just more information about how fragile the gen3 interface is.

See what passing an invalid md5 checksum (it was an S3 etag which is often not an md5 of the contents) looks like in our logs:

18:13:31
2019-05-07 18:13:31,600 DEBUG Starting new HTTPS connection (1): data.kidsfirstdrc.org:443
18:13:31
2019-05-07 18:13:31,628 DEBUG https://data.kidsfirstdrc.org:443 "POST /index/index/ HTTP/1.1" 400 304
18:13:31
[2019-05-07 18:13:31,629] ERROR in app: Exception on /genomic-files/ [POST]
18:13:31
Traceback (most recent call last):
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
18:13:31
response = self.full_dispatch_request()
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
18:13:31
rv = self.handle_user_exception(e)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
18:13:31
reraise(exc_type, exc_value, tb)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
18:13:31
raise value
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
18:13:31
rv = self.dispatch_request()
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
18:13:31
return self.view_functions[rule.endpoint](**req.view_args)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 88, in view
18:13:31
return self.dispatch_request(*args, **kwargs)
18:13:31
File "/app/dataservice/api/common/views.py", line 194, in dispatch_request
18:13:31
resp = super(CRUDView, self).dispatch_request(*args, **kwargs)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 158, in dispatch_request
18:13:31
return meth(*args, **kwargs)
18:13:31
File "/app/dataservice/api/genomic_file/resources.py", line 111, in post
18:13:31
db.session.commit()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/scoping.py", line 157, in do
18:13:31
return getattr(self.registry(), name)(*args, **kwargs)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 921, in commit
18:13:31
self.transaction.commit()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 461, in commit
18:13:31
self._prepare_impl()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 441, in _prepare_impl
18:13:31
self.session.flush()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2192, in flush
18:13:31
self._flush(objects)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2312, in _flush
18:13:31
transaction.rollback(_capture_exception=True)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
18:13:31
compat.reraise(exc_type, exc_value, exc_tb)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
18:13:31
raise value
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2276, in _flush
18:13:31
flush_context.execute()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
18:13:31
rec.execute(self)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
18:13:31
uow
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 156, in save_obj
18:13:31
base_mapper, states, uowtransaction
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 289, in _organize_states_for_save
18:13:31
mapper.dispatch.before_insert(mapper, connection, state)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__
18:13:31
fn(*args, **kw)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/events.py", line 621, in wrap
18:13:31
fn(*arg, **kw)
18:13:31
File "/app/dataservice/api/common/model.py", line 151, in register_indexd
18:13:31
return indexd.new(target)
18:13:31
File "/app/dataservice/extensions/flask_indexd.py", line 127, in new
18:13:31
resp.raise_for_status()
18:13:31
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
18:13:31
raise HTTPError(http_error_msg, response=self)
18:13:31
requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: https://data.kidsfirstdrc.org/index/index/
18:13:31
[2019-05-07 18:13:31 +0000] [49] [ERROR] Error handling request /genomic-files/
18:13:31
Traceback (most recent call last):
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
18:13:31
response = self.full_dispatch_request()
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
18:13:31
rv = self.handle_user_exception(e)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
18:13:31
reraise(exc_type, exc_value, tb)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
18:13:31
raise value
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
18:13:31
rv = self.dispatch_request()
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
18:13:31
return self.view_functions[rule.endpoint](**req.view_args)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 88, in view
18:13:31
return self.dispatch_request(*args, **kwargs)
18:13:31
File "/app/dataservice/api/common/views.py", line 194, in dispatch_request
18:13:31
resp = super(CRUDView, self).dispatch_request(*args, **kwargs)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/views.py", line 158, in dispatch_request
18:13:31
return meth(*args, **kwargs)
18:13:31
File "/app/dataservice/api/genomic_file/resources.py", line 111, in post
18:13:31
db.session.commit()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/scoping.py", line 157, in do
18:13:31
return getattr(self.registry(), name)(*args, **kwargs)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 921, in commit
18:13:31
self.transaction.commit()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 461, in commit
18:13:31
self._prepare_impl()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 441, in _prepare_impl
18:13:31
self.session.flush()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2192, in flush
18:13:31
self._flush(objects)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2312, in _flush
18:13:31
transaction.rollback(_capture_exception=True)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
18:13:31
compat.reraise(exc_type, exc_value, exc_tb)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
18:13:31
raise value
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2276, in _flush
18:13:31
flush_context.execute()
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
18:13:31
rec.execute(self)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
18:13:31
uow
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 156, in save_obj
18:13:31
base_mapper, states, uowtransaction
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 289, in _organize_states_for_save
18:13:31
mapper.dispatch.before_insert(mapper, connection, state)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__
18:13:31
fn(*args, **kw)
18:13:31
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/events.py", line 621, in wrap
18:13:31
fn(*arg, **kw)
18:13:31
File "/app/dataservice/api/common/model.py", line 151, in register_indexd
18:13:31
return indexd.new(target)
18:13:31
File "/app/dataservice/extensions/flask_indexd.py", line 127, in new
18:13:31
resp.raise_for_status()
18:13:31
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
18:13:31
raise HTTPError(http_error_msg, response=self)
18:13:31
requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: https://data.kidsfirstdrc.org/index/index/
18:13:31
During handling of the above exception, another exception occurred:
18:13:31
Traceback (most recent call last):
18:13:31
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
18:13:31
self.handle_request(listener, req, client, addr)
18:13:31
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
18:13:31
respiter = self.wsgi(environ, resp.start_response)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2309, in __call__
18:13:31
return self.wsgi_app(environ, start_response)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2295, in wsgi_app
18:13:31
response = self.handle_exception(e)
18:13:31
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1748, in handle_exception
18:13:31
return self.finalize_request(handler(e), from_error_handler=True)
18:13:31
File "/app/dataservice/api/errors.py", line 47, in http_error
18:13:31
code = e.code
18:13:31
AttributeError: 'HTTPError' object has no attribute 'code'
18:13:31
10.12.11.46 - - [07/May/2019:18:13:31 +0000] "POST /genomic-files/ HTTP/1.1" 500 141 "-" "curl/7.54.0"
dankolbman commented 5 years ago

Add fields that are in indexd to dataservice and remove indexd integration Migrate existing data from indexd onto the new fields in dataservice

Or just not. There's no great reason for the dataservice to reflect what indexd stores.

If we don't migrate back to the dataservice we'd have to instead update the portal ETL code to pull genomic files from indexd.

fiendish commented 5 years ago

we'd have to instead update the portal ETL code to pull genomic files from indexd

That is exactly what should happen (except replace "we" because we don't maintain the portal ETL).