mdsol / rwslib

Provide a (programmer) friendly client library to Rave Web Services (RWS).
MIT License

Performance and Load Tips #68

Open ghost opened 8 years ago

ghost commented 8 years ago

Do you have any tips or ideas concerning the various possible data consumption patterns enabled by RWS?

For example, I am faced with 10 distinct eCRFs (forms) that contain data relevant to me. They are similar, structurally, but not identical. To process the data I need, I could:

I think any of those patterns could be made to work for me, though maybe some are more elegant than others, but I'd like to have minimal impact on the environment. I'm given to understand the RWS service uses the same computing resources as the regular browser UI, and I cannot take away any responsiveness for the end users. I'm unaware of any benchmarks about how RWS use affects the users though. Still you can imagine that if we have 100 subjects, each of which has, possibly, 10 forms worth of data, you end up with quite a few service calls in a fairly short time.

My guess is that using the FormDataRequest is going to be most efficient, probably with the XML output, given that this pattern appears to use the fewest service calls. There's a lot I don't know about Medidata internals, caching and optimization, so I could easily be wrong (e.g. the database indexing might not support form-oriented lookups as well as subject-oriented lookups, so the form lookups end up being much more expensive).

Any advice? I can't necessarily just delay until the dead of night... OR, is this all unnecessary fretting? Should I stop worrying as long as I keep the service calls down to a dull roar?

isparks commented 8 years ago

Hi John. A bit of background first. Rave is primarily a transactional system with what I would describe as a follow-along reporting subsystem called Clinical Views. The Clinical Views system scans for changes in the transactional tables and pulls them into a de-normalized reporting table. The lag between data entry and availability for reporting is generally low, but if there is a lot going on you can get longer delays, up to hours. The services you are talking about all read from these clinical views. A header included with any response tells you the last time the views were updated, which can help with scheduling further calls. For example, if you see that a clinical view has not been updated in 4 hours, you might choose to repeat the request at intervals until you get more up-to-date data.
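As a rough sketch of that scheduling idea, the check below decides whether a response looks stale enough to warrant a retry. The header name `X-Clinical-View-Last-Updated` and the ISO 8601 timestamp format are assumptions made for illustration only; inspect the headers of a real RWS response to find the actual name and format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical header name, NOT taken from the RWS documentation.
LAST_UPDATED_HEADER = "X-Clinical-View-Last-Updated"


def view_is_stale(headers, max_age, now=None):
    """Return True if the clinical view timestamp is older than max_age.

    headers: mapping of response header names to values.
    max_age: timedelta giving the oldest acceptable view refresh.
    now:     override for the current time (useful in tests).
    """
    now = now or datetime.now(timezone.utc)
    raw = headers.get(LAST_UPDATED_HEADER)
    if raw is None:
        # No timestamp available: conservatively treat the data as stale.
        return True
    # Assumes an ISO 8601 value such as "2016-05-01T08:00:00+00:00".
    last_updated = datetime.fromisoformat(raw)
    if last_updated.tzinfo is None:
        # Assume UTC when the value carries no offset.
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    return now - last_updated > max_age
```

A caller could repeat the request after a delay whenever `view_is_stale(...)` returns True, up to some maximum number of attempts.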

Since all the approaches you have suggested read from these clinical views, the load on Rave should not be significant: you are making requests against what are essentially materialized views in the database, i.e. pre-computed extract tables. These views are organized by form, so a subject-level or study-level request has to do a lot more joins. That said, for a very large study, requesting all data for a particular form in a single hit could get you a timeout if it takes more than 1 hour to stream the data. If you think the studies won't become so large that you'll hit this limit, then by-form may be the most efficient. Otherwise by-subject would likely be safer. You can also request by subject/form combination, but that may leave you with a lot of requests.
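To put rough numbers on the tradeoff, here is a small sketch that counts the requests needed for one full pass under each pattern, using the 100 subjects and 10 forms from the question. The function and key names are made up for illustration; the counts say nothing about how expensive each individual request is on the server.

```python
def requests_per_pass(n_subjects, n_forms):
    """Count the service calls needed to read every form for every subject once."""
    return {
        "by_form": n_forms,                            # one call per form
        "by_subject": n_subjects,                      # one call per subject
        "by_subject_and_form": n_subjects * n_forms,   # one call per pair
    }


counts = requests_per_pass(n_subjects=100, n_forms=10)
for pattern, n in counts.items():
    print(f"{pattern}: {n} requests")
```

With these figures the by-form pattern needs 10 calls, by-subject needs 100, and the subject/form combination needs 1000.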

There are ways to get just the incremental changes since a given date/time, but the client has to set up their clinical view configuration in a certain way, and that is not guaranteed, so I would not rely on it.

Lastly, you could combine the ClinicalAuditRecord dataset with these requests. You'd poll the ClinicalAuditRecords service for data changes (see the AuditEvent sub-project in the extras directory of rwslib) to detect changes in subjects/forms, and then on your polling interval request only those forms/subjects that you know have changed, which could reduce the total number of requests. Bear in mind my caveat about the update timestamps on the clinical views: the audits will be written before the clinical views reflect those changes.
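A minimal sketch of the bookkeeping side of that idea follows. It assumes each audit event has already been reduced to a plain dict with `subject` and `form_oid` keys; that shape is hypothetical and is not the API of the AuditEvent sub-project, so the parsing step would need to be adapted to whatever that code actually yields.

```python
from collections import defaultdict


def changed_forms_by_subject(audit_events):
    """Group audit events into {subject: {form_oid, ...}} for targeted re-requests.

    audit_events: iterable of dicts with hypothetical keys
                  "subject" and "form_oid".
    """
    changed = defaultdict(set)
    for event in audit_events:
        subject = event.get("subject")
        form = event.get("form_oid")
        if subject and form:
            # Repeated audits for the same subject/form collapse to one entry.
            changed[subject].add(form)
    return dict(changed)


# Hypothetical events, for illustration only.
events = [
    {"subject": "001", "form_oid": "VITALS"},
    {"subject": "001", "form_oid": "VITALS"},
    {"subject": "001", "form_oid": "AE"},
    {"subject": "002", "form_oid": "AE"},
    {"subject": "003"},  # subject-level audit with no form: skipped
]
to_fetch = changed_forms_by_subject(events)
```

Because the audits are written before the clinical views catch up, a request made immediately after detecting a change may still return the old data, so pairing this with a check on the view's last-updated time is probably worthwhile.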

But my advice overall, from knowledge of how the system works rather than really extensive experience of using these various services, is to do by-form if you know data volumes are not hundreds of thousands of records or by subject otherwise.

I know the above doesn't give you a definitive answer, but hopefully it helps you understand the risks and tradeoffs.

vagarwal77 commented 3 years ago

I am looking for sample filled-in eCRFs (forms) for reference and better understanding. Please suggest where I can get them.

iansparks commented 3 years ago

@vagarwal77 a Google image search for "medidata rave eCRF" will provide a lot of example screenshots. I don't think we can help you further than that.