vubiostat / redcapAPI

R interface to REDCap (http://www.project-redcap.org/)
21 stars 27 forks source link

Code Book Text Fields Not exporting as labels (exportRecordsTyped) #270

Closed spgarbet closed 1 year ago

spgarbet commented 1 year ago

The REDCap server support text entry from various code books

The new interface via exportRecordsTyped does not support having the labels for these fields come across. These code books that are utilized are huge and it would be better done on the server side.

Open Question: What is the interface design change to support it? The default cast functions should still work as intended (difficult!?)

I suspect it requires some tweaks to the round trip call options and recognition of the special case in the code without distrubing the other normal casting functions.

An example of meta data for a field:

select_choices_or_calculations: BIOPORTAL:RXNORM
file_type: text
spgarbet commented 1 year ago

@nutterb This is a bit of a rush job on our end (desired before Thanksgiving break for a clinical trial analysis). I'd like your opinion and we can allocate some resources on our end if needed for implementation.

spgarbet commented 1 year ago

@obregos Can you set up the sandbox REDCap project to have one of these fields and create a few records to play with?

obregos commented 1 year ago

Yes, I've added the field daily_med and some test records: https://redcap.vanderbilt.edu/redcap_v13.11.1/DataEntry/index.php?pid=167416&page=survey_form&id=4&event_id=436738&instance=1&msg=edit

nutterb commented 1 year ago

My first thought is... Yikes! :)

This is effectively the same problem we face with sql fields. I point this out because whatever we do here might be something we can adapt to those fields.

I can think of two ways to attempt this. For both of them, I would recommend

  1. When identifying field types, these text fields should receive a field type of bioportal
  2. Add a bioportal field type to the casting list (we don't actually have a default casting for text fields, so giving a casting override to these fields will a nuisance to the user if it affects all of the other unvalidated text fields, too).

Option 1: Pass through and Call API

In this option, we let the bioportal fields pass through casting untouched. After the casting step, make a separate call to the API for the labeled output. Merge the labeled data back into the data and remove the coded values.

This is probably the easier solution on its face, and I would guess more efficient than the next option (especially if there are multiple bioportal fields in a project). BUT, it's going to be harder to negotiate things like castLabel vs castLabelCharacter and matching the style output to the user request.

Option 2: Focus on getCodings

In this option, when you are building the codings index inside exportRecordsTyped, for a bioportal field, you make the necessary API calls to get the coded and labeled values. Theoretically, we have the coded values already in the Raw data, though I don't recall if we pass those to getCodings. So we either have to feed getCodings the raw values and match them to the labeled values from the second API call, or make two more API calls to get the coded and labeled values.

Either way, it's at least one extra API call per bioportal variable (option 1 could get the labeled values from all bioportal fields in a single call), which is going to be a hit to efficiency. I'm guessing most users will accept the extra delay to be able to get the labeled values without any effort on their part, so it's likely not a big concern.

The bigger advantage to us is that if we focus on pulling the codings, we don't have to make any tweaks to the casting functions, and castCode, castLabel, castCodeCharacter, and castLabelCharacter will all return the desired types.

I lean a bit toward this idea, but will caution that it may be a day or two slower to try to develop than Option 1.

nutterb commented 1 year ago

If going with option 2, these are the places where development would need to focus. On a quick scan, it looks like the changes would integrate seamlessly into recastRecords and castForImport. I can't say that for Option 1, which kind of seals the deal for me. I recommend Option 2.

.castRecords_getFieldtypes: https://github.com/vubiostat/redcapAPI/blob/main/R/fieldCastingFunctions.R#L594

.castRecords_getCodings: https://github.com/vubiostat/redcapAPI/blob/main/R/fieldCastingFunctions.R#L634

spgarbet commented 1 year ago

I like option 2 as well. I had both ideas as a vague notion. I had a third fleeting idea that we pull the codebooks into the library, then I saw the size of them and noped out of that one. The downside of option 2 is castForImport will only know about types that have been used in the data. I think this is a fair trade off, we can throw an error if a user provides a value that the library has not seen and hopefully point to documentation that shows the problem. I think this is most useful on the export side and don't think it will be a big issue on the castForImport side. The performance hit is worth the gain.

nutterb commented 1 year ago

It occurred to me that what we are scheming is not going to work for offline connections.

If we write a method that exports the mappings and stores caches them on the connection object, there is a good possibility of extending it to offline connections without altering our plans too much.

Big question: does this need to be available to offline connections before the Thanksgiving deadline? Or can we assume API access and figure out the offline piece later?

spgarbet commented 1 year ago

Offline isn't needed. Let's treat that as a separate enhancement. Should at least have a decent error message for that.

nutterb commented 1 year ago

I'd like to have someone look over https://github.com/vubiostat/redcapAPI/tree/issue-270-bioportal and give it a go on some more robust data.

Things that haven't been done

  1. unit tests for exportExternalCoding
  2. unit tests for externalCoding caching
  3. Test that warning is displayed for offlineConnections with bioportal fields
  4. Bioportal field casting for castLabel, castCode
  5. Better documentation of the bioportal field type in fieldValidationAndCasting or exportRecordsTyped (I'm not sure where it fits best)

I have a feeling that an error could occur if the bioportal codings get cached and then a either a field name changes or a new bioportal field is added and the cache isn't refreshed. It could be either an error or it may just silently skip the field. I haven't spent much time exploring that.

I set a default of batch_size = 1000 in exportBioportalCodings. I suspect it will be used internally most of the time and was worried that setting the batch size would be inconvenient. That seemed like a reasonable default for now and we can work out if it needs to be managed differently later.

And if this satisfies the need, the future offlineConnection development will need

Last bit of good news: this will definitely work with SQL fields.

spgarbet commented 1 year ago

The code looks sound. I like the encapsulation of concern.

I have a feeling that an error could occur if the bioportal codings get cached and then a either a field name changes or a new bioportal field is added and the cache isn't refreshed. It could be either an error or it may just silently skip the field. I haven't spent much time exploring that.

From the code reference side this happens rarely. From the REDCap project side which we're relying on, this is a limitation. I don't think there is a complete solution.

I set a default of batch_size = 1000 in exportBioportalCodings. I suspect it will be used internally most of the time and was worried that setting the batch size would be inconvenient. That seemed like a reasonable default for now and we can work out if it needs to be managed differently later.

Would it be possible to use the batch_size specified by the user?

Offline pieces can be a separate issue.

:100: on the SQL solution.

nutterb commented 1 year ago

I have a feeling that an error could occur if the bioportal codings get cached and then a either a field name changes or a new bioportal field is added and the cache isn't refreshed. It could be either an error or it may just silently skip the field. I haven't spent much time exploring that.

From the code reference side this happens rarely. From the REDCap project side which we're relying on, this is a limitation. I don't think there is a complete solution.

Realistically, this shouldn't happen in a production project. it's also a problem that goes away as soon as you restart the session or reinitialize the connection (or any action that flushes the cache). I haven't explored it or tried to make it happen. Just making it known that it will be extremely confusing when it inevitably pops up.

I set a default of batch_size = 1000 in exportBioportalCodings. I suspect it will be used internally most of the time and was worried that setting the batch size would be inconvenient. That seemed like a reasonable default for now and we can work out if it needs to be managed differently later.

Would it be possible to use the batch_size specified by the user?

I guess? exportExternalCodings (I renamed it to feel applicable to SQL as well) isn't directly called by the user and the only place it currently gets called is within the caching methods. So I think it would end up looking like rcon$externalCoding(batch_size = batch_size_from_exportRecordsTyped). Feels weird. But theoretically doable.

spgarbet commented 1 year ago

Caching inevitably leads to confusion at some point. I think it's the inherited downside.

nutterb commented 1 year ago

Well, this is interesting...

I tried uploading data to a bioportal field without creating the field in the UI. That is

I wasn't getting the results I expected. Both labels and codes were coming back as the coded values. So I went to the UI to see what it looked like there. As soon as I opened the form in the UI, labels started getting associated.

In this step here, you'll see record 3 does not have a label for bioportal_test

> Rec <- exportRecordsTyped(rcon, 
+                           fields = "bioportal_test", 
+                           cast = list(system = castRaw))
> Rec[c("record_id", "redcap_event_name", "bioportal_test")]
   record_id redcap_event_name
1          1     event_1_arm_1
2          2     event_1_arm_1
3          3     event_1_arm_1
4          4     event_1_arm_1
5          5     event_1_arm_1

                                           bioportal_test
1                   abatacept Prefilled Syringe [Orencia]
2                                tripeptide-10 citrulline
3                                                  217928  # <-------------------- SEE THIS HERE
4  shad scale pollen extract 50 MG/ML Injectable Solution
5                                            Noxafil Pill

Now I go to the form for record 3 and after viewing it in the form, I run the following. Notice that the bioportal_test value is now the labeled value for record 3.

> # Now I go and open record 3 in the UI
> rcon$refresh_externalCoding()
> Rec <- exportRecordsTyped(rcon, 
+                           fields = "bioportal_test", 
+                           cast = list(system = castRaw))
> Rec[c("record_id", "redcap_event_name", "bioportal_test")]
   record_id redcap_event_name
1          1     event_1_arm_1
2          2     event_1_arm_1
3          3     event_1_arm_1
4          4     event_1_arm_1
5          5     event_1_arm_1

                                           bioportal_test
1                   abatacept Prefilled Syringe [Orencia]
2                                tripeptide-10 citrulline
3                                              Kristalose # <--------------- AND NOW IT'S DIFFERENT
4  shad scale pollen extract 50 MG/ML Injectable Solution
5                                            Noxafil Pill

This doesn't bode well for automated testing. I'm going to skip testing the output from these and ask your group to try it out on a project that has some bioportal fields. I'll write any other tests I can. If anyone can come up with an idea of how to get this working with imported data, I'll come back to it.

spgarbet commented 1 year ago

This indicates that the association to the code books is happening in javascript of the form in REDCap. Not on the API export. Which is weird because it always stores the raw value. The export mapping doesn't occur unless it was opened in a web form?

nutterb commented 1 year ago

Yeah, it's weird.

As far as I can tell, any record that I've opened and viewed in the UI at least once will provide labels even if I alter the underlying code via importRecords.

But importing a new field and populating the data is not cooperating.

I've added the field type and the data to the testing data. I'm about to run the test suite. Once I push my changes, there will be something for experimentation. I'm hoping this doesn't turn into an issue for the project that needs this soon.

nutterb commented 1 year ago

If I'm extrapolating from this correctly, it does appear to be javascript thing where, without the UI interaction, the label for the code doesn't get cached on the local REDCap instance.

Change/improvement: Users are now permitted to import data values for Biomedical Ontology fields. In previous versions, this was not allowed and would return an error message when a user attempted this. The Data Import Tool now displays a warning (instead of an error) that informs users that importing data for such fields is allowable but is not recommended because the value might not display correctly if viewed afterward on a report or in the data entry interface, in which this is caused by the fact that the label is missing because it has not been fetched via the BioPortal web service and then cached in REDCap's database tables. (Ticket

8096)

Maybe the impact could be mitigated by having the institution cache values preemptively? I'm not sure what the terms are with the BioPortal service, but this appears to be out of our hands. I've got tests done and check is passing, so I'm going to go ahead and push it for review.

spgarbet commented 1 year ago

I'm going to open a support request on the mapping not happening.