Illegal data dictionaries

berryld1 commented 9 months ago

(Hmm, the HTML tags disappear...) Hey! Me again.

I know we've covered this in various fashions - the issue of illegal characters in data dictionaries, which results in a metadata export of "NA", which then makes it impossible to import or export any data (no fields are recognized as being valid fields).

Two of my projects that run automated ETL scripts broke this morning - and I think must be due to a dd update that introduced illegal characters, because when I run locally, I get "NA" data dictionary out.

I did not design nor do I maintain these redcaps, so, at the mercy of the client as far as that's concerned.

For both of the projects that broke, I ran them through cleanseMetaData() to get the clean DD, then diffed that with the original DD to find which fields R changed.

In one project, just one field.

In the other project, nearly 100 fields.

I'm sending these back to the client to fix, but hoped you guys might have some guidance.

The common theme always seems to be HTML tags and a space.

So, the "dirty" dd will have something like:

(I don't know how to make the HTML tags not disappear, so I'm going to use left sideways carat = ( and right sideways carat = ).)

"blah-blah (p) (/p) blah-blah"

and the corresponding clean dd will be:

"blah-blah (p)(/p) blah-blah"

So, the only change is closing that space between (p) and (/p)

So, that's pretty horrible to have to review and fix, ESPECIALLY because you can't always just close the space, for example:

"you must (em)always(/em)", and R wants to do:

"you must(em)always(/em)".

With that background, finally, my question: Do you think this is maybe non-breaking white space that R doesn't like? Or, if not, what? A legit space does not seem like it should wreak so much havoc. Any suggestions for how to avoid it? Would R be happier (or less happy) with & n b s p ; so that the non-breaking white space (if that's what it is) is just some characters vs. some weird non-character thing invisible to the naked eye?
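
One way to check whether the "space" really is a non-breaking space (or something else) is to list the code point of every character that is not plain printable ASCII. This is a minimal sketch in Python, independent of R and redcapAPI; the function name find_suspect_chars and the sample label are made up for illustration.

```python
import unicodedata

def find_suspect_chars(text):
    """Return (index, code point, Unicode name) for every character that is
    not printable ASCII (space through tilde), tab, CR, or newline."""
    hits = []
    for i, ch in enumerate(text):
        if ch in "\t\n\r" or " " <= ch <= "~":
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical field label with a non-breaking space hidden before <em>
label = "you must\u00a0<em>always</em>"
for hit in find_suspect_chars(label):
    print(hit)
```

Running something like this over each label in the exported data dictionary would show whether the offending character is U+00A0 or something else entirely.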

couthcommander commented 9 months ago

You can type less than (<) or greater than (>) by typing ampersand followed by "lt;" or "gt;".

My guess is the "space" is not a space, and you'd need to ask the client what they intended for the ~100 fields.

I'm not sure what you mean by R wants to do "you must<em>always</em>". Are you writing code that converts the original to that? We might be able to find a regular expression that would work, though I still suspect you don't actually want to just remove the "space".

berryld1 commented 9 months ago

Thanks, Cole.

Wow, sometimes I'm just astonished / appalled at how my brain works in silos. I suddenly don't know the name for the "greater than" / "less than" symbols because they're in the context of an HTML tag.

Anyhow! Let me try to clarify.

When data dictionaries have characters in them that R doesn't like, the data dictionary export fails, and you get something like this:

This is true whether you use exportMetaData() or what the redcapConnection gets out and caches:

The practical consequence of this is that redcapAPI is now dead in the water - absolutely everything it does validates through the metadata, so if you can't get the metadata out, you can't do anything.

To find what R / redcapAPI doesn't like, you can pull the DD via the GUI and run cleanseMetaData(). I have this pipelined in a script to then diff the cleaned against the dirty. And these are the sorts of differences I see: oh! here's one that doesn't involve an html tag at all:

here's more the thing I typically see:

So, for real, no, there's not supposed to be anything here but for the space - but something happens somewhere between Excel (dd) - redcap - redcapAPI/R - where an innocent space becomes a character that is:

invisible to the naked eye
breaks redcapAPI
can not be addressed simply by running the dd through cleanseMetaData() because in many cases, a real honest-to-goodness space should be there
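
The dirty-versus-clean diff described above could look roughly like the following. It is a sketch in Python rather than the actual R pipeline, and it assumes both dictionaries are CSV text sharing a field_name key column; the column names and sample rows are made up.

```python
import csv
import io

def changed_cells(dirty_csv, clean_csv, key="field_name"):
    """Compare two data dictionary CSV texts row by row (matched on `key`)
    and return (field, column, dirty value, clean value) for each differing cell."""
    dirty = {row[key]: row for row in csv.DictReader(io.StringIO(dirty_csv))}
    clean = {row[key]: row for row in csv.DictReader(io.StringIO(clean_csv))}
    diffs = []
    for field, dirty_row in dirty.items():
        clean_row = clean.get(field, {})
        for column, dirty_value in dirty_row.items():
            clean_value = clean_row.get(column)
            if dirty_value != clean_value:
                diffs.append((field, column, dirty_value, clean_value))
    return diffs

# Made-up two-field dictionaries; the dirty one hides U+00A0 in one label
dirty = "field_name,field_label\nage,Age\nnote,you must\u00a0<em>always</em>\n"
clean = "field_name,field_label\nage,Age\nnote,you must<em>always</em>\n"
for field, column, before, after in changed_cells(dirty, clean):
    # repr() makes the invisible character show up as an escape sequence
    print(field, column, repr(before), repr(after))
```

Printing with repr() is the useful part: the character that is invisible in a spreadsheet or browser appears as an escape such as \xa0.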

I should further add - this is NOT happening in the redcap API (the API). If I run the call in the API playground, I get the DD out. And there you can see the characters that are probably causing the problem, which redcap renders as diamond-?:

The same thing happens sometimes with data - I'm sometimes unable to get some records out using exportRecordsTyped(), but if I write the API call natively, I can get the record out.

So, when redcapAPI (the R package) hits these characters in any context, it just craps out - which, my understanding from Shawn is that's because R doesn't know how to handle them. So, when I say "what R wants to do", what I really mean is, what cleanseMetaData() wants to do to make R happy.

Here's a good example where you can see it's unlikely this illegal character is intended to mean anything - the dd designers just want a space there. Which is why I say there seems to be part of this that's an Excel - redcap thing - somehow, a dd submitted with an innocent space, redcap interprets as uninterpretable, and then passes that problem along to R.

But we can't control the Excel-redcap side of things. And if we tell redcap we can't get these dds or records out, they'll say, we can, because you can, if you write the call natively. You can't with redcapAPI package because there's a lot more stuff going on there, where R actually has to handle the data/metadata, and refuses to do it with illegal characters.

Does that help?

I know I keep asking this question in different ways and different contexts - because I keep running into it.

And I'm not sure it's a practical answer for me to send a diff file of 100 fields to the client and say - "figure out how to make this not have illegal characters!"

@spgarbet - would it be possible to have cleanseMetaData replace an illegal character with a space, instead of empty string? Not sure if redcap would turn right around and make it back into an illegal character, that part is utterly mysterious.
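
To illustrate the request (replace with a space rather than an empty string), here is a minimal Python sketch. It is not how cleanseMetaData works, and the set of characters treated as "should have been a space" is an assumption; the name to_plain_spaces is hypothetical.

```python
import re

# Characters treated as "should have been a space" in this sketch (an assumption):
# U+00A0 no-break space, U+2007 figure space, U+202F narrow no-break space,
# U+FFFD replacement character (often drawn as a diamond with a question mark).
SUSPECT = re.compile("[\u00a0\u2007\u202f\ufffd]+")

def to_plain_spaces(text):
    """Replace each run of suspect characters with a single ordinary space."""
    return SUSPECT.sub(" ", text)

print(to_plain_spaces("you must\u00a0<em>always</em>"))
```

With this approach "you must" and the tag stay separated by a real space, instead of being run together as they are when the character is simply dropped.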

oh! It suddenly occurs to me - likely, a lot of these fields are originally built in the designer, in "rich html" mode, and that's what creates these weirdnesses, vs. a space in an excel file suddenly turning into a computing pariah. I don't know, I guess.

couthcommander commented 9 months ago

Perfect, thank you for clarifying. You sound spot on, this is likely a consequence of using the designer, the intention is that this should be a space, and cleanseMetaData should handle this better.

Edit: I forgot that cleanseMetaData is being discontinued, so this will likely be solved elsewhere.

spgarbet commented 9 months ago

Is this with version 2.8.4? The new version should be filtering those out cleanly.

berryld1 commented 9 months ago

Yes, 2.8.4

"Filtering those out cleanly" = filtering what out?

spgarbet commented 9 months ago

The weird characters. They should come over in the meta data without issue.

I'm trying the copy you gave me and having no issues with exporting.

spgarbet commented 9 months ago

You were installing direct via devtools::install_github. I think you have an interim version before all the bugs were ironed out. Could you do me a favor and reinstall? The current version on CRAN is the latest.

berryld1 commented 9 months ago

Yes, I will reinstall. However, this complicates assessment of the issue:

In other words - I, too, can get the DD out of the dev copy. I've now downloaded-uploaded PROD to DEV three times to make sure I didn't accidentally load back the DEV dd or forget to click "Commit" or whatever. Same result each time.

Which makes the solution for Anne (possibly) simple - download-upload the DD (??)

But makes testing difficult because I can't give you access to PROD. Anne is very strict about this.

berryld1 commented 9 months ago

Hmm, usually it tells me "MD5 sums checked and downloaded binary packages blah-blah" or something like that: