ropensci / qualtRics

Download ⬇️ Qualtrics survey data directly into R!
https://docs.ropensci.org/qualtRics

Add either a function or a data.frame in the package to get the list of respondent metadata columns #272

Open chrisumphlett opened 2 years ago

chrisumphlett commented 2 years ago

I would like to be able to have the package provide a global list of the 17 standard metadata columns (plus browser meta info which comes through like response data but should be treated like metadata IMHO), rather than me having them hardcoded. Perhaps others have them hardcoded too; if/when Qualtrics adds another, we'll all need to update our list. If the package holds it then we could automagically have it update whatever we're doing in our code.
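For context, a sketch of what such an exported object might look like. The column names below are the standard respondent-metadata columns as they appear in a Qualtrics CSV export; the object name `qualtrics_metadata_cols` is hypothetical, not something the package currently exports:

```r
# Hypothetical exported object (NOT currently in qualtRics): the 17
# standard respondent-metadata columns from a Qualtrics CSV export.
qualtrics_metadata_cols <- c(
  "StartDate", "EndDate", "Status", "IPAddress", "Progress",
  "Duration (in seconds)", "Finished", "RecordedDate", "ResponseId",
  "RecipientLastName", "RecipientFirstName", "RecipientEmail",
  "ExternalReference", "LocationLatitude", "LocationLongitude",
  "DistributionChannel", "UserLanguage"
)

# If the package exported this, dropping metadata would be a one-liner:
# dplyr::select(survey_df, -dplyr::any_of(qualtrics_metadata_cols))
```

If Qualtrics ever adds an eighteenth column, only the package would need updating, not every downstream script.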

Why, you might ask? Here's a 3:42 video with context on why I have a hardcoded list today, and why you might want to have one as well.

re: the missing metadata I refer to in the video, I was wrong. Qualtrics labels it as "Response Type" but the column is "Status" and that is included.

I can work on this but am curious to hear how those of you who have managed the package long-term feel about including this at all, and if so, whether you'd want it as a function, data.frame, or something else.

Relevant links from Qualtrics' docs:

juliasilge commented 2 years ago

I could see that being useful! I lean toward adding it as a dataframe? Could there be any automated way for us to regularly check (like with CI) that the info is not outdated?

chrisumphlett commented 2 years ago

Maybe from this? https://api.qualtrics.com/02a178db2ab5b-get-schema-response [image]

Perhaps it could instead be a way to get all of the "dataTypes" for a survey, rather than just a list of metadata. It even has the display-order stuff. But the browser meta info isn't included.

chrisumphlett commented 2 years ago

Also response quality columns: https://www.qualtrics.com/support/survey-platform/survey-module/survey-checker/response-quality/

e.g.,

qualtrics_data_quality_cols <- c(
  "q_datapolicyviolations", "q_recaptchascore", "q_relevantidduplicatescore",
  "q_ballotboxstuffing", "q_ambiguoustextpresent", "q_unansweredPercentage",
  "q_unansweredquestions", "q_straightliningCount", "q_straightlining_Percentage",
  "q_straightliningquestions"
)

jmobrien commented 2 years ago

Hi @chrisumphlett, just saw your video. You got one thing wrong--you said it was "okay" that requests to exclude the metadata and questions didn't actually exclude them. It wasn't okay--it was my mistake.

Sorry about that! But I think you had the right idea about an approach, and I think it should work now.

(FYI @juliasilge requests for "exclude all types of this variable" specifically were being stripped out of the request body due to a subtlety I didn't catch, and didn't properly write a test for. Fixed now, plus added a suitable test.)

So, @chrisumphlett, if you want just the metadata & embedded data for one table, you should be able to do it this way:

fetch_survey(
  "[surveyid]",
  include_questions = NA  # NA = exclude all question columns
)

Similarly, for just the questions (keeping response ID to pivot/merge on):

fetch_survey(
  "[surveyid]",
  include_metadata = "ResponseId",  # keep only the ResponseId key
  include_embedded = NA             # NA = exclude all embedded data
)
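If useful, the two results can then be stitched back together on ResponseId. A sketch, with toy data frames standing in for the two fetch_survey() outputs:

```r
# Toy stand-ins for the two fetch_survey() results described above.
meta_part <- data.frame(
  ResponseId = c("R_1", "R_2"),
  StartDate  = c("2023-01-01", "2023-01-02"),
  ed_source  = c("email", "web")  # an embedded-data field
)
question_part <- data.frame(
  ResponseId = c("R_1", "R_2"),
  Q1 = c(5, 4),
  Q2 = c("yes", "no")
)

# Recombine on the shared ResponseId key.
full <- merge(meta_part, question_part, by = "ResponseId")
```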

Does that look right for what you need?

I still tend to agree with your thought that having something that more automatically stays up-to-date with the API schema would be better, but hopefully this can serve well enough for now.

chrisumphlett commented 2 years ago

Thanks Jim. This is definitely an improvement, though not perfect (through no fault of yours). Qualtrics' data model is not well aligned with the categories available. When I run the call to exclude questions, I get some things I don't want, and don't get things I do want.

Things that are included

Things not included

This stuff can be programmed around. I'm doing this with the browser fields now, looking for questions that end in _browser. The same could be done to find *Topics and then re-assign those as response data.

It's a strange, arbitrary division that Qualtrics makes. NPS_GROUP is another calculated column, and that one comes through with question data. Same for all of the *DO* display order fields. Why should those and topics/sentiment/translations be treated differently?
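The name-based workaround described above can be sketched like this. The patterns are illustrative (fields ending in "_browser" treated as browser meta, Text iQ "Topics" fields re-assigned as response data); real surveys may need more patterns, and the classify_col() helper is hypothetical:

```r
# Classify columns by name pattern, per the workaround above.
# Patterns are illustrative, not exhaustive: "_browser"-suffixed fields
# are browser meta; "Topics"-suffixed fields are Text iQ output.
classify_col <- function(nm) {
  ifelse(grepl("_browser$", nm), "browser_meta",
    ifelse(grepl("Topics$", nm), "response_data", "other"))
}

classify_col(c("Q1_browser", "Q2", "Q3 - Parent Topics"))
```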

chrisumphlett commented 2 years ago

Here's a 2.5m video that shows what I ultimately settled on with several hardcoded lists and the metadata() function.

juliasilge commented 2 years ago

Let's reopen this and still consider if we should add something like a dataframe of columns.

jmobrien commented 2 years ago

@chrisumphlett, I didn't notice the music in the last video until I realized I was grooving a bit by the end. Thanks for that.

You're describing things that closely overlap with what I dealt with in my own big project, so I can sympathize. The trouble with both the browser meta items and the timing items is that they are types of questions in the Qualtrics sense--items that go in a block on the main page (using the web interface). So I don't think we can do anything about that via tweaks to the API requests. Same for display-ordering items, which despite being metadata are closely linked with their questions.

I actually worry about metadata() since it's an old function that looks at the V2 API, and it isn't always accurate. (IIRC one example is that if an embedded data field is specified in a way that it never receives data, it will be listed in metadata() but won't be present in the response download, at least by default.) metadata() is likely to be deprecated here eventually, and Qualtrics will probably kill that endpoint even if we don't. Unfortunately its replacement fetch_description() doesn't contain the same embedded data element.

We should think about this more, but like you I ended up setting up a programmatic solution. I will say that was why I added extract_colmap(), which ended up being the "dataframe of columns" that gave me an approach that worked consistently across all columns.

Below is one from my API testing survey. The key tool for me was the "ImportId" column, which is non-user-editable and thus far more standardized. I added in browser meta, timing, topics, quality checks & scoring to see:

[screenshot: column map from the API testing survey]

Here's how they break out via API params:

Metadata (same as always, looks pretty fixed): [image]

Questions (all QID, FL, BL; almost, see next): [image]

Embedded (embeddeds, quality, scores, & topics; Topics breaks the pattern for QIDs): [image]

jmobrien commented 2 years ago

So, anyway, one option I'm thinking about now that I've written this: we could apply this logic to the code that creates the column mapping, adding another column that labels variables by type based on "ImportId". That would make filtering like this a lot more straightforward. It wouldn't be too hard--basically something I did before for myself.
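A rough sketch of that idea, run against a mock of an extract_colmap()-style data frame. The ImportId values and the prefix-based mapping below are inferred from the screenshots above and are assumptions, not the package's actual logic; the "type" column is the hypothetical addition:

```r
# Mock of an extract_colmap()-style data frame. Real ImportIds come from
# the survey schema; these values and patterns are inferred, not authoritative.
colmap <- data.frame(
  qname    = c("StartDate", "Q1", "ed_source", "Q1_DO"),
  ImportId = c("startDate", "QID1", "ed_source", "QID1_DO")
)

# Hypothetical type-labeling column. QID-prefixed ImportIds (including
# display-order fields, which share the QID prefix) count as questions;
# known fixed ImportIds count as metadata; everything else as embedded.
metadata_ids <- c("startDate", "endDate", "status", "ipAddress",
                  "progress", "duration", "finished", "recordedDate")
colmap$type <- ifelse(grepl("^QID", colmap$ImportId), "question",
                 ifelse(colmap$ImportId %in% metadata_ids, "metadata",
                        "embedded"))
```

Filtering a download to one variable type then becomes a simple subset on the column map rather than a hardcoded name list.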

I could see edge cases we'd need to address (e.g., I've encountered embedded data fields labeled as "status" so you'd be facing duplicates). But I think it's something we could work around (we already know what metadata columns the user requested for any given download).

Still might be good if there's a way to ensure this stays up-to-date with whatever Qualtrics does for the API, but that seems like a separate question.

jmobrien commented 2 years ago

Caveat - we don't know what the user requested if they just call read_survey(). I'm unclear how often that actually happens, though, even though the read_survey() function is exposed.

chrisumphlett commented 2 years ago

re: the music. Using my company's video editor, Camtasia, you can add an "emphasize audio" effect to your screen recording and it automatically suppresses other audio to a good background level. Glad you enjoyed :)

I asked our customer researcher to not use status as an embedded data field after that caused a headache. It's another weird Qualtrics thing where the description ("Response Type") doesn't match the question name.

I agree w/you that Qualtrics treats the browser fields as questions because they are blocks in the survey flow. My philosophical objection, I guess, would be that it shouldn't be added to the survey flow in the first place. Perhaps there's some reason in their view that it needs to be done that way.

I don't think I have a good perspective on balancing usability for the broad user base of the package and these types of power user functionality. For me, the ability to get the embedded data dynamically has made a significant improvement in my process in combination with the static lists for respondent metadata, browser, and timing.

jmobrien commented 2 years ago

Thanks. Generally agree; separating out variable types was a really important part of my processes as well.

We might eventually want to think about something better for handling duplicated names, including things like status.