analyze long format data

alexholcombe commented 8 years ago

Support data in long format, not just wide format (if I'm correct that long format is not supported). This would allow easy use of R with the same dataset (e.g., my PhD student put the data in wide format for JASP, and we then had to melt it into long format to do the R analyses).

FransMeerhoff commented 7 years ago

Hello @alexholcombe

Thank you for your suggestion. We are indeed considering this, but we have not yet planned this issue for the near future. For Bayesian analyses, supporting long format can lead to poor performance.

How important is it for you, since a conversion before loading a file into JASP is possible?

Kind regards

alexholcombe commented 7 years ago

Not important for me personally, thanks.

EJWagenmakers commented 7 years ago

Can we increase the priority level for this request? (I suggest "high") Many people have data in long format and having to recode the formats manually is a drag. E.J.

AlexanderLyNL commented 7 years ago

Sure, it would be good if we have some use cases that we can work with. I know who I can allocate this to.

lindeloev commented 6 years ago

Here's a use case that I often encounter with my students who do intervention research: RM-ANOVA on RCT-like studies. For this setup, there is a within-subjects factor called session (pretest vs. posttest) and a between-groups factor called group (treatment vs. control) for a particular outcome. The session x group interaction effect is of primary interest, i.e. whether the groups developed differently over time. Because the sessions are two different... sessions... they are entered as separate rows in a long format.

It would be very convenient to just enter

Dependent = RT
Within-subjects = session
Between-subjects = group

However, the students have to transform sessions into columns ("RT.pre", "RT.post") and then type in the levels again ("pre" and "post"), even though these were already given in the "session" variable. This becomes extremely cumbersome when there are several within-subject variables, e.g. in a project where we had congruency as a within-subjects factor, so we reshaped to the columns "RT.pre.incongruent", "RT.pre.congruent", "RT.post.incongruent", and "RT.post.congruent". It would be so much easier just to do

Dependent = RT
Within-subjects = session x congruency
Between-subjects = group

... and a huge win over SPSS! In addition, having just one dependent variable would be consistent with the layout of other analyses making it easy-peasy to understand what's going on. In our case, it was particularly bothersome because congruency was only important for some analyses, so we had to reshape the data depending on the analysis we wanted to do.

Implementation-wise, maybe there could be a radio button above the selected variables where you could select "wide format" or "long format" and the form would adjust appropriately. The wide format is in place. The long format would be a one-variable field with "dependent variable" and a list-field with "Within Subject Factors"
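For comparison, this is roughly what the long-format specification looks like on the R side. It is only a minimal sketch: the data are simulated, and the column names (id, session, group, RT) are assumptions chosen to match the example above.

```r
# Hypothetical long-format data: one row per subject per session.
set.seed(1)
long <- expand.grid(
  id      = factor(1:20),
  session = factor(c("pre", "post"), levels = c("pre", "post"))
)
# Assumption for illustration: subjects 1-10 are controls, 11-20 are treated.
long$group <- factor(ifelse(as.integer(long$id) <= 10, "control", "treatment"))
long$RT    <- rnorm(nrow(long), mean = 500, sd = 50)  # simulated outcome

# Mixed ANOVA: session varies within subjects, group between subjects.
# The group:session term is the interaction of primary interest.
fit <- aov(RT ~ group * session + Error(id / session), data = long)
summary(fit)
```

The point is that the dependent variable and the factors are each named once, with no per-level columns to create or label.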

juliusverrel commented 6 years ago

This would be a really helpful feature. Wide format tables, while preferred by, e.g., SPSS, are really awkward to work with if you have multiple dependent variables in combination with repeated measures. For a repeated measures design with two within-subject factors A (levels 1,2) and B (levels 1,2,3), and two DVs, say RT and PC (percent correct), in wide format, you easily end up with 12 columns, RT_A1_B1, RT_A1_B2, RT_A1_B3, ..., PC_A2_B3. Any additional DV or factor multiplies the number of columns.
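To make the column count concrete, here is a small sketch that enumerates the wide-format column names for this design. The naming scheme (DV_A_B) is just the one used in the example above.

```r
# Every combination of DV, factor A (2 levels) and factor B (3 levels)
# becomes its own column in wide format.
design <- expand.grid(
  B  = paste0("B", 1:3),
  A  = paste0("A", 1:2),
  DV = c("RT", "PC"),
  stringsAsFactors = FALSE
)
wide_names <- paste(design$DV, design$A, design$B, sep = "_")

length(wide_names)  # 2 DVs * 2 levels * 3 levels
wide_names
```

In long format the same design needs only five columns besides the subject id: A, B, RT and PC, plus whatever between-subject variables exist.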

Long format is also preferred for "tidy data".

Thus, as pointed out by others above, being able to analyze data in long format (rather than converting them to wide format, prior to import or within JASP) would be of great benefit.

JorisGoosen commented 6 years ago

My proposal for converting long to wide:

My first hunch for this was to have the loading process check whether the data might be in long format and prompt the user with the question of whether they want to convert it. It seems, however, that there isn't a straightforward heuristic to reliably determine whether the data is in long or wide format, and I don't want to bother people with prompts unnecessarily. So I'd suggest adding a button to the "Common" ribbon with the option to convert long -> wide, which opens up a conversion-gui.

The conversion-gui itself would guide a user through the conversion process as graphically as I can think of. It would show an overview of the available columns in the dataset, where the user can choose certain columns to be used for the conversion process.

Now, I am new to the whole wide vs. long tabular formats, but my understanding is that in the long format there are:

- certain columns that define the unit being observed; these uniquely define each row in the wide format. They are the columns by which one collapses, so to speak. If this is one column, then a row is added to the wide format for each unique value in the long format. If it is more than one column, each unique combination of values in these columns (by cartesian product) leads to a single row in wide format.
- certain columns that denote the observations that were made, which could also be called results. These would be measurements or something akin. These will be multiplied by the condition variables/columns, and the resulting cartesian product of the two will result in new columns in the wide format.
- the rest of the columns, which are either:
  - collapse variables, if they never change in conjunction with the chosen unique collapse variables (an example would be sex in combination with the chosen subject id, which would under normal circumstances remain stable during an experiment), or
  - conditional variables, if they do vary in combination with a unique collapse variable.

To make this a bit clearer an example:

| subNr | cond | reacTime | gender |
| --- | --- | --- | --- |
| 1 | control | 0.954449669415963 | M |
| 1 | blue | 0.0107843800066946 | M |
| 1 | red | 0.00625554891266237 | M |
| 1 | green | 0.390050820540637 | M |
| 2 | control | 0.681751611176878 | M |
| 2 | blue | 0.508678092155606 | M |
| 2 | red | 1.38064901805939 | M |
| 2 | green | 0.358425999991596 | M |
| 3 | control | 1.8436117748821 | F |
| 3 | blue | 0.844668731091743 | F |
| 3 | red | 4.30149267961211 | F |
| 3 | green | 0.150224776007235 | F |
| 4 | control | 0.502509086858481 | F |
| 4 | blue | 0.214017302504597 | F |
| 4 | red | 4.76388974835105 | F |
| 4 | green | 1.26682419605078 | F |

To change this to wide, I would suggest that the user select a collapse variable, which would be "subNr" ("gender" could be selected as well, but suppose the user didn't select it). Then the user selects a result column, "reacTime" here.

The heuristic of the gui would then see that, for each unique value of the collapse variable "subNr", "gender" is always the same. It would then be treated as a collapse variable as well, which is to say it will not be multiplied by the conditions. The only conditional column here is "cond", so each of its unique values will be taken and coupled with the result column "reacTime".
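The heuristic described here could be sketched in R roughly as follows. This is only an illustration of the idea, not JASP code: classify_columns() is a hypothetical helper, and the reacTime values are the ones from the example, rounded to three decimals.

```r
# Example data from above (reacTime rounded for brevity).
long <- data.frame(
  subNr    = rep(1:4, each = 4),
  cond     = rep(c("control", "blue", "red", "green"), times = 4),
  reacTime = c(0.954, 0.011, 0.006, 0.390,
               0.682, 0.509, 1.381, 0.358,
               1.844, 0.845, 4.301, 0.150,
               0.503, 0.214, 4.764, 1.267),
  gender   = rep(c("M", "M", "F", "F"), each = 4)
)

# Hypothetical helper: label every column that is neither a chosen collapse
# variable nor a result column as "collapse" (constant within each unit)
# or "condition" (varies within a unit).
classify_columns <- function(data, id_cols, result_cols) {
  rest <- setdiff(names(data), c(id_cols, result_cols))
  unit <- interaction(data[id_cols], drop = TRUE)
  vapply(rest, function(col) {
    n_values <- tapply(data[[col]], unit, function(x) length(unique(x)))
    if (all(n_values == 1)) "collapse" else "condition"
  }, character(1))
}

roles <- classify_columns(long, id_cols = "subNr", result_cols = "reacTime")
roles
```

With "subNr" as the only chosen collapse variable, "gender" comes out as a collapse variable and "cond" as a condition, matching the description above.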

The resulting columns in wide format would be: subNr, gender, reacTime_control, reacTime_red, reacTime_green, reacTime_blue. These columns will be shown to the user with editable names, and if the user accepts, the conversion will be made. The user could also decide to drop certain columns, and all of this will be done through the gui.
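For reference, the same conversion can be reproduced with base R's reshape(). This is a sketch of the intended result only, not of how JASP would implement it; the reacTime values are rounded to three decimals.

```r
# Example data from above (reacTime rounded for brevity).
long <- data.frame(
  subNr    = rep(1:4, each = 4),
  cond     = rep(c("control", "blue", "red", "green"), times = 4),
  reacTime = c(0.954, 0.011, 0.006, 0.390,
               0.682, 0.509, 1.381, 0.358,
               1.844, 0.845, 4.301, 0.150,
               0.503, 0.214, 4.764, 1.267),
  gender   = rep(c("M", "M", "F", "F"), each = 4)
)

# idvar   = collapse variable, one output row per value
# timevar = conditional column, one output column per value
# v.names = result column that gets spread across the conditions
# gender is constant within subNr, so it is carried along unchanged.
wide <- reshape(long, idvar = "subNr", timevar = "cond",
                v.names = "reacTime", direction = "wide", sep = "_")
names(wide)
```

The result has one row per subject and one reacTime column per condition.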

If there were two conditional columns, say an additional column "direction" with the values "up" and "down", the resulting columns would be: subNr, gender, reacTime_control_up, reacTime_red_up, reacTime_green_up, reacTime_blue_up, reacTime_control_down, reacTime_red_down, reacTime_green_down, reacTime_blue_down.
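A rough R sketch of the two-condition case, assuming the two conditional columns are first pasted into a single key. The data here are simulated, and the "key" column is a hypothetical intermediate.

```r
# Simulated long data: 4 subjects x 4 conditions x 2 directions.
set.seed(1)
long <- expand.grid(
  cond      = c("control", "red", "green", "blue"),
  direction = c("up", "down"),
  subNr     = 1:4,
  stringsAsFactors = FALSE
)
long$gender   <- ifelse(long$subNr <= 2, "M", "F")
long$reacTime <- round(rexp(nrow(long)), 3)  # made-up reaction times

# Combine both conditional columns into one key, then spread that key.
long$key <- paste(long$cond, long$direction, sep = "_")
wide <- reshape(long, idvar = "subNr", timevar = "key", v.names = "reacTime",
                drop = c("cond", "direction"), direction = "wide", sep = "_")
names(wide)
```

Each subject ends up with eight reacTime columns, one per cond/direction combination.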

Supposing instead that the same column "direction" is actually a result, then the resulting columns would be: subNr, gender, reacTime_control, reacTime_red, reacTime_green, reacTime_blue, direction_control, direction_red, direction_green, direction_blue.
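In base R terms, treating "direction" as a second result simply means listing it in v.names next to reacTime. A minimal sketch, where the direction values are made up for illustration and reacTime is rounded to three decimals:

```r
# Example data from above (reacTime rounded for brevity).
long <- data.frame(
  subNr    = rep(1:4, each = 4),
  cond     = rep(c("control", "blue", "red", "green"), times = 4),
  reacTime = c(0.954, 0.011, 0.006, 0.390,
               0.682, 0.509, 1.381, 0.358,
               1.844, 0.845, 4.301, 0.150,
               0.503, 0.214, 4.764, 1.267),
  gender   = rep(c("M", "M", "F", "F"), each = 4)
)
# Hypothetical second result column, recorded once per row.
long$direction <- rep(c("up", "down"), times = 8)

# Both result columns are spread across the conditions.
wide <- reshape(long, idvar = "subNr", timevar = "cond",
                v.names = c("reacTime", "direction"),
                direction = "wide", sep = "_")
names(wide)
```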

Does this sound good?

juliusverrel commented 6 years ago

@JorisGoosen: Thank you for making this more specific. If we stick to converting from long to wide, your proposal sounds very good. Be aware that incomplete designs (missing combinations of factor levels) are less obvious in long format, so these might only become apparent (and produce errors) during the conversion process.
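One way such gaps could be detected before converting is to cross-tabulate units against conditions and look for empty cells. A small sketch with hypothetical data, in which one observation has been removed on purpose:

```r
# Hypothetical complete design: 4 subjects x 4 conditions.
long <- expand.grid(
  subNr = 1:4,
  cond  = c("control", "blue", "red", "green"),
  stringsAsFactors = FALSE
)
# Remove one cell to mimic a missing observation.
long <- long[!(long$subNr == 2 & long$cond == "red"), ]

# Count rows per subject/condition cell; a count of 0 marks a gap
# that would become an empty (NA) cell in wide format.
counts <- as.data.frame(table(subNr = long$subNr, cond = long$cond))
missing_cells <- counts[counts$Freq == 0, c("subNr", "cond")]
missing_cells
```

Counts above 1 would be worth flagging in the same pass, since duplicated cells are just as ambiguous for the conversion.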

However, I'd like to point out that converting long data to wide format is probably not the optimal procedure. For a data set with many within-subject variables the current procedure (e.g., for ANOVA) of selecting a data column for each combination of factor levels is quite awkward and error-prone. With, say, 3 within-factors with 2/3/4 levels, this means manually selecting 24 columns. If the data were left in long format (and JASP "understood" long format, of course), one could simply select those three columns which define the factors. This would be A LOT simpler for users and likely would produce fewer errors.

So, I'd like to suggest that it would be better to make JASP flexible re long vs. wide format - not by converting everything to wide format, but by allowing analyses in either of the two.

JorisGoosen commented 6 years ago

@juliusverrel Thank you for looking at my proposal, that is always helpful. However I think you might have misunderstood the way I see the user-interaction, in your example:

there would be 3 columns in the long-format right?
where the columns have 2, 3 or 4 levels respectively?

The way the gui will work is that once you select the three columns for use as conditions the code will create the new 24 columns automatically. This is easy because JASP will simply scan the contents of each 'long'-column and can then generate the new 'wide'-columns based on that. The only thing a user might want to do is rename each of the 24 columns (but they would be given sensible names in any case, they might just be a bit long).
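As a rough illustration of that scan, written in R rather than JASP's own code: the column names and levels below are hypothetical, and the naming scheme is only one possibility.

```r
# Hypothetical long data with three condition columns (4, 3 and 2 levels).
long <- expand.grid(
  difficulty = c("easy", "difficult"),
  duration   = c("short", "medium", "long"),
  colour     = c("control", "red", "green", "blue"),
  subNr      = 1:3,
  stringsAsFactors = FALSE
)
conditions <- c("colour", "duration", "difficulty")

# Scan each selected condition column for its unique values ...
levels_found <- lapply(long[conditions], unique)

# ... and build one wide column name per combination of those values.
combos <- expand.grid(levels_found, stringsAsFactors = FALSE)
wide_names <- paste("reacTime", combos$colour, combos$duration,
                    combos$difficulty, sep = "_")
length(wide_names)
```

The user only picks the three condition columns; the 24 names follow from the data.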

The other suggestion, making JASP play nicely with data in long format directly, would require a big rewrite of the central code of JASP and of the way the data is represented in the table, and would also require a rewrite of every single analysis available. This is of course quite a lot more work than adding an extra conversion step. It doesn't seem worthwhile to go to all that effort for something that will in the end not make a difference for the results (because the data will ultimately be the same).

juliusverrel commented 6 years ago

@JorisGoosen Sorry for not replying earlier. Your description of long-to-wide conversion sounds good. This would be "a" solution, but I'd still like to emphasize that the option to analyze long-format data without prior conversion would be MUCH better, for a lot of reasons:

- Wide format is bad style according to "tidy data" standards.
- Long format is preferred by serious statistics software (e.g., most R packages).
- Most importantly: selecting appropriate columns for a repeated measures (RM) analysis in the current implementation is painful and error-prone as soon as you have more than one within-subject factor. In my example (three factors with 2 x 3 x 4 levels) you have to pick 24 columns called, say, reacTime_green_short_easy, reacTime_blue_long_difficult, reacTime_red_medium_easy, ... Any tiny mistake in assigning these 24 names to the 2 x 3 x 4 factor combinations will mess up your analysis, and this is something you have to repeat for every analysis and every dependent variable. Please believe me, this is really, really problematic.

Actually, if you prefer to stick to either wide or long format, I would strongly advise going for long (and adding wide-to-long conversion for people used to wide).

Or, maybe preferably, implement both options. You wouldn't need to rewrite every single analysis: this would only affect RM analyses (including paired t-tests) - as far as i can see, there are only 4 of these.

Many thanks for considering this. I wouldn't insist if I didn't think this would be a great improvement!

juliusverrel commented 6 years ago

Any news on this? Sorry to insist, but this would very much simplify our everyday lab life :-)

JorisGoosen commented 6 years ago

Hello @juliusverrel,

No news yet, I am kind of busy with other projects. But we were just talking about this, actually, and it has come to my attention that the repeated measures ANOVA in JASP uses a form of conversion to long format internally. It might be useful to abstract this out and make it generally available for JASP analyses.

@JohnnyDoorn what is your thought on that? Would that be feasible?

Then we would still need the long-to-wide format conversion to operate well within the framework of JASP itself though.

JorisGoosen commented 5 years ago

@EJWagenmakers came up with the following link, and I'm adding it here for future reference: https://twitter.com/GuyProchilo/status/1180928746041294849?s=19

TheDom42 commented 4 years ago

I'm also very interested in using long format data in JASP. As far as I can tell from the implementation of the LMMs and the accompanying GIF in JASP 0.13, long data seems to be fully implemented now, is that correct? Or is this only the case for LMMs?

JorisGoosen commented 4 years ago

I'm going to guess it was a special conversion specifically for that analysis, like with the RM ANOVA, right @FBartos?

FBartos commented 4 years ago

Long data are the only possible input for LMMs and, as far as I know, it's (unfortunately) the only analysis that supports them. (There is no conversion happening; the long data are loaded and analyzed directly.)

gfaity commented 4 years ago

I'm also interested in the possibility of using long format in, for example, t-tests. I actually work with R (long format), but with new results I prefer to do a cross-check analysis with JASP (to avoid human error). And when I give my students some data to analyze, it would save time not to have to change the format of my data before giving it to them.

JorisGoosen commented 4 years ago

Well it certainly is on my to-do list, I'm hoping to have the time to work on it somewhere in the coming months.

DrLarsson commented 2 years ago

I came here to suggest a wide-to-long conversion, but it seems that is already being worked on. Is there any update on such a feature?

JorisGoosen commented 2 years ago

Well, the last time I claimed I was hoping to get to this in a couple of months I was apparently off by a couple of years...

But while we are currently working hard on releasing 0.16.2 and 0.16.3 the version after that will contain a complete overhaul of the internal storage of data within JASP as it runs.

This is the first step towards actually implementing a more flexible way of handling data, in the first instance to support proper data editing in general. But that could be followed shortly by supporting long-to-wide-and-back conversions as necessary. With a bit of luck this will be something for the end of the year.

DrLarsson commented 2 years ago

I would be forever grateful!

JorisGoosen commented 2 years ago

Judging by the number of requests we get for this over time, I am sure you are not the only one!

We definitely haven't forgotten about it; it just isn't exactly trivial to do, as the whole application has been set up from the start to handle only wide data up until now. But we'll get there eventually ^^

tomtomme commented 10 months ago

Besides tidyr::pivot_wider() (long to wide), there is also the melt() function to convert wide to long:

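# Load the reshape2 package to use the melt() function
library(reshape2)

# Melt the wide data frame `data` into long format. Without id.vars,
# melt() treats all factor and character columns as id variables.
melt_data <- melt(data)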

which seems even easier/shorter to code
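For completeness, the same wide-to-long step can also be done without extra packages using base R's reshape(). A minimal sketch, where the data frame and its values are hypothetical:

```r
# Hypothetical wide data: one row per subject, one column per condition.
wide <- data.frame(
  subNr            = 1:3,
  gender           = c("M", "M", "F"),
  reacTime_control = c(0.9, 0.7, 1.8),
  reacTime_red     = c(0.1, 1.4, 4.3)
)
conds <- c("control", "red")

# varying = the wide columns to stack, v.names = name of the stacked column,
# timevar/times = new column holding the condition label for each row.
long <- reshape(wide, direction = "long", idvar = "subNr",
                varying = paste0("reacTime_", conds),
                v.names = "reacTime",
                timevar = "cond", times = conds)
long <- long[order(long$subNr), ]
long
```

This is more verbose than melt(), but it makes explicit which columns are stacked and what the new condition column is called.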

GregorDall commented 9 months ago

Hi,

To add to this discussion: I usually work in R and am trying to set up JASP as a way of empowering students to do their statistical analysis. It looks like JASP is a great tool for that, in particular because it supports mixed model analysis. I see, however, two crucial features missing: filtering and pivoting (long to wide, wide to long) data. These do not necessarily need to change how JASP works, but could be separate features in the Edit data window. As @tomtomme mentioned, these are easily implemented on the R side of things using dplyr or tidyr; I do not know about the GUI side of things though. This comment is meant as an encouraging post for the developers, underlining the necessity of such features.

Cheers Gregor

tomtomme commented 9 months ago

@GregorDall Nice that you like JASP. Filtering, however, is available. In the data view, click the black filter button on the left to filter metric variables. For categorical variables, double-click the variable name to open the filter view.

jasp-stats / jasp-issues

analyze long format data #33