spotfiresoftware / spotfire-mods

Spotfire® Mods
https://spotfiresoftware.github.io/spotfire-mods/

Critical issue: does Mods automatically remove duplicated data rows? #78

Closed mwdpb closed 2 years ago

mwdpb commented 2 years ago

It appears that mods automatically remove duplicated data rows. This is fatal when analyzing statistical distributions. Could you please confirm one way or the other? If so, how can I fix it? This is a critical issue, so please take a look as soon as possible. Thank you!

The screen capture below shows that one data point is missing because its "Measurement" value (1939.20) is the same as another row's, even though the "C.Row" is different. This is normally not a problem, because we would likely differentiate the rows by coloring by "C.Row". But when doing statistical analysis, we pool everything together and therefore no longer color by "C.Row". Notice that the summary table below the histogram plot shows a sample count of 4.

image

To confirm further, I checked the data view row count: it is 4, and the data readout is missing one of the two 1939.20 points. Below are the code snippet and the console output.

image

export async function prepareDataObject(dataView, yExpression, colorExpression) {
  const data = {};
  const rows = await dataView.allRows();
  console.log('dataView row count is: ', rows.length);

  rows.forEach(row => {
    const cname =
      colorExpression === '<>' ? yExpression : row.categorical('Color').formattedValue();
    const cval = row.color().hexCode;
    const key = makeKey(cname, cval);
    const yval = row.continuous('Y').value();
    // Test for a finite number explicitly: the truthiness check
    // `if (parseFloat(yval))` also discarded legitimate 0 values.
    if (Number.isFinite(parseFloat(yval))) {
      if (key in data) {
        data[key].push(yval);
      } else {
        data[key] = [yval];
      }
    }
  });

  console.log(data[Object.keys(data)[0]]);
  return data;
}

export const sep = '-$-';
const makeKey = (cname, cval) => (cname + sep + cval).replace(/\s/g, '');

objerke commented 2 years ago

Hi @mwdpb. You are correct that duplicate rows are combined. This is because mods data views are always aggregated.

The solution is the same as in this related question; the number of collapsed duplicate rows when allowNonAggregatingMeasures is used can be found by adding an extra axis with "count()" as its expression.

hski-github commented 2 years ago

Are you working on an alternative version of the Box Plot? Are you then doing the statistical calculation in the mod with JavaScript?

If you add the primary key of the measurement, you formally still get an aggregated data view, but at the detail level of an individual measurement (group by measurementID; if the ID is unique, you get the same rows as in the data table). You are currently grouping by measurement value, and then you get one row per distinct measurement value.

By the way, this will not scale to a huge number of measurement values, because all the data will be transferred to the mod in the client web browser.

mwdpb commented 2 years ago

Hi @objerke, thanks for the quick response. Could you elaborate on how to add the "count()" function? Here is what I got when I added it as a second column.

image

mwdpb commented 2 years ago

Yes, I'm working on a custom Box Plot with a mod. Everything worked smoothly until I ran into this aggregation issue. Could you explain how to group by measurementID? I tried using the "OVER" function but couldn't get it working.

By the way, what is the reason for collapsing duplicated rows in the first place? I think people will be caught by surprise when they try to use a mod to display a data table with identical rows.

hski-github commented 2 years ago

Assuming data like this:

Board | Measurement ID | Measurement Value
A | 1 | 1900,58
A | 2 | 1924,35
A | 3 | 1939,2
A | 4 | 1939,2
A | 5 | 1918,95

A mod works on an aggregated view of the data. You don't get access to the underlying data table. You can think of the mod data view as a kind of pivot table or cross table. It needs a categorical axis and a continuous axis with an aggregation. Try it with a normal pivot table / cross table in Spotfire.

You have configured your mod like this cross table:

image

If you define your mod axes like this, you still have an aggregated view (the average of Measurement Value), but because you group by a unique key, it is the average of a single value:

image
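To make the difference concrete, here is a minimal sketch in plain JavaScript that imitates the two groupings on the example table above. It does not use the Spotfire Mods API; the `aggregate` helper and the field names are made up for illustration, and the real aggregation happens inside Spotfire.

```javascript
// Plain JavaScript simulation of the two groupings; no Spotfire API involved.
const table = [
  { board: "A", id: 1, value: 1900.58 },
  { board: "A", id: 2, value: 1924.35 },
  { board: "A", id: 3, value: 1939.2 },
  { board: "A", id: 4, value: 1939.2 },
  { board: "A", id: 5, value: 1918.95 },
];

// Group rows by keyOf(row) and compute the average and the row count per group,
// roughly what a cross table with Avg(Measurement Value) and Count() would show.
function aggregate(rows, keyOf) {
  const groups = new Map();
  for (const row of rows) {
    const key = keyOf(row);
    const group = groups.get(key) || { key, sum: 0, count: 0 };
    group.sum += row.value;
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups.values()].map(g => ({
    key: g.key,
    avg: g.sum / g.count,
    count: g.count,
  }));
}

// Grouping by the measurement value merges the two 1939.2 rows into one.
const byValue = aggregate(table, r => r.value);
// Grouping by the unique measurement ID keeps every row.
const byId = aggregate(table, r => r.id);

console.log(byValue.length); // 4
console.log(byId.length);    // 5
```

Grouping by value yields four rows, which matches the row count of 4 reported at the top of this thread, while grouping by the unique ID yields all five.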

What @objerke is proposing is to continue with your approach but add rowCount() as an aggregation. Then you know that a certain value occurred multiple times in the original data:

image

objerke commented 2 years ago

@mwdpb I meant that you can add an additional (potentially hidden) continuous axis in your mod manifest. That axis can be programmatically set to always be Count().

Something like this can be added to the axes in your mod-manifest.json

{
    "name": "Fixed count axis",
    "mode": "continuous",
    "legendItem": {"defaultVisibility": "hidden"},
    "propertyControl": {"visibility": "hidden"}
}

And then you can set this to Count from your initialization code in the mod:

    mod.visualization.axis("Fixed count axis").setExpression("Count()");

This should preferably be done via a button since it is a modification and will add an extra undo-step.
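Once the count axis is in place, the mod can repeat each aggregated value as many times as it occurred. The following is only a sketch under assumptions: it reads values with `row.continuous(name).value()` as in the snippet earlier in this thread, it assumes the axis names "Y" and "Fixed count axis", and `expandByCount` and `fakeRow` are hypothetical helpers. The fake rows stand in for real data view rows so the function can be tried outside Spotfire.

```javascript
// Repeat each aggregated Y value as many times as the count axis reports.
// The default axis names are assumptions taken from this thread.
function expandByCount(rows, yAxis = "Y", countAxis = "Fixed count axis") {
  const values = [];
  for (const row of rows) {
    const y = row.continuous(yAxis).value();
    const n = row.continuous(countAxis).value();
    if (!Number.isFinite(y)) continue; // skip empty (null) or non-numeric values
    for (let i = 0; i < (n || 0); i++) {
      values.push(y);
    }
  }
  return values;
}

// Hypothetical stand-in for a data view row, only for trying this outside Spotfire.
const fakeRow = (y, count) => ({
  continuous: name => ({ value: () => (name === "Y" ? y : count) }),
});

const aggregatedRows = [
  fakeRow(1900.58, 1),
  fakeRow(1924.35, 1),
  fakeRow(1939.2, 2), // two original rows collapsed into one
  fakeRow(1918.95, 1),
];

const values = expandByCount(aggregatedRows);
console.log(values.length); // 5
```

In a real mod, the array passed in would be the result of `await dataView.allRows()`, and the expanded values could then feed the statistical calculation.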

@hski-github Thank you for your great example!

ayh20 commented 2 years ago

Please vote on the Ideas portal if you want mods to be able to get unaggregated data: https://ideas.tibco.com/ideas/TS-I-8069. It's perfectly reasonable to have Spotfire get the data into the best format before passing it to a mod; however, there are use cases for mods where unaggregated data is required.

mwdpb commented 2 years ago

Hi @objerke @hski-github @ayh20, thank you all! I'm clear now on how mod data works under the hood, and I agree that this pre-processing is useful for most use cases. I did add 5 votes on the Ideas portal, as sometimes passing raw data to end users is also warranted.

Thanks again for the help!