prio-data / viewser

A CLI for interacting with the ViEWS 3 backend

Deserialization error for some querysets #34

Open Peder2911 opened 2 years ago

Peder2911 commented 2 years ago

@jimdale and I found an issue where viewser raises a DeserializationError even though the response obviously contains at least partial Parquet data:

DeserializationError: DeserializationError:

  Description:
                Could not deserialize as parquet: "b'PAR1\x15\x04\x15\xe0D\x15\xf8?L\x15\xcc\x08\x15\x04\x12\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03-Wi8Vk\x1b5e\x1e\xdei\x8f\xafY*\x91\xc21'..."

This only seems to happen with certain querysets. The queryset that led to this error was:

from viewser import Queryset, Column

# thetacrit_tree is a parameter for the tree-lag transform; it is defined
# elsewhere in the original code and is not shown here.
queryset = (Queryset("jim_fatalities_conflict_history_lag_tdecay", "priogrid_month")

            # target variable
            .with_column(Column("ln_ged_sb", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                        )

            # spatial-tree-lagged d^-2 target variable
             .with_column(Column("ln_ged_sb_treelag_2_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,2)
                        )

            # 1 tlagged spatial-tree-lagged d^-2 target variable
             .with_column(Column("ln_ged_tlag_1_sb_treelag_2_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,2)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                        )

            # spatial-tree-lagged d^-1 target variable
             .with_column(Column("ln_ged_sb_treelag_1_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,1)
                        )

            # 1 tlagged spatial-tree-lagged d^-1 target variable
             .with_column(Column("ln_ged_tlag_1_sb_treelag_1_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,1)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                        )

            # spatial-tree-lagged ln(1+d) target variable
             .with_column(Column("ln_ged_sb_treelag_0_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,0)
                        )

            # 1 tlagged spatial-tree-lagged ln(1+d) target variable
             .with_column(Column("ln_ged_tlag_1_sb_treelag_0_th1_0", from_table = "ged2_pgm", from_column = "ged_sb_best_sum_nokgi")
                         .transform.missing.fill()
                         .transform.ops.ln()
                         .transform.spatial.treelag(thetacrit_tree,0)
                         .transform.temporal.tlag(1)
                         .transform.missing.fill()
                        )
             )
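
For context, the error only surfaces when the queryset is actually fetched. The exact call is not quoted in this issue, but assuming the usual viewser workflow, it would look roughly like the line below.

# Hedged sketch of the fetch step (not quoted above): publish the queryset
# definition to the backend, then fetch it; the DeserializationError is raised
# while decoding the Parquet bytes returned by the fetch.
data = queryset.publish().fetch()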

To begin diagnosing this, we need to write some tooling for dumping the erroneous response data, so we can see exactly what is being returned that cannot be deserialized. That will give us a clue about whether the issue is caused by something upstream or by the deserialization step in viewser.
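
As a minimal sketch of what such tooling could look like, assuming the raw payload can be captured as bytes (the function and file names below are ours for illustration, not part of viewser):

import io

import pyarrow.parquet as pq


def dump_and_inspect(payload: bytes, path: str = "bad_response.parquet") -> None:
    """Write the raw response bytes to disk, then try to deserialize them."""
    with open(path, "wb") as f:
        f.write(payload)
    try:
        table = pq.read_table(io.BytesIO(payload))
        print(f"deserialized OK: {table.num_rows} rows, {table.num_columns} columns")
    except Exception as error:
        print(f"could not deserialize {len(payload)} bytes: {error}")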

A clue is that no exception is raised upstream, which means the data is written to Parquet and sent off without issue. This points towards something being wrong with viewser.

Peder2911 commented 2 years ago

This seems to have something to do with our current network topology, as no issue has been found with viewser itself. We have inspected data dumped when the problem occurs: there is no exception and no deserialization malfunction in viewser. The data simply arrives incomplete, which suggests that something interrupts the connection mid-transfer.
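
One quick way to confirm truncation on a dumped payload: a complete Parquet file both starts and ends with the four magic bytes b"PAR1", with the footer at the very end, so a payload cut off mid-transfer keeps the leading magic (as in the error above) but fails a check like this sketch:

def looks_truncated(payload: bytes) -> bool:
    """Heuristic: a complete Parquet file starts and ends with the b'PAR1' magic."""
    return not (len(payload) >= 12
                and payload[:4] == b"PAR1"
                and payload[-4:] == b"PAR1")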

It will be interesting to see if this issue persists with our new servers.