pepkit / pephub

A web API and database for biological sample metadata
https://pephub.databio.org
BSD 2-Clause "Simplified" License

How do we convert the `handsontable` sample table representation to a PEP-compatible sample table representation? #376

Open nleroy917 opened 1 month ago

nleroy917 commented 1 month ago

Overview

Probably the most bug-prone step in the sample table of the PEPhub UI is converting the data representation used by handsontable into the data representation used in our database. Specifically, we need to convert an array of arrays into an array of objects. You can view the function that is currently deployed.

Essentially the function must convert this:

[
  ['col1', 'col2', 'col3'],
  ['s1_col1', 's1_col2', 's1_col3'],
  ['s2_col1', 's2_col2', 's2_col3'],
]

Into this:

[
  { col1: 's1_col1', col2: 's1_col2', col3: 's1_col3' },
  { col1: 's2_col1', col2: 's2_col2', col3: 's2_col3' },
]
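
For reference, here is a minimal sketch of the naive conversion in Python. It is not the deployed function; the name `rows_to_samples` is hypothetical, and it assumes the first row is always the header. It performs no validation.

```python
from typing import Any


def rows_to_samples(table: list[list[Any]]) -> list[dict[str, Any]]:
    """Naive conversion: treat the first row as the header, the rest as samples."""
    if not table:
        return []
    header, *rows = table
    # zip pairs each column name with the value at the same position in the row
    return [dict(zip(header, row)) for row in rows]


table = [
    ["col1", "col2", "col3"],
    ["s1_col1", "s1_col2", "s1_col3"],
    ["s2_col1", "s2_col2", "s2_col3"],
]
samples = rows_to_samples(table)
```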

Things that make it tricky

This conversion has several problems, and there are questions that need to be answered:

  1. What happens if a user has duplicate column names? This leads to data loss, because attributes with the same name overwrite each other.
  2. What if a user has an empty column? This leads to objects with null as an attribute (which feels wrong).
  3. What if the user skips a row? Should it be kept as a blank sample, or should the function be smart enough to know that they don't want it as a sample?
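
To make these three cases concrete, here is a small Python sketch of what a naive conversion does with each of them. The helper `naive_convert` is hypothetical, and the sketch assumes that empty cells arrive as `None` (null).

```python
def naive_convert(table):
    """Hypothetical naive conversion with no validation at all."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# 1. Duplicate column names: the later value overwrites the earlier one,
#    so "a" is silently lost.
dup = naive_convert([["col1", "col1"], ["a", "b"]])

# 2. Empty column header: the object gets None (null) as an attribute name.
empty_col = naive_convert([["col1", None], ["a", "b"]])

# 3. Skipped row: the blank row becomes a sample whose values are all None.
skipped = naive_convert(
    [["col1", "col2"], ["a", "b"], [None, None], ["c", "d"]]
)
```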

The conversion seems to be lossy by nature. Balancing "no magic behind the scenes" against promptly warning the user about potential errors makes the function quite difficult to write, and I am looking for assistance.

nleroy917 commented 1 month ago

This issue hopefully addresses others raised:

nsheff commented 1 month ago

Here are some thoughts:

  1. Do as much validation as possible on the server, not on the client. This way, anything posting bad data, whether from this particular client or other clients, will get a nice response. If we make the validators here, then other attempts to update things will run into issues.
  2. Consider converting from array of arrays to array of objects inside Python.

I think to solve your concern, I would write something that seems simpler than the function you linked. I would just do this:

  1. Consider first the header row.
  2. Check for duplicates. If any duplicates, return "Duplicate column header error" and fail.
  3. Check for nulls. If nulls are at the end of the array, do nothing (I guess just discard them). If nulls are in the middle of the array (there are values after nulls), return "Missing column header" error.
  4. Now look at the row data. Check for any rows with everything null. If there are rows with everything null, just remove them.
  5. Otherwise, create your objects and try to insert them into the table.
  6. Return any error from the database if the insert fails.
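
Steps 1 to 5 above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: `SampleTableError` and `table_to_samples` are hypothetical names, empty cells are assumed to arrive as `None`, and values sitting under discarded trailing null headers are dropped. The database insert and its errors (step 6) are left out.

```python
from typing import Any


class SampleTableError(ValueError):
    """Hypothetical error type for an invalid sample table."""


def table_to_samples(table: list[list[Any]]) -> list[dict[str, Any]]:
    if not table:
        return []
    header, *rows = table
    header = list(header)

    # Step 2: duplicate column names (ignoring nulls) are an error.
    names = [name for name in header if name is not None]
    if len(set(names)) != len(names):
        raise SampleTableError("Duplicate column header error")

    # Step 3: discard trailing nulls; a null followed by a value is an error.
    while header and header[-1] is None:
        header.pop()
    if any(name is None for name in header):
        raise SampleTableError("Missing column header")

    samples = []
    for row in rows:
        cells = row[: len(header)]  # assumption: ignore headerless columns
        # Step 4: remove rows in which every cell is null.
        if all(value is None for value in cells):
            continue
        # Step 5: build the object for this sample.
        samples.append(dict(zip(header, cells)))
    return samples
```

Raising on the header problems while silently dropping blank rows follows the steps as written; whether dropped rows should also produce a warning is still an open question.
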

sanghoonio commented 1 month ago

@khoroshevskyi @nleroy917 how much processing should we do server-side?