microbiomedata / sheets_and_friends

Enhance a LinkML model with imported and optionally modified slots

Swap first two columns in our DataHarmonizer interfaces #144

Closed turbomam closed 2 years ago

turbomam commented 2 years ago

As of 2022-07-29, the first two columns in our DataHarmonizer interfaces are:

  1. source_mat_id, which MIxS titles "source material identifiers" (but we have re-titled "globally unique ID" in sheets-for-nmdc-submission-schema)
  2. samp_name, which MIxS titles "sample name". That title isn't modified in sheets-for-nmdc-submission-schema, but some other attributes might have been.

I believe @mslarae13 would like to swap their order on all NMDC DataHarmonizer interfaces. In that case, samp_name is the column that will stay frozen when a user scrolls to the right. @mslarae13, did you have some additional expectation of how that would affect the way the DataHarmonizer saves the data?

I don't have any objection to this reordering.

mslarae13 commented 2 years ago

I don't. Fields all the same. Just want users to be able to see their human readable ID (samp_name) when they scroll right so they know what samples they're adding metadata for at each row.

turbomam commented 2 years ago

Thanks. I have to shut down for the day but will apply this change on Monday. It will be a matter of changing the rank values for source_mat_id and samp_name in sheets-for-nmdc-submission-schema. I just haven't remembered which tab it goes in yet!

turbomam commented 2 years ago

aha, mixin_slots

we'll have to do it for each section that is modeled as a LinkML mixin.

turbomam commented 2 years ago

For all of our existing submission portal templates, I have swapped samp_name into the first column from the left and source_mat_id into the second column from the left.

I did that in the import_slots_regardless sheet. Unfortunately, I labelled the variable that determines column ordering "rank" in some sheets-for-nmdc-submission-schema sheets and "column order" in others. That should probably all be harmonized to "rank", which is a real LinkML term.
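As a rough illustration of how rank drives column order, here is a minimal Python sketch. The rank numbers and both helper functions are hypothetical and not taken from sheets-for-nmdc-submission-schema; the only assumption is that a lower rank means a column further to the left.

```python
# Hypothetical rank values for illustration; the real ones live in the sheets.
ranks = {"source_mat_id": 1, "samp_name": 2, "collection_date": 3}


def column_order(slot_ranks):
    """Return slot names sorted left to right by ascending rank."""
    return sorted(slot_ranks, key=lambda name: slot_ranks[name])


def swap_ranks(slot_ranks, a, b):
    """Return a copy of slot_ranks with the ranks of slots a and b exchanged."""
    swapped = dict(slot_ranks)
    swapped[a], swapped[b] = slot_ranks[b], slot_ranks[a]
    return swapped


before = column_order(ranks)
after = column_order(swap_ranks(ranks, "source_mat_id", "samp_name"))
```

Only the two rank values change; every other slot keeps its position.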

You can see this at https://microbiomedata.github.io/sheets_and_friends/main.html?template=nmdc_submission_schema/soil_emsl_jgi_mg

You can still load https://github.com/microbiomedata/sheets_and_friends/blob/main/artifacts/for_data_harmonizer_template/exampleInput/soil_emsl_jgi_mg_example_data.tsv as example data, but DataHarmonizer will warn you that the columns provided in the example data don't match the columns the template expects.
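The kind of mismatch behind that warning can be illustrated with a position-by-position comparison. This is only a sketch, not DataHarmonizer's actual check; `find_column_mismatches` is a hypothetical helper and the column lists are shortened for illustration.

```python
def find_column_mismatches(expected, provided):
    """Return (position, expected_name, provided_name) for each position that differs."""
    mismatches = []
    for i in range(max(len(expected), len(provided))):
        exp = expected[i] if i < len(expected) else None
        got = provided[i] if i < len(provided) else None
        if exp != got:
            mismatches.append((i, exp, got))
    return mismatches


# Template after the swap versus an example file written before the swap.
template_columns = ["samp_name", "source_mat_id", "collection_date"]
example_file_columns = ["source_mat_id", "samp_name", "collection_date"]

mismatches = find_column_mismatches(template_columns, example_file_columns)
```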

That's all the thought I've given to the implications of changing column order. I think I'll still be able to retrieve the columns in the right order from the metadata_submission API, since the column names are being provided by the API response. But maybe there are some gotchas that I didn't anticipate.

This was all built in the main branch of sheets_and_friends, which is using DataHarmonizer commit 5c9182e as a git submodule.

@mcovalt and @pkalita-lbl : my understanding is that, if I pulled the latest DataHarmonizer commit into the submodule used by sheets_and_friends, my build and deploy workflows wouldn't work any more. @pkalita-lbl showed me how to display DH interfaces in the webpack/yarn dev server, but I haven't learned how to build a bundle that could be published into GH pages yet. I get the sense that even if I continue to publish to GH pages for showing my work to colleagues, that won't trigger an update to the dev or prod submission portals. I'd like to learn how to indicate that new templates are ready for the portals.

pkalita-lbl commented 2 years ago

Yeah, changes I made to DataHarmonizer broke the existing GitHub Pages deployment in sheets_and_friends. Since NMDC isn't consuming the GitHub Pages content via an iframe anymore, I didn't think anyone was relying on it. sheets_and_friends can still have a GitHub Pages deployment, it just needs to be set up a little differently. Unfortunately I'm still waiting on Damion to get back to me about some decisions in DataHarmonizer, and I wouldn't bother making changes to the sheets_and_friends deployment until DataHarmonizer is settled -- otherwise it'll just have to change again.

@mcovalt please also chime in, but I would think the best way to indicate that new templates are ready is to open a PR in nmdc-server where you've updated this commit hash to the new version: https://github.com/microbiomedata/nmdc-server/blob/main/web/yarn.lock#L9244.

mcovalt commented 2 years ago

> please also chime in, but I would think the best way to indicate that new templates are ready is to open a PR in nmdc-server where you've updated this commit hash to the new version: https://github.com/microbiomedata/nmdc-server/blob/main/web/yarn.lock#L9244

Yeah, that would be the way to go. There are likely ways this could be automated using GitHub tools, but that might be finicky to set up.

jeffbaumes commented 2 years ago

@mcovalt will attempt using this template locally with real data and ensure existing data loads properly with the column swap before deploying this change.

mcovalt commented 2 years ago

Testing this locally revealed the need to also swap existing columns in PostgreSQL. I'll follow up on Monday with the progress of that migration.

turbomam commented 2 years ago

Thanks @mcovalt !

I think I'm hearing "dependence on column ordering during data transport"

Will we be able to avoid that in the future?

mcovalt commented 2 years ago

The JSON we store represents the submission metadata as a list-of-lists. The second list defines the mapping for the subsequent lists. It's a bit like how a CSV header line defines the field names for subsequent lines.

In the frontend we must be loading the lists without regard to the mapping list. Now that I'm talking about this out loud, a better solution wouldn't have involved a data migration, but instead would be to fix this frontend behavior.
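A header-aware load could look roughly like the sketch below. It is written in Python for brevity rather than in the frontend's own language, and it assumes the layout described above: the second list holds the column names and every later list is a data row. The content of the first list, the sample values, and the function names are all assumptions for illustration.

```python
def rows_to_records(table, header_index=1):
    """Key each data row by the column names found in the mapping list."""
    header = table[header_index]
    return [dict(zip(header, row)) for row in table[header_index + 1:]]


def records_to_rows(records, column_order):
    """Lay records out in whatever column order the current template wants."""
    return [[rec.get(col, "") for col in column_order] for rec in records]


# Stored before the swap: source_mat_id first, samp_name second.
stored = [
    ["Sample ID", "", ""],  # assumed: a row of section labels
    ["source_mat_id", "samp_name", "collection_date"],  # the mapping list
    ["UUID:0001", "soil core A", "2022-07-29"],
    ["UUID:0002", "soil core B", "2022-07-30"],
]

# Column order expected by the template after the swap.
new_order = ["samp_name", "source_mat_id", "collection_date"]

records = rows_to_records(stored)
reordered = records_to_rows(records, new_order)
```

Because each value is looked up by name, a submission saved under the old column order still lands in the right columns after the swap, with no change to the stored JSON.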