mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)

[fix]: Update datasets syncing script #41

Closed - vpchung closed this 8 months ago

vpchung commented 8 months ago

Changelog

Preview

TODO

To be discussed

@Bankso @aditigopalan

Bankso commented 8 months ago

> This syncing script no longer needs to create the dataset folders, correct? I noticed that the manifest has two columns with synIDs, so I assumed these are the new Synapse folders created for the datasets.

I implemented a script that "mints" Synapse IDs for all entries in a manifest (it creates a folder, returns the synID, deletes the folder, and adds the synID to the <component>_id column). As long as the folders aren't required for the portal, we don't need to create them.
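
For context, here's a minimal sketch of that minting step as described; the project ID, file path, and column names are placeholders, not the actual script:

import pandas as pd
import synapseclient
from synapseclient import Folder

syn = synapseclient.login()
manifest = pd.read_csv("dataset_manifest.csv")  # placeholder manifest path

def mint_syn_id(parent_project_id, name):
    # Create a temporary folder so Synapse allocates an ID, capture the ID,
    # then delete the folder since only the ID is needed.
    folder = syn.store(Folder(name=name, parent=parent_project_id))
    syn_id = folder.id
    syn.delete(folder)
    return syn_id

# Fill the component's *_id column for every row ("syn456" is a placeholder
# grant project and "Dataset Alias" a placeholder name column).
manifest["DatasetView_id"] = [
    mint_syn_id("syn456", alias) for alias in manifest["Dataset Alias"]
]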

> How should the script handle DatasetView_id when there are multiple synIDs listed? (The portal table's datasetId can only have one synID.) Currently, the script is exploding the column - would that be the appropriate approach?

If I understand correctly, datasetId in the Portal - Datasets Merged table is used as an identifier to build the URL for the dataset info page. To keep in line with that framework, we could take the first entry in DatasetView_id for each row and use that as the datasetId for the portal. I would prefer to have a single entry for each dataset, instead of redundant entries for each DatasetView_id, but I would be interested in thoughts on this from @aditigopalan and @aclayton555, too.
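
For illustration, a rough pandas sketch of both approaches; the comma-separated format, file path, and column names are assumptions on my part, not necessarily how the script stores things:

import pandas as pd

manifest = pd.read_csv("datasets_union.csv")  # placeholder path

# Option A: keep only the first synID per dataset as the portal datasetId.
manifest["datasetId"] = manifest["DatasetView_id"].str.split(",").str[0].str.strip()

# Option B (what the script currently does): explode into one row per synID.
exploded = manifest.assign(
    datasetId=manifest["DatasetView_id"].str.split(",")
).explode("datasetId")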

I wanted to clarify the reason for the multiple DatasetView_id entries in some rows: since some datasets (and pubs and tools) are related to more than one grant, we split up metadata by grant number before uploading to each grant-based Synapse project. When all of that metadata gets pulled together into the union table, there are multiple rows, each with unique DatasetView_id entries, associated with a single Dataset Alias. Part of the union table processing is to merge rows with matching Dataset Alias values, resulting in some datasets having multiple entries in the DatasetView_id column. Prior to the database "Merged" table, DatasetView_id values are only being used as primary keys and by schematic to determine which rows to update via upsert.
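
If it helps to picture that merge step, a very rough sketch (only the DatasetView_id column is shown; the real processing presumably handles the other columns too, and the file path here is a placeholder):

import pandas as pd

union_table = pd.read_csv("datasets_union_raw.csv")  # placeholder path

# Rows split by grant share a Dataset Alias; collapsing them joins their
# per-grant DatasetView_id values into one comma-separated entry.
merged = (
    union_table.groupby("Dataset Alias")["DatasetView_id"]
    .agg(",".join)
    .reset_index()
)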

> I noticed in https://github.com/mc2-center/mc2-center-dcc/pull/40 that checking for new publications to add has been removed - instead, we'll be re-adding all publications to the portal. This was an oversight on my part and I forgot to clarify in the PR, but in case this was an intentional move, should we update all the other syncing scripts to do the same?

That was intentional on my part and I also forgot to clarify with you on the optimal approach here. Since we plan to allow contributors to update their metadata at any time, I think it makes sense to ensure all metadata in the database is up to date with the processed union tables at each release. My non-elegant approach here was to upload the entirety of the processed union table CSV at each release. I have two ideas for alternative approaches:

  1. the script checks for differences between the incoming CSV and the current table info and updates rows/adds new entries accordingly (rough sketch of this below the list), OR
  2. we add a column to the metadata that notes the date of the most recent modification to the info in the row and use that to determine if the information should be synced to the portal. This would allow us to sync both new entries and updated entries, without running a bunch of comparisons.
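
As referenced above, a rough sketch of what option 1 could look like; the table ID, file path, and the choice to compare on all shared columns are placeholders/assumptions:

import pandas as pd
import synapseclient
from synapseclient import Table

syn = synapseclient.login()
portal_table_id = "syn123"  # placeholder portal table synID

incoming = pd.read_csv("datasets_union.csv")  # placeholder processed union table
current = syn.tableQuery(f"SELECT * FROM {portal_table_id}").asDataFrame()

# Anti-join: keep incoming rows that don't already appear verbatim in the
# portal table. (Updating rows that changed would also need a key-based match,
# e.g. on Dataset Alias.)
shared_cols = [c for c in incoming.columns if c in current.columns]
merged = incoming[shared_cols].merge(
    current[shared_cols].drop_duplicates(), how="left", indicator=True
)
to_add = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

if not to_add.empty:
    syn.store(Table(portal_table_id, to_add))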

Sorry this reply is so long, but please let me know if I can clarify anything.

vpchung commented 8 months ago

> As long as the folders aren't required for the portal, we don't need to create them.

OK great! I will go ahead and remove the commented-out lines for folder creation then 👍🏼

> To keep in line with that framework, we could take the first entry in DatasetView_id for each row and use that as the datasetId for the portal.

Ohh, I see now. I thought DatasetView_id contained synIDs that all mapped to different entities - however, if they all map to the same entity, then yes, taking just the first value makes a lot of sense. I will go ahead and do that instead.

> Since we plan to allow contributors to update their metadata at any time, I think it makes sense to ensure all metadata in the database is up to date with the processed union tables at each release.

Yes, that makes a lot of sense! Another option is to handle it the way we do the People portal table: with each sync, remove all existing rows and re-add the latest manifest. Currently, the pubs syncing script appends to the portal table rather than replacing or updating it. That is, with each sync, it adds n manifest rows on top of the existing portal rows, which is probably not what we want 😆

vpchung commented 8 months ago

^^ For visuals, here's a dummy table I created by running the pubs script twice; it now contains 6922 rows.

Bankso commented 8 months ago

> ^^ For visuals, here's a dummy table I created by running the pubs script twice; it now contains 6922 rows.

I see, thank you for helping me understand this! For some reason, I thought this call to store:

syn.store(Table(schema, table_rows))

would replace the table. That was my mistake!

Bankso commented 8 months ago

> Another option is to handle it the way we do the People portal table: with each sync, remove all existing rows and re-add the latest manifest.

I like this approach! I get it now - looking at the sync_people script, removing/updating rows in a Synapse table is quite a bit more involved.

vpchung commented 8 months ago

No worries! The original Python docs weren't very clear on how to work with tables, but I believe Bryan is working on improving the tutorials. Generally, if you want to replace a table, you have to remove all of its rows first, e.g.

import synapseclient
from synapseclient import Table

syn = synapseclient.login()

# Get the current rows from the table, then delete them to empty the table.
table_id = "syn123"
current_rows = syn.tableQuery(f"SELECT * FROM {table_id}")
syn.delete(current_rows)

# Build the replacement data, e.g. the latest manifest as a pandas DataFrame.
...

# Convert the DataFrame to a list of lists and store it in the now-empty table.
new_rows = some_table.values.tolist()
syn.store(Table(table_id, new_rows))

^^ In case you need to do it in another script! Otherwise, syn.store() will append to the table. If you include the etag in syn.store(), it will update the existing rows instead.
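
And in case it's useful, a rough sketch of that etag-based update; the WHERE clause and column names are just placeholders:

# Query the rows to change; the result carries the row IDs/versions and an etag.
results = syn.tableQuery(f"SELECT * FROM {table_id} WHERE grantNumber = 'CA123456'")
df = results.asDataFrame()

# Modify values in place.
df["publicationTitle"] = df["publicationTitle"].str.strip()

# Storing the modified rows with the query's etag updates those rows instead of
# appending new ones.
syn.store(Table(table_id, df, etag=results.etag))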

vpchung commented 8 months ago

> I like this approach!

Great! I will track this effort in #44 and will implement the same for our other syncing scripts as well. Thanks, @Bankso!

vpchung commented 8 months ago

@Bankso @aditigopalan The syncing script has been updated to address the comments above. Screenshots are available here.

Let me know if there's anything you'd like to change!