vpchung closed this 8 months ago
This syncing script no longer needs to create the dataset folders, correct? I noticed that the manifest has two columns with synIDs, so I assumed these are the new Synapse folders created for the datasets.
I implemented a script that "mints" Synapse Ids for all entries in a manifest (it creates a folder, returns the syn id, deletes the folder, and adds the syn id to the <component>_id
column). As long as the folders aren't required for the portal, then we don't need to create them.
How should the script handle DatasetView_id when there are multiple synIDs listed? (The portal table's datasetId can only have one synID.) Currently, the script is exploding the column - would that be the appropriate approach?
If I understand correctly, datasetId in the Portal - Datasets Merged table is used as an identifier to build the URL for the dataset info page. To keep in line with that framework, we could take the first entry in DatasetView_id for each row and use that as the datasetId for the portal. I would prefer to have a single entry for each dataset, instead of redundant entries for each DatasetView_id, but I would be interested in thoughts on this from @aditigopalan and @aclayton555, too.
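For comparison, here is a minimal pandas sketch of the two options (the column contents and aliases are made up for illustration, and I'm assuming multiple synIDs are stored as a comma-separated string):

```python
import pandas as pd

# Hypothetical merged-table rows; DatasetView_id may hold several synIDs.
df = pd.DataFrame({
    "Dataset Alias": ["ds-A", "ds-B"],
    "DatasetView_id": ["syn111,syn222", "syn333"],
})

# Option 1: explode the column -> one (redundant) row per synID.
exploded = df.assign(
    DatasetView_id=df["DatasetView_id"].str.split(",")
).explode("DatasetView_id")

# Option 2: one row per dataset; the first synID becomes the portal datasetId.
first_only = df.assign(
    datasetId=df["DatasetView_id"].str.split(",").str[0]
)

print(len(exploded))                     # 3
print(first_only["datasetId"].tolist())  # ['syn111', 'syn333']
```

Option 2 keeps the table one-row-per-dataset, which matches the URL-building use of datasetId described above.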
I wanted to clarify the reason for the multiple DatasetView_id entries in some rows: since some datasets (and pubs and tools) are related to more than one grant, we split up the metadata by grant number before uploading to each grant-based Synapse project. When all of that metadata gets pulled together into the union table, there are multiple rows, each with a unique DatasetView_id entry, associated with a single Dataset Alias. Part of the union table processing is to merge rows with matching Dataset Alias values, resulting in some datasets having multiple entries in the DatasetView_id column. Prior to the database "Merged" table, DatasetView_id values are only used as primary keys and by schematic to determine which rows to update via upsert.
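The merge step described above can be sketched roughly like this in pandas (column names and values are illustrative, and I'm assuming the synIDs get joined into a comma-separated string during the merge):

```python
import pandas as pd

# Hypothetical per-grant rows: the same dataset (by Dataset Alias) appears
# once per grant, each copy carrying its own minted DatasetView_id.
union = pd.DataFrame({
    "Dataset Alias": ["ds-A", "ds-A", "ds-B"],
    "DatasetView_id": ["syn111", "syn222", "syn333"],
    "grantNumber": ["CA001", "CA002", "CA003"],
})

# Merge rows with matching Dataset Alias values, collecting the synIDs
# and grant numbers from each grant-specific copy.
merged = (
    union.groupby("Dataset Alias", as_index=False)
         .agg({"DatasetView_id": ",".join, "grantNumber": ",".join})
)

print(merged["DatasetView_id"].tolist())  # ['syn111,syn222', 'syn333']
```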
I noticed in https://github.com/mc2-center/mc2-center-dcc/pull/40 that checking for new publications to add has been removed - instead, we'll be re-adding all publications to the portal. This was an oversight on my part and I forgot to clarify in the PR, but in case this was an intentional move, should we update all the other syncing scripts to do the same?
That was intentional on my part, and I also forgot to clarify the optimal approach here with you. Since we plan to allow contributors to update their metadata at any time, I think it makes sense to ensure all metadata in the database is up-to-date with the processed union tables at each release. My inelegant approach here was to upload the entirety of the processed union table CSV at each release. I have two ideas for alternative approaches:
Sorry this reply is so long, but please let me know if I can clarify anything.
As long as the folders aren't required for the portal, then we don't need to create them.
OK great! I will go ahead and remove the commented out lines for folder creation then 👍🏼
To keep in line with that framework, we could take the first entry in DatasetView_id for each row and use that as the datasetId for the portal.
Ohh, I see now. I thought DatasetView_id contained synIDs that all mapped to different entities - however, if they all map to the same entity, then yes, taking just the first value makes a lot of sense. I will go ahead and do that instead.
Since we plan to allow contributors to update their metadata at anytime, I think it makes sense to ensure all metadata in the database is up-to-date with the processed union tables at each release.
Yes, that makes a lot of sense! Another option we could do is to approach it like how we do with the People portal table, in that with each sync, we remove all existing rows and re-add the latest manifest. Currently, the pubs syncing script appends to the portal table rather than replacing or updating. That is, with each sync, it adds n manifest rows on top of the existing portal rows, which is probably not what we want 😆
^^ For visuals, here's a dummy table I created where I ran the pubs script twice, where it now contains 6922 rows.
I see, thank you for helping me understand this! For some reason, I thought this call to store:
syn.store(Table(schema, table_rows))
would replace the table. That was my mistake!
Another option we could do is to approach it like how we do with the People portal table, in that with each sync, we remove all existing rows and re-add the latest manifest.
I like this approach! I get it now - from the sync_people script, it is quite a bit more involved to remove/update rows in a Synapse table.
No worries! The original Python docs weren't very clear on how to work with tables, but I believe Bryan is working on improving the tutorials. But generally, if you want to replace the table, you would have to remove all the rows first, e.g.
import synapseclient
from synapseclient import Table

syn = synapseclient.login()

# Get the current rows from the table, then delete them.
table_id = "syn123"
current_rows = syn.tableQuery(f"SELECT * FROM {table_id}")
syn.delete(current_rows)
# Do something with the new table.
...
# Convert the table (e.g., a pandas DataFrame) to a list of lists.
new_rows = some_table.values.tolist()
syn.store(Table(table_id, new_rows))
^^ In case you need to do it in another script! Otherwise, syn.store will append to the table. If you were to include the etag in syn.store(), then that will update the table.
I like this approach!
Great! I will track this effort in #44 and will implement the same for our other syncing scripts as well. Thanks, @Bankso !
@Bankso @aditigopalan syncing script has been updated to address the comments made earlier. Screenshots are available here.
Let me know if there's anything you'd like to change!
Changelog
- manifest_id → syn53478774
- output_csv, verbose

Preview
- --dryrun enabled
- --dryrun and --verbose enabled

Syncing (to a dummy table)
TODO
To be discussed
@Bankso @aditigopalan
How should the script handle DatasetView_id when there are multiple synIDs listed? (The portal table's datasetId can only have one synID.) Currently, the script is exploding the column - would that be the appropriate approach?