pulibrary / pdc_discovery

Princeton Data Commons discovery portal for Research Data

Bug: Recent dataset list on the main page is not reflective of new dataset #542

Closed astrochun closed 7 months ago

astrochun commented 9 months ago

UPDATE: Sounds like we don't yet have quite the right definition of what should be on the Recently Published feed. Working with @astrochun and @matthewjchandler to figure out what that should be.

On Friday, December 8, we published a new dataset in PDC:

However, on the main Discovery page this does not show up under "Recently published".

@hectorcorrea points out that the sorting appears to be done by year only, so it does not capture the proper order among datasets published in the same year.

[Screenshot, 2023-12-11 11:36:06 AM]

Acceptance criteria

hectorcorrea commented 9 months ago

For reference: there were 17 works published this year (2023) and we only show 5 in the recent works page. The sort is done by year so there is no easy way to determine which one is more recent of those 17 works.
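To illustrate the limitation, here is a minimal sketch with made-up titles (the `year` field name is an assumption, not the actual index schema). When the year is the only sort key, every 2023 work compares equal, so the order among them reflects input order rather than recency.

```python
# Hypothetical records: only a publication year is available as the sort key.
works = [
    {"title": "Work A", "year": 2023},
    {"title": "Work B", "year": 2022},
    {"title": "Work C", "year": 2023},
    {"title": "Work D", "year": 2023},
]

# Python's sort is stable, so works sharing a year keep their input order;
# nothing in the key says which 2023 work is actually the newest.
by_year = sorted(works, key=lambda w: w["year"], reverse=True)
titles = [w["title"] for w in by_year]
print(titles)
```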

astrochun commented 9 months ago

Is there a published metadata date that we can use?

bess commented 8 months ago

We can use either created_at or updated_at, which exist on the database record but not in the DataCite record. For the referenced dataset, these are:

 created_at: Fri, 17 Nov 2023 14:10:06.509405000 EST -05:00,
 updated_at: Fri, 08 Dec 2023 11:41:34.649789000 EST -05:00,

I'm going to index them both so we can try out which of these works better in the UI.
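For comparison with other indexed dates, it can help to normalize such timestamps to UTC. A minimal sketch, assuming the index stores ISO 8601 UTC strings (the `to_index_string` helper is hypothetical, and fractional seconds are dropped for brevity):

```python
from datetime import datetime, timedelta, timezone

# Fixed -05:00 offset, matching the EST timestamps on the database record.
EST = timezone(timedelta(hours=-5))

def to_index_string(ts: datetime) -> str:
    """Convert an aware datetime to a UTC ISO 8601 string with a trailing Z."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

created_at = datetime(2023, 11, 17, 14, 10, 6, tzinfo=EST)
updated_at = datetime(2023, 12, 8, 11, 41, 34, tzinfo=EST)

created_str = to_index_string(created_at)
updated_str = to_index_string(updated_at)
print(created_str)
print(updated_str)
```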

astrochun commented 8 months ago

Great. I suspect the created date (created_at) makes the most sense.

leefaisonr commented 7 months ago

[Screenshot, 2024-01-17 4:11:07 PM]

leefaisonr commented 7 months ago

Done

bess commented 7 months ago

@astrochun asked us to look into why the dataset he mentioned isn't on the front page. The dataset in question is https://datacommons.princeton.edu/describe/works/201, and its created_at date is 17 Nov 2023. On 18 Jan 2024, the datasets in the Recently Added list on the front page of PDC Discovery production had these dates for when they were added to PDC Describe:

["2024-01-08T11:51:38Z",
 "2023-12-22T14:00:38Z",
 "2023-12-22T13:47:37Z",
 "2023-12-22T13:43:37Z",
 "2023-12-22T13:35:55Z",
 "2023-12-22T13:26:33Z",
 "2023-12-22T13:20:44Z",
 "2023-12-22T12:37:29Z",
 "2023-12-22T12:24:52Z",
 "2023-12-22T12:13:31Z"]

All of which are more recent than 17 Nov 2023. Do we instead need some combination of "publication date" and "created_at"?
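A quick way to check that claim, as a sketch using the dates listed above. The cutoff is the created_at value from the earlier comment (14:10:06 -05:00) converted to UTC; the `parse_utc` helper is hypothetical.

```python
from datetime import datetime, timezone

recently_added = [
    "2024-01-08T11:51:38Z",
    "2023-12-22T14:00:38Z",
    "2023-12-22T13:47:37Z",
    "2023-12-22T13:43:37Z",
    "2023-12-22T13:35:55Z",
    "2023-12-22T13:26:33Z",
    "2023-12-22T13:20:44Z",
    "2023-12-22T12:37:29Z",
    "2023-12-22T12:24:52Z",
    "2023-12-22T12:13:31Z",
]

def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# created_at of the dataset in question, expressed in UTC.
cutoff = parse_utc("2023-11-17T19:10:06Z")

all_newer = all(parse_utc(s) > cutoff for s in recently_added)
oldest_listed = min(recently_added, key=parse_utc)
print(all_newer, oldest_listed)
```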

astrochun commented 7 months ago

All of which are more recent than 17 Nov 2023. Do we instead need some combination of "publication date" and "date added"?

@bess perhaps. The issue is that we did not migrate the data in chronological order and had to publish new datasets. A lot of those aren't recent datasets but from a few years back. I know this is a challenging one to fix since the metadata is a bit limited. If we can filter out those that have a publication date on or before 2022, that should capture more of the recent datasets. I think once we have more datasets, this will resolve itself.

bess commented 7 months ago

Another idea: Maybe we exclude anything that was migrated?

matthewjchandler commented 7 months ago

Here's my two cents: "recently published" should sort in reverse-chronological order by the date of first issue (not update/edit, and not migration); and once we get past the migration phase, I don't expect much confusion about what recently went into PDC vs. what was recently published for the first time.

bess commented 7 months ago

I talked to @matthewjchandler on Slack, and after discussion he now agrees that we should sort by the PDC created_at timestamp (since the date of issue is not granular enough for meaningful sorting) but exclude migrated works.
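A minimal sketch of that rule, assuming each work carries a `created_at` timestamp and a boolean `migrated` flag. Both field names, the `recently_added` helper, and the sample records (including which ones count as migrated) are hypothetical, not the actual PDC Discovery index schema.

```python
from datetime import datetime

def recently_added(works, limit=5):
    """Return the newest non-migrated works, most recent first."""
    # Drop anything that entered PDC through the migration.
    fresh = [w for w in works if not w["migrated"]]
    # Sort by the full timestamp, which is granular enough to break ties within a year.
    fresh.sort(key=lambda w: datetime.fromisoformat(w["created_at"]), reverse=True)
    return fresh[:limit]

# Made-up sample data for illustration only.
works = [
    {"id": 1, "created_at": "2023-12-22T14:00:38+00:00", "migrated": True},
    {"id": 2, "created_at": "2023-11-17T19:10:06+00:00", "migrated": False},
    {"id": 3, "created_at": "2024-01-08T11:51:38+00:00", "migrated": False},
]

ids = [w["id"] for w in recently_added(works)]
print(ids)
```

With the migrated record excluded, the work created on 17 Nov 2023 is no longer pushed off the list by older datasets that were merely added later.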

leefaisonr commented 7 months ago

[Screenshot, 2024-01-19 1:39:00 PM]