greenelab / lab-website-template

An easy-to-use, flexible website template for labs.
https://greenelab.github.io/lab-website-template/
BSD 3-Clause "New" or "Revised" License

`cite.py` field merging overwrites values that need to be merged #259

Closed by andrewsu 6 months ago

andrewsu commented 7 months ago

Link to your website repo

https://github.com/andrewsu/sulab.org

Version of Lab Website Template you are using

1.2.1

Description

My _data/orcid.yaml file looks like this:

- orcid: 0000-0002-9859-4104
  member: andrew-su
- orcid: 0000-0002-7792-0150
  member: mike-mayers

There are three IDs that are in common between the two ORCID profiles above: doi:10.1093/bioinformatics/btac205, doi:10.1186/s12859-019-3297-0, and doi:10.1021/acs.jproteome.6b00938. However, with the orcid.yaml file above, those three publications only appear on the member page for mike-mayers, not for andrew-su. If I reverse the order of orcid IDs in orcid.yaml, then those three pubs only appear on the member page for andrew-su, not mike-mayers. If I manually edit the citations.yaml file to add two member fields (see example below), then the pub properly appears on both member pages.

- id: doi:10.1093/bioinformatics/btac205
  title: Design and application of a knowledge network for automatic prioritization
    of drug mechanisms
  authors:
  - Michael Mayers
  - Roger Tu
  - Dylan Steinecke
  - Tong Shu Li
  - "N\xFAria Queralt-Rosinach"
  - Andrew I Su
  publisher: Bioinformatics
  date: '2022-04-06'
  link: https://doi.org/gptwsz
  orcid: 0000-0002-7792-0150
  member: mike-mayers
  member: andrew-su
  plugin: orcid.py
  file: orcid.yaml

I think there is an enhancement that could be made in cite.py, but on a quick scan I haven't quite figured out the logic flow there yet.

vincerubinetti commented 7 months ago

Okay, so this is happening because whichever source comes last overwrites the previous ones: they all use the same field names, and you can't have duplicate keys in YAML/JSON/etc. The line to look at is here: https://github.com/greenelab/lab-website-template/blob/main/_cite/cite.py#L158
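To make the overwrite concrete, here is a minimal sketch, assuming entries with the same id are combined with a plain dict update. The name `merge_sources` and the sample data layout are hypothetical and only illustrate the behavior; this is not the actual cite.py code.

```python
# Hypothetical stand-in for the merge step; NOT the real cite.py code.
def merge_sources(sources):
    """Combine source entries by id; later entries overwrite earlier fields."""
    merged = {}
    for source in sources:
        entry = merged.setdefault(source["id"], {})
        entry.update(source)  # same key => the last value wins
    return list(merged.values())


# Two ORCID profiles that both list the same DOI (order as in orcid.yaml).
sources = [
    {
        "id": "doi:10.1093/bioinformatics/btac205",
        "orcid": "0000-0002-9859-4104",
        "member": "andrew-su",
    },
    {
        "id": "doi:10.1093/bioinformatics/btac205",
        "orcid": "0000-0002-7792-0150",
        "member": "mike-mayers",
    },
]

citations = merge_sources(sources)
print(citations[0]["member"])  # prints "mike-mayers"
```

Reversing the order of `sources` would leave `andrew-su` as the only member, which matches the behavior reported above.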

I would say this is a limitation/flaw of the current design. I'm not sure how to solve it at the moment, in a way that is flexible and consistent.


Let's say that when cite.py encounters a field that's already been set, instead of overwriting it, we "merge" the values into an array or comma-separated list. So in your case you'd end up with:

- id: doi:10.1093/bioinformatics/btac205
  member: 
    - mike-mayers
    - andrew-su
  # OR
  member: mike-mayers, andrew-su

That's fine, but it'd make cite.py more complex. And getting each property in the citation component becomes more complex (might need a new Ruby filter under /_plugins). Moreover, which array item do you display when the field in question is date or publisher? I can envision a future user complaining that the wrong one is selected, and asking for a way to choose, adding more complexity. Maybe things like date are cases where you really want to just overwrite, and whichever comes last wins.

At least the list filter logic wouldn't need to change... I think an array will just get coerced to a string, so you could filter for "andrew-su" and it'd be searching member fields that look like ["mike-mayers", "andrew-su"].
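The "merge" idea could be sketched roughly like this, assuming only an explicit set of fields accumulates into a list while everything else (date, publisher, etc.) keeps last-wins behavior. `MERGE_KEYS` and `merge_entry` are hypothetical names for illustration; nothing like this exists in cite.py.

```python
# Hypothetical helper; MERGE_KEYS and merge_entry are illustrative names only.
MERGE_KEYS = {"member"}  # assumption: fields that should accumulate values


def merge_entry(entry, source):
    """Merge source into entry: accumulate MERGE_KEYS, overwrite the rest."""
    for key, value in source.items():
        if key in MERGE_KEYS and key in entry:
            # Promote the existing value to a list, then append without duplicates.
            existing = entry[key] if isinstance(entry[key], list) else [entry[key]]
            if value not in existing:
                existing.append(value)
            entry[key] = existing
        else:
            entry[key] = value  # last value wins, e.g. for date or publisher
    return entry


entry = {}
merge_entry(entry, {"id": "doi:10.1093/bioinformatics/btac205",
                    "member": "mike-mayers", "date": "2022-04-06"})
merge_entry(entry, {"id": "doi:10.1093/bioinformatics/btac205",
                    "member": "andrew-su", "date": "2022-04-06"})
print(entry["member"])  # prints ['mike-mayers', 'andrew-su']
```

Even in this small form, someone has to decide which keys belong in `MERGE_KEYS`, which is the extra complexity being weighed here.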


What you could do for now is this:

- orcid: 0000-0002-9859-4104
  andrew-su: ""
- orcid: 0000-0002-7792-0150
  mike-mayers: ""

Then your list component filters would look like {% include list.html data="citations" component="citation" filters="andrew-su: .*" %} (shows all citations that have the andrew-su field set to anything).
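As a rough model of why this works, here is a sketch that assumes a filter of the form `field: regex` matches an item only when the field exists and its value, converted to a string, matches the regex. The template's real filtering is not done in Python, so `matches` is a hypothetical stand-in that shows the logic only.

```python
import re


# Hypothetical model of a single "field: regex" filter; not the template's code.
def matches(item, filter_string):
    """Return True if item has the field and its string value matches the regex."""
    field, _, pattern = filter_string.partition(":")
    field, pattern = field.strip(), pattern.strip()
    if field not in item:
        return False  # field absent => item is filtered out
    return re.search(pattern, str(item[field])) is not None


citations = [
    {"id": "a", "andrew-su": ""},
    {"id": "b", "mike-mayers": ""},
    {"id": "c", "andrew-su": "", "mike-mayers": ""},
]

shown = [c["id"] for c in citations if matches(c, "andrew-su: .*")]
print(shown)  # prints ['a', 'c']
```

Because each member gets a distinct field name, nothing is overwritten when the same publication appears under several ORCID profiles.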

Although it looks weird, this might end up being my official recommendation (which I'd put in the docs), because it sidesteps all the complexities... Need to think about this more.

vincerubinetti commented 7 months ago

Also I think you can alternatively use the authors field to filter, because as mentioned above, I think an array just becomes concatenated into a single string that is searched like normal.

Downsides:

andrewsu commented 7 months ago

I used the strategy you suggested in https://github.com/greenelab/lab-website-template/issues/259#issuecomment-2079755048 -- seemed most conceptually simple and worked exactly as expected. So now this is what my citations.yaml file looks like for the example above:

- id: doi:10.1093/bioinformatics/btac205
  title: Design and application of a knowledge network for automatic prioritization
    of drug mechanisms
  authors:
  - Michael Mayers
  - Roger Tu
  - Dylan Steinecke
  - Tong Shu Li
  - "N\xFAria Queralt-Rosinach"
  - Andrew I Su
  publisher: Bioinformatics
  date: '2022-04-06'
  link: https://doi.org/gptwsz
  orcid: 0000-0002-9859-4104
  mike-mayers-list: true
  plugin: orcid.py
  file: orcid.yaml
  roger-tu-list: true
  andrew-su-list: true

The only thing to note is that the orcid field is based on the last-searched ID in orcid.yaml -- just noting that in case others will be using that in a list filter (as I will be)... Thanks!

vincerubinetti commented 6 months ago

I think the solution you went with above is the most expected behavior and least error prone. I've added it to the documentation here: https://github.com/greenelab/lab-website-template-docs/commit/3c8358969502dadec269181b88283e581b658bef and https://github.com/greenelab/lab-website-template-docs/commit/7cdf293fe7b4d91e893dcb091ab0f7ff6ffde2ca