Closed theosanderson closed 9 months ago
/seq/[sequence_id].fa: sounds like a good idea! Maybe we can implement it as a redirect to LAPIS? .fa or .fasta?
I vote against abbreviations. They unnecessarily obfuscate the meaning to users who are not 100% familiar with the topic. Typing 5 additional characters is not that much effort. Most of the time, people won't write that by hand, but simply copy it from somewhere.
We can redirect /seq to /sequence if you like, but I would still keep the full name.
Is fa a common abbreviation for fasta?
/seq/[sequence_id] (shows latest version, but without redirect? One can also argue it should just redirect to the latest version.)
Do we really want that? Not having it would require a user to always fix the version, which implies links that really contain what you meant. It's the same as on GitHub: I almost always copy the fixed link including the commit hash, because two commits later, the file might not be there anymore.
I'm just asking to make sure. I'm not totally against having those links.
I am also generally against unnecessary abbreviation, but IMO seq is so widespread as to be clear.
Here are some screenshots from my Twitter:
Twitter gives you ~15 characters after the domain that will show up. For seq, depending on the prefix, we might plausibly fit without truncation, meaning that in sharing the URL people also immediately communicate the sequence ID (and probably similarly in a lot of different contexts, ~Slack~, etc.), and in e.g. emails you are taking up less space.
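To make the truncation argument concrete, here is a rough sketch. It assumes the ~15-character figure from the comment above counts everything after the domain, including the leading slash, and it uses PB_0001 as a made-up sequence ID; the real limit and ID format may differ.

```python
def preview(path: str, limit: int = 15) -> str:
    """Return the part of a URL path that stays visible under a character limit.

    The limit of 15 is the approximate figure quoted in the discussion,
    not a documented Twitter constant.
    """
    if len(path) <= limit:
        return path
    # Anything beyond the limit is cut off and replaced with an ellipsis.
    return path[:limit] + "..."


# Hypothetical ID "PB_0001": the short prefix keeps the whole ID visible.
short_form = preview("/seq/PB_0001")
long_form = preview("/sequences/PB_0001")
```

With these assumptions the /seq/ form keeps the full ID visible, while the /sequences/ form loses the end of it.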
On sequence entry URLs: yes, exactly I think one has to think about what the use case is. I think at minimum it's definitely useful to support a URL for a sequence entry as a whole. ~Almost always as a reader I want to be seeing the most recent version of a sequence, because it's the best one. I think typically as a poster of links I also want readers to be seeing the most recent version e.g. say I discover an important new COVID sequence and I post a link to it, but then I discover I made a typo in the Location and fix it. I want people using my URL to see the fixed sequence.~
~medRxiv do the redirect thing so medrxiv.org/content/10.1101/2023.01.26.23284998 will redirect to medrxiv.org/content/10.1101/2023.01.26.23284998v4. I think that's a legit approach, but it often annoys me because I forget to remove the v4 when posting a link and then months later people end up (accidentally) reading the old versions of my preprints which isn't something I want. I think the use case where one wants to ensure someone sees an older copy of a sequence is similarly pretty rare.~ [Edit strikethrough text remains true, but ultimately I end up in favour of versioned URLs as discussed below]
I dunno though, I'm reflecting, and while that's what I want when I'm posting links, in code where I'm pointing to a .fa I probably want things to be very consistent. So maybe the redirect option is the best thing. (I.e. that means that the only URLs you see will be versioned - unless someone has specifically decided to deversion them)
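A minimal sketch of that redirect option, assuming a /seq/<id>.<version> scheme with an optional .fa suffix. The function name, the `latest` lookup table and the ID PB_0001 are all hypothetical; a real implementation would query the backend for the latest version.

```python
from typing import Mapping, Optional


def redirect_target(path: str, latest: Mapping[str, int]) -> Optional[str]:
    """Return the versioned URL an unversioned path should redirect to.

    Returns None when no redirect applies (path is outside /seq/, is
    already versioned, or names an unknown ID).
    """
    prefix = "/seq/"
    if not path.startswith(prefix):
        return None
    rest = path[len(prefix):]
    # Split off an optional ".fa" suffix so FASTA links are handled the same way.
    suffix = ".fa" if rest.endswith(".fa") else ""
    if suffix:
        rest = rest[: -len(suffix)]
    if "." in rest:
        return None  # already carries a version, serve it as-is
    if rest not in latest:
        return None  # unknown ID, let the caller answer with a 404
    return f"{prefix}{rest}.{latest[rest]}{suffix}"


# Hypothetical data: PB_0001 currently has three versions.
latest_versions = {"PB_0001": 3}
```

Under these assumptions, /seq/PB_0001 would redirect to /seq/PB_0001.3, so the only URLs a user copies from the address bar are versioned ones.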
I searched github for "sequence.fa" and "sequence.fasta" and fa seemed a bit more common (but hard to know how effective that method is), which chimes with my feeling - but I really don't mind.
Maybe we can offer a "share" button that lets you copy both links?
host.com/seq/<id>
host.com/seq/<id>.<version>
And those links then redirect to `sequences/
I think a significant majority of people are most likely to share by copying the URL. I would be very happy with that if they both ended up at /seq/<id>.<version>.
I had to learn it, but I now know the pages where I should not share links by copying the URL (including Confluence and GitHub). Maybe this is also something we want to teach users?
People who value short links are likely to be able to learn that, same as people who are aware that the content behind the link might change. For all others, it probably doesn't matter and we can use the long, non-obfuscated link name.
The idea of having a separate sharing URL that is short would be lower in my preferences than everyone using the long version for everything, because of consistency and predictability (e.g. if I want to find the comment on Slack where I linked to a particular sequence to see what I said about it).
We have regular meetings with a group of researchers who know about Pathoplexus so maybe we just poll them for 30 seconds in the next meeting on /seq/ vs /sequence/ vs /sequences/?
Yes, good idea, let's do that in our meeting in 20 minutes
(ah I was thinking PHA4GE, but yes if there's enough people here that could work too)
Ah, yes, right! We can get feedback from both groups?
This is the current result of the poll of PHA4GE (bioinformatics people, IMO relatively close to our target audience).
Also @rneher suggested an approach with /isolate/ /nucleotide/ /protein/, inspired by the multi-segment case. I suggest we briefly postpone the URL question until we have full-segmentation implemented (i.e. very soon) and so we know exactly what we're assigning URLs for.
(I voted for seq in the Slack poll, but I don't have super strong feelings between seq and sequence, with a minor preference for seq, mostly for the shorter URL as Theo points out above. I do think redirects between the two are a good idea so that if people can't remember or want to use a short one for URL reasons, both work.)
I would agree with @theosanderson's comment above about how it can be annoying to share a URL on bioRxiv etc. which then ends up taking people to an old version forever. I generally think that the most recent version of a sequence is the most useful and this is probably what most people want to end up on, unless explicitly redirected. If I'm remembering right, not specifying a version on GenBank takes you to the most recent version, so that may also have precedent. I think I'd lean towards this (default no version, which can be specified via URL) gently but I could be convinced otherwise.
Segmented viruses do change the game a little. I guess to be more correct, isolate should be used for everything and then for segmented viruses the sequences can differ per segment...?
However, I'll be honest, I don't like that much in my gut... (mostly that then for all other viruses you're using isolate instead)
Otherwise we could have isolate for segmented viruses and seq for not, or use seq for all and make something else to distinguish the segments of a segmented virus?
I think the segmented thing will be easier to figure out once we've got it implemented - I think a Q is to what extent we actually end up needing individual pages per segment, or at least how prominent those are. I think GenBank came from a single-sequence-centric model whereas we plan to have a more isolate-centric model. (A key distinction is that all our IDs are based on the isolate ID).
I could imagine that in file terms, for segmented viruses:
- /seq/PB_xxxx.fa would return a multi-sequence fasta with all segments (i.e. the isolate)
- /seq/PB_xxxx_HA.fa would return the HA segment
You could have similar HTML pages, but in practice the isolate html page might almost be all you need if it links directly to the fastas per-segment.
I tend to think this is OK with the seq text. Sequences are involved in all cases (seq isn't explicitly singular or plural).
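The per-segment file idea above (/seq/PB_xxxx.fa for the whole isolate, /seq/PB_xxxx_HA.fa for one segment) could look roughly like this. It is only a sketch: the function name, the `<isolate>_<segment>` header format and the toy sequences are assumptions, not anything decided in this thread.

```python
from typing import Mapping, Optional


def fasta_for(
    isolate_id: str,
    segments: Mapping[str, str],
    segment: Optional[str] = None,
) -> str:
    """Build FASTA text for a whole isolate, or for a single named segment.

    With segment=None every segment is emitted (multi-sequence FASTA);
    otherwise only the requested segment is returned.
    """
    if segment is not None:
        if segment not in segments:
            raise KeyError(f"{isolate_id} has no segment {segment!r}")
        names = [segment]
    else:
        names = list(segments)  # keeps the mapping's own ordering
    # One record per segment: header line, then the sequence on one line.
    return "".join(f">{isolate_id}_{name}\n{segments[name]}\n" for name in names)


# Toy data with a made-up ID and very short placeholder sequences.
toy_segments = {"HA": "ATG", "NA": "CCC"}
whole_isolate = fasta_for("PB_0001", toy_segments)
ha_only = fasta_for("PB_0001", toy_segments, "HA")
```

Here `whole_isolate` would back the isolate-level .fa URL and `ha_only` the per-segment one, which fits the point that the isolate HTML page could simply link to the per-segment files.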
We conducted a poll of users and there was a 7:1 vote in favour of seq.
Then let's go for seq! I removed the discussion label and moved the issue to prioritized.
Edit (Chaoran):
Decision: Let's change the URL to /seq
It's great that we are centralising URLs (https://github.com/pathoplexus/pathoplexus/pull/428). But still we are unlikely to want to update URLs after launch so it's good to think through now what the best URL scheme is. Ideally URLs should be short and meaningful.
My proposal would be:
- /seq/[sequence_id] (shows latest version, but without redirect? One can also argue it should just redirect to the latest version.)
- /seq/[sequence_id].version (shows specific version)
- /seq/[sequence_id].fa or /seq/[sequence_id].[version].fa provide Fasta output. (I think this would be really nice to have and shouldn't be hard to implement)
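To check that the four URL shapes in the proposal can be told apart unambiguously, here is a small parsing sketch. It assumes IDs contain only letters, digits and underscores and that versions are plain integers; the names `SEQ_PATH`, `SeqRequest` and `parse_seq_path` are made up for illustration and are not part of the Pathoplexus codebase.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for the proposed scheme; the ID alphabet is an assumption.
SEQ_PATH = re.compile(
    r"/seq/(?P<id>[A-Za-z0-9_]+)"  # sequence ID
    r"(?:\.(?P<version>\d+))?"     # optional ".<version>"
    r"(?P<fasta>\.fa)?"            # optional ".fa" suffix for FASTA output
)


class SeqRequest(NamedTuple):
    sequence_id: str
    version: Optional[int]  # None means "latest version"
    fasta: bool             # True when FASTA output was requested


def parse_seq_path(path: str) -> Optional[SeqRequest]:
    """Parse a /seq/... path, returning None if it does not fit the scheme."""
    match = SEQ_PATH.fullmatch(path)
    if match is None:
        return None
    raw_version = match.group("version")
    version = int(raw_version) if raw_version is not None else None
    return SeqRequest(match.group("id"), version, match.group("fasta") is not None)
```

Because the version is numeric and the FASTA suffix is alphabetic, /seq/<id>.2 and /seq/<id>.fa cannot be confused under these assumptions. That would stop holding if IDs themselves were allowed to contain dots.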