Sequence URLs - Githubissues

theosanderson commented 1 year ago

Edit (Chaoran):

Decision: Let's change the URL to /seq

It's great that we are centralising URLs (https://github.com/pathoplexus/pathoplexus/pull/428). But still we are unlikely to want to update URLs after launch so it's good to think through now what the best URL scheme is. Ideally URLs should be short and meaningful.

My proposal would be:

/seq/[sequence_id] (shows latest version, but without redirect? One can also argue it should just redirect to the latest version.)
/seq/[sequence_id].[version] (shows specific version)
/seq/[sequence_id].fa or /seq/[sequence_id].[version].fa provide Fasta output. (I think this would be really nice to have and shouldn't be hard to implement)
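A minimal sketch of how the three URL forms above could be told apart. The ID character set, the purely numeric version, and the names `SEQ_PATH`, `SeqRoute` and `parse_seq_path` are all assumptions made for illustration, not anything decided in this issue.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: assumes IDs are letters, digits and underscores,
# and that versions are plain integers.
SEQ_PATH = re.compile(
    r"/seq/(?P<id>[A-Za-z0-9_]+)"   # sequence ID
    r"(?:\.(?P<version>\d+))?"      # optional ".<version>"
    r"(?P<fasta>\.fa)?"             # optional ".fa" suffix for Fasta output
)


class SeqRoute(NamedTuple):
    sequence_id: str
    version: Optional[int]  # None means "latest version"
    fasta: bool             # True when the URL asks for Fasta output


def parse_seq_path(path: str) -> Optional[SeqRoute]:
    """Parse a /seq/ path; return None if it does not match the scheme."""
    match = SEQ_PATH.fullmatch(path)
    if match is None:
        return None
    raw_version = match.group("version")
    version = int(raw_version) if raw_version is not None else None
    return SeqRoute(match.group("id"), version, match.group("fasta") is not None)
```

With this sketch, `/seq/PB_0001` parses as "latest version, HTML", `/seq/PB_0001.2` as version 2, and `/seq/PB_0001.2.fa` as the Fasta output of version 2.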

chaoran-chen commented 1 year ago

/seq/[sequence_id].fa: sounds like a good idea! Maybe we can implement it as a redirect to LAPIS?

.fa or .fasta?

fengelniederhammer commented 1 year ago

I vote against abbreviations. They unnecessarily obfuscate the meaning to users who are not 100% familiar with the topic. Typing 5 additional characters is not that much effort. Most of the time, people won't write that by hand, but simply copy it from somewhere.

We can redirect /seq to /sequence if you like, but I would still keep the full name.

Is fa a common abbreviation for fasta?

fengelniederhammer commented 1 year ago

/seq/[sequence_id] (shows latest version, but without redirect? One can also argue it should just redirect to the latest version.)

Do we really want that? Not having it would require a user to always fix the version, which means links always point to exactly what you meant. It's the same as on GitHub: I almost always copy the fixed link including the commit hash, because two commits later, the file might not be there anymore.

I'm just asking to make sure. I'm not totally against having those links.

theosanderson commented 1 year ago

I am also generally against unnecessary abbreviation, but IMO seq is so widespread as to be clear. Here are some screenshots from my Twitter:

Twitter gives you ~15 characters after the domain that will show up. With seq, depending on the prefix, the URL might plausibly fit without truncation, meaning that in sharing the URL people also immediately communicate the sequence ID. (It's probably similar in a lot of different contexts, ~Slack~, etc., and in e.g. emails you are taking up less space.)

On sequence entry URLs: yes, exactly I think one has to think about what the use case is. I think at minimum it's definitely useful to support a URL for a sequence entry as a whole. ~Almost always as a reader I want to be seeing the most recent version of a sequence, because it's the best one. I think typically as a poster of links I also want readers to be seeing the most recent version e.g. say I discover an important new COVID sequence and I post a link to it, but then I discover I made a typo in the Location and fix it. I want people using my URL to see the fixed sequence.~

~medRxiv do the redirect thing so medrxiv.org/content/10.1101/2023.01.26.23284998 will redirect to medrxiv.org/content/10.1101/2023.01.26.23284998v4. I think that's a legit approach, but it often annoys me because I forget to remove the v4 when posting a link and then months later people end up (accidentally) reading the old versions of my preprints which isn't something I want. I think the use case where one wants to ensure someone sees an older copy of a sequence is similarly pretty rare.~ [Edit: strikethrough text remains true, but ultimately I end up in favour of versioned URLs as discussed below]

theosanderson commented 1 year ago

I dunno though, I'm reflecting, and while that's what I want when I'm posting links, in code where I'm pointing to a .fa I probably want things to be very consistent. So maybe the redirect option is the best thing. (I.e. that means that the only URLs you see will be versioned - unless someone has specifically decided to deversion them)

theosanderson commented 1 year ago

I searched github for "sequence.fa" and "sequence.fasta" and fa seemed a bit more common (but hard to know how effective that method is), which chimes with my feeling - but I really don't mind.

fengelniederhammer commented 1 year ago

Maybe we can offer a "share" button that lets you copy both links?

"latest sequence version": host.com/seq/<id>
"current sequence version": host.com/seq/<id>.<version>

And those links then redirect to `sequences/.?

theosanderson commented 1 year ago

I think a significant majority of people are most likely to share by copying the URL. I would be very happy with that if they both ended up at /seq/<id>.<version>.
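A rough sketch of the "unversioned URL redirects to the versioned one" behaviour, so that the address people copy from the browser is always versioned. The function name `resolve_redirect` and the `latest_versions` lookup table are hypothetical stand-ins for whatever the backend would actually provide, and the sketch assumes IDs never contain a dot.

```python
from typing import Mapping, Optional


def resolve_redirect(path: str, latest_versions: Mapping[str, int]) -> Optional[str]:
    """Return the versioned URL to redirect to, or None if no redirect applies."""
    prefix = "/seq/"
    if not path.startswith(prefix):
        return None
    rest = path[len(prefix):]
    # Anything containing a dot is treated as already versioned (or as a
    # file request such as ".fa") and is left alone in this sketch.
    if not rest or "." in rest:
        return None
    latest = latest_versions.get(rest)
    if latest is None:
        return None  # unknown ID: let the normal 404 handling deal with it
    return f"{prefix}{rest}.{latest}"
```

For example, with `{"PB_0001": 3}` as the lookup table, `/seq/PB_0001` would redirect to `/seq/PB_0001.3`, while `/seq/PB_0001.2` would be served as-is.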

fengelniederhammer commented 1 year ago

I had to learn it, but I now know the pages where I should not share links by copying the URL (including Confluence and GitHub). Maybe this is also something we want to teach users?

People who value short links are likely to be able to learn that, same as people who are aware that the content behind the link might change. For all others, it probably doesn't matter and we can use the long, non-obfuscated linkname.

theosanderson commented 1 year ago

The idea of having a separate sharing URL that is short would be lower in my preferences than everyone using the long version for everything, because of consistency and predictability (e.g. if I want to find the comment on slack where I linked to a particular sequence to see what I said about it)

theosanderson commented 1 year ago

We have regular meetings with a group of researchers who know about Pathoplexus so maybe we just poll them for 30 seconds in the next meeting on /seq/ vs /sequence/ vs /sequences/?

chaoran-chen commented 1 year ago

Yes, good idea, let's do that in our meeting in 20 minutes

theosanderson commented 1 year ago

(ah I was thinking PHA4GE, but yes if there's enough people here that could work too)

chaoran-chen commented 1 year ago

Ah, yes, right! We can get feedback from both groups?

theosanderson commented 1 year ago

This is the current result of the poll of PHA4GE (bioinformatics people, IMO relatively close to our target audience).

Also @rneher suggested an approach with /isolate/ /nucleotide/ /protein/, inspired by the multi-segment case. I suggest we briefly postpone the URL question until we have full-segmentation implemented (i.e. very soon) and so we know exactly what we're assigning URLs for

emmahodcroft commented 1 year ago

(I voted for seq in the Slack poll, but I don't have super strong feelings between seq and sequence, with a minor preference for seq mostly for the shorter URL, as Theo points out above. I do think redirects between the two are a good idea so that if people can't remember or want to use a short one for URL reasons, both work.)

I would agree with @theosanderson's comment above about how it can be annoying to share a URL on bioRxiv etc which ends up then taking people to an old version forever. I generally think that the most recent version of a sequence is the most useful and this is probably what most people want to end up on, unless explicitly redirected. If I'm remembering right, not specifying a version on Genbank takes you to the most recent version, so that may also have precedent. I think I'd lean towards this (default no version, which can be specified via URL) gently but I could be convinced otherwise.

emmahodcroft commented 1 year ago

Segmented viruses do change the game a little. I guess to be more correct, isolate should be used for everything, and then for segmented viruses the sequences can differ per segment...? However, I'll be honest, I don't like that much in my gut... (mostly because then for all other viruses you're using isolate instead)

Otherwise we could have isolate for segmented viruses and seq for not, or use seq for all and make something else to distinguish the segments of a segmented virus?

theosanderson commented 1 year ago

I think the segmented thing will be easier to figure out once we've got it implemented - I think one question is to what extent we actually end up needing individual pages per segment, or at least how prominent those are. I think GenBank came from a single-sequence-centric model whereas we plan to have a more isolate-centric model. (A key distinction is that all our IDs are based on the isolate ID).

I could imagine that in file terms, for segmented viruses:

/seq/PB_xxxx.fa would return a multi-sequence fasta with all segments (i.e. the isolate)
/seq/PB_xxxx_HA.fa would return the HA segment
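To make the two file cases concrete, here is a small sketch of how the Fasta text could be assembled. The `fasta_for` helper, the in-memory `segments` mapping, and the `<isolate>_<segment>` header format are assumptions for illustration only.

```python
from typing import Dict, Optional


def fasta_for(isolate_id: str, segments: Dict[str, str],
              segment: Optional[str] = None) -> str:
    """Build Fasta text for a whole isolate, or for one named segment.

    `segments` maps segment names (e.g. "HA") to nucleotide sequences.
    Raises KeyError if the requested segment does not exist.
    """
    names = [segment] if segment is not None else list(segments)
    records = []
    for name in names:
        # Hypothetical header format: ">" + isolate ID + "_" + segment name.
        records.append(f">{isolate_id}_{name}\n{segments[name]}\n")
    return "".join(records)
```

Calling it without `segment` yields the multi-sequence file for the isolate, and passing `segment="HA"` yields only the HA record.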

You could have similar HTML pages, but in practice the isolate html page might almost be all you need if it links directly to the fastas per-segment.

I tend to think this is OK with the seq text. Sequences are involved in all cases (seq isn't explicitly singular or plural).

theosanderson commented 10 months ago

We conducted a poll of users and there was a 7:1 vote in favour of seq

chaoran-chen commented 9 months ago

Then let's go for seq! I removed the discussion label and moved the issue to prioritized.

loculus-project / loculus

Sequence URLs #438