mastodon / mastodon

Your self-hosted, globally interconnected microblogging community
https://joinmastodon.org
GNU Affero General Public License v3.0

Dots and slashes break hashtags #19992

Open richvn opened 1 year ago

richvn commented 1 year ago

Steps to reproduce the problem

  1. I was trying to hashtag the DOI of a scientific research paper so that others could discover conversations about it.
  2. I posted: #10.1098/rsos.201617
  3. Trying another way, I posted #doi10.1098/rsos.201617 ...

Expected behaviour

10.1098/rsos.201617 becomes a hashtag, or doi10.1098/rsos.201617 becomes a hashtag.

Actual behaviour

The desired DOI doesn't become a hashtag

Detailed description

It appears that dots and slashes can't be used in hashtags. DOIs (Digital Object Identifiers) are unique strings for scientific research papers: they have dots and slashes in them. When I try to hashtag a DOI, like 10.1098/rsos.201617 , it doesn't become a hashtag. I'd like to be able to hashtag DOIs so that others can search for conversations about research papers (if the author has hashtagged them and wishes them to be easily discovered).

Specifications

Mastodon 4, chrome browser.

brendanjones commented 6 months ago

Is the hashtag part of the identifier?

@renchap No, the hashtag is not part of it. See https://en.wikipedia.org/wiki/Digital_object_identifier#Nomenclature_and_syntax for the nomenclature.

We would prefer not to have special cases in hashtag detection.

Does that include dots and slashes? I don't know the technical implications of recognising those in hashtags.

renchap commented 6 months ago

Ok, so not using this as a hashtag works and can be searchable.

Does this solve the original problem? You put the DOI identifier in your status without trying to make it a hashtag, then people can find it using search.

I understand there may be benefits to those being hashtags, like being able to follow them, but supporting dots and slashes in hashtags may cause a lot of other issues. We have hashtags in URLs for example, which would need the slashes to be encoded properly everywhere, or people might write something like "this is my #hashtag." and not expect the dot to be part of it.
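
The two concerns above can be illustrated with a minimal Python sketch. Both regular expressions are simplified stand-ins, not Mastodon's actual hashtag rule, and the `/tags/` path and the helper names `find_tags` and `tag_url_path` are assumptions made for illustration only.

```python
import re
from urllib.parse import quote

# Simplified stand-in for a Twitter-style rule: word characters only.
STRICT_TAG = re.compile(r"#(\w+)")
# Hypothetical relaxed rule that also accepts dots and slashes.
RELAXED_TAG = re.compile(r"#([\w./]+)")


def find_tags(pattern, text):
    """Return the tag names (without '#') that the pattern finds in text."""
    return pattern.findall(text)


def tag_url_path(tag):
    """Build an illustrative tag URL path for a tag name.

    safe="" makes quote() percent-encode "/" as well, so a tag that
    contains slashes stays a single path segment.
    """
    return "/tags/" + quote(tag, safe="")
```

With the relaxed rule the DOI is captured whole, but the sentence-final dot in "this is my #hashtag." becomes part of the tag, and any slash in the name has to be percent-encoded wherever the tag appears in a URL.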

MikeTaylor commented 6 months ago

I just did a search on the Mastodon server sauropods.win for the DOI 10.18435/vamp29394 and it worked just fine. So it does look like the actual problem (can't search for DOIs) has been solved by other means. For me, that means the present issue can be closed as WONTDO.

(Is it possible to follow search results as you can follow a hashtag, though?)

peterjc commented 6 months ago

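@MikeTaylor could you expand on that? I tried the following search terms:

* `https://doi.org/10.18435/vamp29394` (current best practice/convention is the URL)

* `doi:10.18435/vamp29394` (earlier DOI convention)

* `10.18435/vamp29394` (no prefix, no URL)

* `#10.18435/vamp29394` (attempting to use as hashtag)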

on the following servers:

I got no hits.

renchap commented 6 months ago

https://fediscience.org/ is running 4.1.9, not 4.2.2, so it would need to update.

For https://sauropods.win, maybe this instance is not aware of the status you are looking for? If no users on this instance follow the account that posted a status containing 10.18435/vamp29394, then you won't be able to find it on the instance.

This search returns results on Mastodon.social.

peterjc commented 6 months ago

Thanks. I presume search results differ if logged in: I see no hits on https://mastodon.social/search (not logged in).

MikeTaylor commented 6 months ago

Indeed, I can verify that a search for 10.18435/vamp29394 (no prefix, no hashtag) does find results on https://sauropods.win/ when logged in, but not when anonymous. That is very surprising to me — is it expected?

(Side-note: the form https://doi.org/10.18435/vamp29394 is very widespread, but is absolutely not best practice. DOIs are meant to be identifiers, not instructions.)

sneakers-the-rat commented 6 months ago

Indeed, I can verify that a search for 10.18435/vamp29394 (no prefix, no hashtag) does find results on https://sauropods.win/ when logged in, but not when anonymous. That is very surprising to me — is it expected?

@MikeTaylor Yes, this is expected. Logged out accounts can't use full-text search https://github.com/mastodon/mastodon/blob/89a8e6e6227eb901b5811d8417d81dc8ab1427a9/app/services/search_service.rb#L85


IMO I think this issue is getting a little off track, and we should split off conversations about DOI search elsewhere because a generalized search for DOIs is almost definitely out of scope for base masto and is much harder than it appears.

So as noted a few comments ago by @peterjc :

I tried the following search terms:

* `https://doi.org/10.18435/vamp29394` (current best practice/convention is the URL)

* `doi:10.18435/vamp29394` (earlier DOI convention)

* `10.18435/vamp29394` (no prefix, no URL)

* `#10.18435/vamp29394` (attempting to use as hashtag)

So there are already a few canonical forms for DOIs that would have to be given an explicit sameness relation/parsed into a single canonical form, but that's like ~5% of the challenge here.
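
As a rough illustration of that "sameness relation", here is a minimal Python sketch that reduces the four forms listed above (plus the older `dx.doi.org` resolver) to one bare form. The function name `normalize_doi` is hypothetical, the `10.NNNN/suffix` pattern is a loose approximation rather than the full DOI syntax, and lowercasing assumes DOIs are compared case-insensitively.

```python
import re
from urllib.parse import unquote

# Optional '#', optional 'doi:' or resolver URL prefix, then a loose
# approximation of a DOI: "10.", a 4-9 digit registrant code, "/", suffix.
_DOI_FORM = re.compile(
    r"^#?(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def normalize_doi(text):
    """Return the bare, lowercased DOI for any recognised form, else None."""
    # unquote() undoes URL percent-encoding, e.g. "%2F" back to "/".
    match = _DOI_FORM.match(unquote(text.strip()))
    if match is None:
        return None
    return match.group(1).lower()
```

All four search terms from the list then compare equal after normalization, while text that does not look like a DOI yields `None`.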

The real problem is that there is no 1:1 mapping between DOIs and the URLs they are supposed to dereference to. I would suspect that 99% of people sharing work are going to share a link to the paper. Maybe some percent of them will use the DOI resolver link, but i would wager most of them would copy and paste the URL in their browser while they're looking at the paper. To make DOI search really work, one would need to write a heuristic parser that fetches the page for each URL, tries to find one of the high number of ways that journals embed DOI metadata, resolve that DOI and confirm it does indeed lead to the same URL (which is ofc difficult since the same page can have many different URLs), and so on. This is something I've been talking with librarians and PID ppl for awhile, and they have basically said "lmao good luck."
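
To make one step of that heuristic concrete, here is a minimal Python sketch that pulls DOI candidates out of a page's `<meta>` tags once the HTML has already been fetched. The set of meta names is an assumption covering only a few common conventions, the names `doi_candidates` and `_DoiMetaParser` are hypothetical, and the sketch does not fetch pages or confirm that a DOI resolves back to the same URL.

```python
from html.parser import HTMLParser

# Meta tag names assumed for illustration; publishers use many more.
_DOI_META_NAMES = {"citation_doi", "dc.identifier", "prism.doi"}


class _DoiMetaParser(HTMLParser):
    """Collect the content of <meta> tags whose name suggests a DOI."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name in _DOI_META_NAMES and content:
            self.candidates.append(content)


def doi_candidates(html):
    """Return DOI-like strings found in the page's meta tags, in order."""
    parser = _DoiMetaParser()
    parser.feed(html)
    return parser.candidates
```

Even under these assumptions this only yields candidates: each one would still need normalizing and checking against the resolver, which is where most of the difficulty described above lies.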

So tl;dr accommodating DOI search by allowing special characters in hashtags would solve very little of the underlying problem, and full-text search addresses most of what can be addressed without a reverse DOI resolver in practice. The final leg will require academic instances I think writing their own patch, (which IMO is how the fedi should work! rather than packing all functionality into a single codebase, make a garden of forks)


To me the remaining question is are the devs willing to add a generalized syntax for hashtags that can escape nonstandard characters. I think there are a few good reasons for doing this outside of DOI search which i already described above.

Two syntaxes stand out:

This would require changes in a bunch of places, nonexhaustively:

As well as a bunch of UI sweetening to make using either of the above syntaxes fluid.

I think it would be a decent amount of work, and the gain is uncertain. One pretty clear case where we would want to support nonstandard hashtags (eg. with spaces) is in the case that Tumblr decides to federate, but that also seems like a somewhat separate conversation.

peterjc commented 5 months ago

@MikeTaylor it depends who you ask, CrossRef are very clear that they recommend the HTTPS DOI form now: https://www.crossref.org/display-guidelines/

@sneakers-the-rat yes sadly there are many ways to represent a DOI, even just URL versions including also the older URLs http://dx.doi.org/ and http://doi.org/ (not HTTPS) and as mentioned earlier potential use of URL encoding. In practice I think people would standardize but there is scope for optional normalization should an academic instance wish to patch this.

I think either the wiki-like or markdown-like syntax you suggest would allow the use of DOIs and other text with special characters as hashtags.

mrittmancr commented 5 months ago

At Crossref we already do a generalised search for DOIs as part of Event Data, trying to overcome the problem that @sneakers-the-rat describes above. Details are at https://www.eventdata.crossref.org/guide/data/matching-landing-pages/. The tricky part is maintaining a list of domains that publishers use. While we'd love everyone to represent DOIs in exactly the same way, that isn't a realistic expectation.

We haven't had the bandwidth to get back to this issue yet, and work on an agent to collect uses of DOIs on Mastodon. The full-text search function should help a great deal, though. As I understand it, there isn't any longer a need for DOIs to be tagged in a specific way to make them searchable.

sneakers-the-rat commented 5 months ago

omg @mrittmancr is the event data API ready? If you can show me a demo of how to do URL -> DOI resolution I'd happily draft a patch and test it on neuromatch.social. I just haven't been able to figure out how to query for a specific page rather than a whole domain

mrittmancr commented 5 months ago

@sneakers-the-rat the best option would be to drop me an email - mrittman@crossref.org. I can put you in touch with our dev team who can advise how you can get set up. This would be very cool!

MikeTaylor commented 5 months ago

I probably should not have raised the issue of whether DOIs "should" include the https://doi.org/ prefix, but since I did ... the very fact that this has changed from the previously recommended https://dx.doi.org/ shows how wrongheaded it is. Identifiers are for identifying things, not for saying where they can be found. I have tremendous respect for the people at CrossRef and the brainwork that has gone into getting us to where we are now, but that doesn't change the fact that this specific recommendation is Just Plain Wrong.

Anyway ...

It is true that at least the following forms of identifier are "equivalent":

So what we really want (the original motivation for this metastasizing issue) is the ability to search for one of these and find posts that match any of them. That is what's necessary to facilitate constructive discussions, and the discovery of related discussions.
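
A minimal Python sketch of that behaviour, assuming posts are plain strings: every DOI found in a post is reduced to a bare lowercased form, and a query matches a post when they share at least one DOI. The names `dois_in` and `posts_mentioning` are hypothetical, and the pattern is a loose approximation rather than the full DOI syntax.

```python
import re

# Optional 'doi:' or resolver URL prefix, then a loose DOI approximation.
_DOI_IN_TEXT = re.compile(
    r"(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)",
    re.IGNORECASE,
)


def dois_in(text):
    """Return the set of bare, lowercased DOIs mentioned in text."""
    found = set()
    for match in _DOI_IN_TEXT.finditer(text):
        # Assumption: trailing sentence punctuation is not part of the DOI.
        found.add(match.group(1).rstrip(".,;:!?)").lower())
    return found


def posts_mentioning(query, posts):
    """Return the posts that mention any DOI appearing in the query."""
    wanted = dois_in(query)
    return [post for post in posts if wanted & dois_in(post)]
```

Stripping trailing punctuation is a compromise: it handles a DOI at the end of a sentence but would mangle a DOI whose suffix really does end in one of those characters.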

(Indirecting through any of these to the target page seems to me a completely different issue, and one that would be both much more complex to address and of much less value. So I would prefer that we either drop that subthread of discussion, or open a separate issue for it.)

shaedrich commented 1 month ago

Ways to make DOIs searchable in Mastodon, in no particular order.

  1. Change the existing definition of what's included in a hashtag, so that dots and slashes are included. Example: #doi:10.1098/rsos.201617. Advantages: easy to use, no new concepts. Disadvantages: backwards-incompatible change and a lot of people may be depending on the old behaviour.

  2. Special-case the hashtag-parsing rule for DOIs, which may be recognised by a doi: prefix. Example: #doi:10.1098/rsos.201617 (same as before). Advantages: easy to use. Disadvantages: inelegant exception to usual behaviour, though unlikely in practice to surprise anyone.

  3. A new kind of hashtag beginning ##, where the parsing rules go up to the next whitespace. Example: ##doi:10.1098/rsos.201617. Advantages: explicit, unsurprising, no special cases, may be useful in other contexts. Disadvantages: new concept to learn, slightly awkward to use.

  4. Express DOIs as URLs and index all URLs. Example: https://doi.org/10.7717/peerj.12810. Advantages: easy to use. Disadvantages: people would expect clicking on a URL to navigate to it. Backwards-incompatible change. Four different forms of DOI URL would either be four different tokens or would need special-case canonicalization to make them behave as one.

  5. Express DOIs as URLs and index DOI URLs only. Example: https://doi.org/10.7717/peerj.12810. Advantages: easy to use. Disadvantages: horrible special case, with fairly sophisticated URL-sniffing. People would still expect clicking on a URL to navigate to it. Clicking behaviour different for these URLs than for all others. Same URL-canonicalization issues as in option 4.

Have I covered all the possible solutions to making DOIs searchable? Which of these seems to offer the best trade-off between the advantages and disadvantages?

I like the idea of (3) as it doesn't break existing behavior but isn't limited to DOIs (hardcoding this to a special case usually causes more trouble than it does good).

Problem with this solution is if this hashtag comes at the end of a sentence. One wouldn't be able to end this sentence with a full stop, since it would become part of the hashtag, unless dots are only allowed within the hashtag but not at its end. Similar to what @renchap said:

We have hashtags in URLs for example, which would need the slashes to be encoded properly everywhere, or people might write something like "this is my #hashtag." and not expect the dot to be part of it.

But coming back to the special case, if this were to be implemented (big if already), this should work for all kinds of URNs and info URIs as well (can someone think of other things that would be similar enough to be included as well?)

MikeTaylor commented 1 month ago

@shaedrich Thank you for this helpful summary of where we've arrived.

My take: I dislike options 4 and 5 partly because of the special-casing, but also DOIs are properly speaking identifiers rather than URLs. It's true that for pragmatic reasons they are often given in URL form, but I think this is a mistake. The canonical URL for resolving them has already changed once, and so has the preferred protocol. Identifiers should identify, and addresses are something different.

My next least-favourite is option 2, because of the special-casing.

Option 1 would suit me but I agree that the backwards-incompatible change could be surprising, so it should probably be avoided.

That leaves the rather elegant and general option 3 (## to introduce a hashtag that is parsed up to the next whitespace). I like this not only because it addresses the current issue, but also because it could have a lot of other cases as well — for example, if you want to hashtag the musical Oliver!.

If option 3 were to be implemented, I hope that #singleWord and ##singleWord would be equivalent, so that clicking on either version would take you to a list of posts that include either version.
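
Option 3 and the equivalence suggested here can be sketched in a few lines of Python. The grammar is hypothetical (`##` runs to the next whitespace, a single `#` takes word characters only), the name `extract_tags` is invented for this example, and lowercasing assumes tags are compared case-insensitively.

```python
import re

# "##" followed by everything up to whitespace, or "#" followed by word
# characters. The lookbehind avoids matching a '#' in the middle of a word
# or inside a longer run of '#' characters.
_TAG = re.compile(r"(?<![\w#])(?:##(\S+)|#(\w+))")


def extract_tags(text):
    """Return normalized tag names, treating #name and ##name alike."""
    tags = []
    for match in _TAG.finditer(text):
        name = match.group(1) or match.group(2)
        tags.append(name.lower())
    return tags
```

Because both forms are reduced to the same name, #singleWord and ##singleWord land in the same tag, while a ##-tag written directly before a full stop absorbs that full stop.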

shaedrich commented 1 month ago

I like this not only because it addresses the current issue, but also because it could have a lot of other cases as well — for example, if you want to hashtag the musical Oliver!.

Nice example, as it addresses what I pointed out earlier:

Problem with this solution is if this hashtag comes at the end of a sentence. One wouldn't be able to end this sentence with a full stop, since it would become part of the hashtag, unless dots are only allowed within the hashtag but not at its end. Similar to what @renchap said:

We have hashtags in URLs for example, which would need the slashes to be encoded properly everywhere, or people might write something like "this is my #hashtag." and not expect the dot to be part of it.

However, thinking about it, it leads (as programming issues unfortunately often do when solving actual use cases) to the problem of a hashtag like ##Oliver!?. With my proposed solution, it would render as "Oliver!?", which probably is not the intention behind that expression. This could be solved by using an interrobang (‽ or ⁉), but that doesn't seem like a very viable option to me.

MikeTaylor commented 1 month ago

I think that users sophisticated enough to use ## would be sophisticated enough to leave a space before the punctuation, should it be necessary!

Last night, I went to see ##Oliver! .

shaedrich commented 1 month ago

I have no doubt in them being sophisticated, except for those mistaking the double hashtag syntax for the single hashtag syntax. However, what you propose is merely a workaround since it produces invalid punctuation, unless Mastodon would account for it by displaying it without the space. But that might have other unwanted side effects. So, maybe @sneakers-the-rat is right, and we can only be sure about the user's real intentions while keeping grammar intact when using wiki-style syntax 🤔

MikeTaylor commented 1 month ago

Or how about recognizing #[ as introducing a hashtag that only ends with the next ]? So #[10.1111/j.1475-4983.2007.00728.x] would be a hashtag for that DOI.

(Yes, I know there are psychopathic DOIs containing ] characters — 10.1671/0272-4634(2004)024[0903:LCRSFC]2.0.CO;2 for example — but they are enough of an edge-case that I'm happy to shrug and move on.)

shaedrich commented 1 month ago

Well, if there's no other way, it probably could be parsed like nested tags that are closed in reversed order to their opening tags.
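
A minimal Python sketch of that idea, assuming a hypothetical `#[...]` syntax in which the tag ends at the `]` that balances the opening `[`. The function name `extract_bracket_tags` is invented for this example, and the approach only works when any brackets inside the name are themselves balanced.

```python
def extract_bracket_tags(text):
    """Return the names of all #[...] tags, honouring nested brackets."""
    tags = []
    i = 0
    while i < len(text) - 1:
        if text[i] == "#" and text[i + 1] == "[":
            depth = 0
            for j in range(i + 1, len(text)):
                if text[j] == "[":
                    depth += 1
                elif text[j] == "]":
                    depth -= 1
                    if depth == 0:
                        # Found the bracket that closes the opening "#[".
                        tags.append(text[i + 2:j])
                        i = j
                        break
            # An unterminated "#[" is ignored and scanning continues.
        i += 1
    return tags
```

Under these assumptions the bracket-containing DOI mentioned above is captured intact, since its inner `[` and `]` pair up before the final closing bracket.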

sneakers-the-rat commented 1 month ago

That or

#`weird.url:thing`

which is more in line with what the markdown dialects are doing nowadays (but is more ambiguous than [] which are explicit start/end delimiters)

trwnh commented 1 month ago

URIs are typically encased in <> so you could delimit start and end in that way... and on a protocol level i think it should be fine to use whatever you want as the Hashtag.name so a name with #<doi:something> would be feasible... https://datatracker.ietf.org/doc/html/rfc3986#appendix-C

the hard part is actually getting everyone to agree on a common and valid way of doing things. you've mostly got twitter-style hashtags where the name has to be a contiguous word, and then you've got tumblr-style tags that happen to be visually rendered with a hash sign before them and can include pretty much anything except a comma because commas are used to separate tags in their visual editor. i'm sure you could extend it any which way you want.

sneakers-the-rat commented 1 month ago

I like #<>

That's fully compatible with existing twitter-like hashtags, and there's nothing to stop additional UI elements from mimicking tumblr hashtags and using that format under the hood. I'll take a closer look at serialization and how it would federate this weekend
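
As a rough sketch of why the two forms can coexist, the following Python example accepts either a delimited `#<...>` tag or a plain word-character tag. The grammar and the name `extract_tags` are assumptions for illustration; nothing here reflects how Mastodon parses or federates tags.

```python
import re

# Either "#<" ... ">" with anything except angle brackets inside,
# or a plain Twitter-style tag made of word characters.
_TAG = re.compile(r"#(?:<([^<>]+)>|(\w+))")


def extract_tags(text):
    """Return tag names from both #<delimited> and plain #word forms."""
    # findall() yields (delimited, plain) pairs; the group that did not
    # take part in the match is an empty string.
    return [delimited or plain for delimited, plain in _TAG.findall(text)]
```

Plain tags parse as before, and the explicit closing `>` means a following full stop is never absorbed into the tag name.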