mastodon / mastodon

Your self-hosted, globally interconnected microblogging community
https://joinmastodon.org
GNU Affero General Public License v3.0

Dots and slashes break hashtags #19992

Open richvn opened 1 year ago

richvn commented 1 year ago

Steps to reproduce the problem

  1. I was trying to hashtag the DOI of a scientific research paper so that others could discover conversations about it.
  2. I posted: #10.1098/rsos.201617
  3. Trying another way, I posted #doi10.1098/rsos.201617 ...

Expected behaviour

10.1098/rsos.201617 becomes a hashtag, or doi10.1098/rsos.201617 becomes a hashtag.

Actual behaviour

The desired DOI doesn't become a hashtag

Detailed description

It appears that dots and slashes can't be used in hashtags. DOIs (Digital Object Identifiers) are unique strings for scientific research papers: they have dots and slashes in them. When I try to hashtag a DOI, like 10.1098/rsos.201617 , it doesn't become a hashtag. I'd like to be able to hashtag DOIs so that others can search for conversations about research papers (if the author has hashtagged them and wishes them to be easily discovered).
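A minimal sketch of why the tag gets cut short, assuming a simplified hashtag rule of `#` followed by letters, digits, or underscores. The pattern `SIMPLE_HASHTAG` and the helper `find_hashtags` are illustrative names, not Mastodon's actual implementation, and the real rule is more involved:

```python
import re

# Simplified stand-in for a hashtag rule (an assumption for illustration,
# not Mastodon's actual pattern): '#' followed by word characters only.
SIMPLE_HASHTAG = re.compile(r"#(\w+)")

def find_hashtags(text: str) -> list[str]:
    """Return the tag bodies that the simplified rule would recognise."""
    return SIMPLE_HASHTAG.findall(text)

# The match stops at the first character outside the allowed set (the dot).
print(find_hashtags("#10.1098/rsos.201617"))     # ['10']
print(find_hashtags("#doi10.1098/rsos.201617"))  # ['doi10']
```

Under a rule like this, everything after the first dot falls outside the tag, which is consistent with the behaviour described in this report.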

Specifications

Mastodon 4, Chrome browser.

MikeTaylor commented 1 year ago

As more academics make their way into Mastodon, finding some way to make DOIs searchable is going to be really important. Whether or not recognising them as a special-case pattern in hashtags is the right answer, we will need some kind of solution.

mkitti commented 1 year ago

Could they just be URLs? Could we make URLs searchable?

https://dx.doi.org/10.1098/rsos.201617

edit: https://doi.org/10.1098/rsos.201617 is preferred according to https://www.doi.org/factsheets/DOIProxy.html#encoding

jtaylor351 commented 1 year ago

Seems like this would break URLs because URLs can have pound signs in them. I’d recommend just using a URL for the DOI and a more descriptive hashtag

richvn commented 1 year ago

I can see that possible solutions, such as recognising 'doi: ..' as a hashtag despite special characters, or recognising URLs as hashtags, could break other things.

I can't see 'more descriptive hashtags' doing that much to improve discoverability of conversations about a research paper, since users are unlikely to coalesce on the same descriptions for any particular paper, and hashtags won't uniquely characterise any one research paper in the way that DOIs do.

Someone has suggested hashtagging the DOI without dots or slashes (if I'm correctly understanding what they wrote). As a convention, that might help, though fiddly.

MikeTaylor commented 1 year ago

@mkitti Making all URLs searchable would solve this problem, yes, though in a rather clumsy way. (In principle, the http:// is the means of resolution and dx.doi.org is the service that resolves: only the 10.1098/rsos.201617 is actually the ID that you want to search for.)

@richvn I like the idea of special-casing the doi: prefix and recognizing it as introducing a hashtag that runs all the way to the next whitespace. (Probably there are DOIs out there with embedded whitespace, but I think we can ignore such pathological cases.)

larsgw commented 1 year ago

I don't know if I've seen any with whitespace, but I do know that just / and . is not enough for some relatively common DOIs. Some publishers issue DOIs like this one: 10.1002/1520-6394(2000)12:3<118::AID-DA2>3.0.CO;2-G. Even whitespace is technically allowed, if I'm reading this correctly:

The DOI name [...] can incorporate any printable characters from the legal graphic characters of Unicode.

https://www.doi.org/doi_handbook/2_Numbering.html#2.2.3
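For illustration, the "run to the next whitespace" heuristic mentioned earlier in the thread can be sketched as below. The pattern is an assumption, not a standard: it expects `10.`, a 4 to 9 digit registrant code, a slash, and then any run of non-whitespace characters. The names `DOI_TO_WHITESPACE` and `extract_dois` are hypothetical.

```python
import re

# Heuristic only (an assumption, not part of any DOI specification):
# "10." + numeric registrant code + "/" + everything up to the next whitespace.
DOI_TO_WHITESPACE = re.compile(r"\b10\.\d{4,9}/\S+")

def extract_dois(text: str) -> list[str]:
    """Return every substring that looks like a DOI under the heuristic."""
    return DOI_TO_WHITESPACE.findall(text)

sici = "10.1002/1520-6394(2000)12:3<118::AID-DA2>3.0.CO;2-G"
print(extract_dois(f"See {sici} for details") == [sici])  # True
```

The trade-off is that trailing punctuation is swallowed too: in `read 10.1098/rsos.201617.` the final full stop ends up inside the match, and a DOI that really does contain whitespace would be truncated.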

danbri commented 1 year ago

This would make hashtags here different to many other sites where the range of characters is more limited

So +1 on making DOIs searchable via treating them as links (since they are links)

rossmounce commented 1 year ago

Just to note that the ugly type of DOI that @larsgw is referring to, which includes many potential break characters, is a SICI-containing DOI. They exist. They are real. They are definitely harder to handle for devs :(

Serial Item and Contribution Identifier (SICI) https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier

nichtich commented 1 year ago

The canonical form of a DOI is its URI which equals its URL. Any other syntax such as prepending # or doi: will not scale.

Apart from that the academic publication system is broken anyway. Instead of searching for DOI, each publication should have an ActivityPub inbox or something like that.

afontenot commented 1 year ago

I'm not sure that hashtags are a flexible enough solution to the problem you're trying to solve here. Also kind of skeptical that Mastodon developers will be interested in adding link search (although your instances would be free to implement that yourselves).

Maybe a reasonable workaround would be a bot that subscribes to the #doi hashtag on your instances, collects any DOI URIs in posts there, and updates an external database mapping each URI to a list of posts that contain it. It would be extremely easy to build a search tool that way and you could have some additional features, e.g. sorting by the size of the conversation or what have you.
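A rough sketch of the external index such a bot might maintain, assuming posts arrive as an ID plus plain text. How the bot subscribes to the hashtag is left out entirely; `DoiIndex`, `add_post`, `posts_for`, and `by_conversation_size` are hypothetical names, and the URL pattern only covers the doi.org and dx.doi.org resolver forms mentioned in this thread.

```python
import re
from collections import defaultdict

# Matches DOI resolver URLs and captures the DOI part (heuristic, an assumption).
DOI_URL = re.compile(r"https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)", re.IGNORECASE)

class DoiIndex:
    """In-memory mapping from a DOI to the IDs of posts that link to it."""

    def __init__(self) -> None:
        self._posts: dict[str, list[str]] = defaultdict(list)

    def add_post(self, post_id: str, text: str) -> None:
        for doi in DOI_URL.findall(text):
            key = doi.lower()  # assumes DOIs can be compared case-insensitively
            if post_id not in self._posts[key]:
                self._posts[key].append(post_id)

    def posts_for(self, doi: str) -> list[str]:
        return list(self._posts.get(doi.lower(), []))

    def by_conversation_size(self) -> list[tuple[str, int]]:
        """DOIs sorted by how many posts mention them, largest first."""
        return sorted(((d, len(p)) for d, p in self._posts.items()),
                      key=lambda item: -item[1])

index = DoiIndex()
index.add_post("p1", "#doi https://doi.org/10.7717/peerj.12810 worth a read")
index.add_post("p2", "http://dx.doi.org/10.7717/PeerJ.12810")
print(index.posts_for("10.7717/peerj.12810"))  # ['p1', 'p2']
```

A real deployment would need persistent storage and would have to respect the opt-out settings of instances and users.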

jtaylor351 commented 1 year ago

Totally agree with @afontenot! The lack of searchability on Mastodon is a feature not a bug. Allowing searching by links would have dramatic consequences far beyond the relatively niche problem of DOIs (e.g. everyone mentioning a news article would be able to find one another)

MikeTaylor commented 1 year ago

I get the reasoning behind general non-searchability being a feature. But here we're talking about people mentioning specific identifiable resources by identifier, a scenario where, just as when using hashtags, discoverability is explicitly intended.

bnlawrence commented 1 year ago

I think there is a clear requirement: find all the conversations which are discussing a DOI ... but it's not obvious we should be providing suggested solutions which break the existing functionality or ethos. I rather like the idea promoted by @afontenot, it feels like a good separation of concerns. Build on, not build in.

MikeTaylor commented 1 year ago

@bnlawrence I agree that this issue may be suboptimal in suggesting a specific change rather than starting from what problem we're trying to solve.

But the problem is a real one, and a very important one for scholars (especially in the sciences). I think that @afontenot's suggested solution, while conceptually neat, is really not a solution: my experience has been that people are far more likely to use facilities within an application than to go somewhere else to use them.

I don't know that we have yet arrived at the best solution, though.

Here's another for the pot, less DOI-specific. What if ##any-sequence/of:non#whitespace<characters was recognized and indexed as a hashtag? In other words, the double-hash prefix would mean that only whitespace can end the tag?

jedbrown commented 1 year ago

From a usability standpoint, people aren't going to write both #doi:10.1234/abc567 https://doi.org/10.1234/abc567 every time they reference a paper (so it can be clicked and find related conversations, and also just follow to the paper). Would it be too hard to only index doi.org addresses for search? I think a usability ideal would be to just write the URL and have clients render as DOI:10.1234/abc567 where the DOI part is like a hashtag (click gives search results) and the rest links through doi.org to the product.

MikeTaylor commented 1 year ago

@jedbrown Yes, that sounds great to me!

afontenot commented 1 year ago

I think it's unlikely that the Mastodon devs will want to develop the means to search by URL, even for a limited range of URLs, but I emphasize that I can't speak for them. Regardless, your instances would be free to develop such a capability yourselves, as Mastodon is open source software.

I'll point out that there's a full text search site where searching for URLs appears to "just work" - you need quotes around the URL to get exact matching. Here's an example doi link search: https://fedsearch.io/?q=%22doi.org%2F10.1093%2Fsysbio%2Fsyac072%22

Maybe this is enough to help some of you until the current madness dies down a bit.

Note that entire instances as well as individual users are able to opt out if they don't want their posts to be indexed.

Edit: it is also entirely up to the instance admin whether users are opted-out of search indexing by default. I think there will probably end up being a big debate about this sort of fediverse indexing, but I think the crucial thing is that if you see your posts on this site and haven't opted in to anything, that's because of the settings your instance admin set up the instance with.

MikeTaylor commented 1 year ago

Promising, @afontenot, thanks: https://fedsearch.io/?q=10.7717%2Fpeerj.12810

mfenner commented 1 year ago

I second @danbri that DOIs are just URLs and should just be treated as such. The funny DOIs (SICIs) are a thing of the 1990s; they exist, but are rather uncommon in new content. Hashes are common in URLs, but are handled the same way with DOIs (by the client, not the server).

The big issue with finding scientific content in Mastodon is that the vast majority of links (at least that is the experience with Twitter) don't use the DOI, but the URL of the website. An extra step is required, and has been done for years by organizations such as Altmetric or Crossref who collect altmetrics.

MikeTaylor commented 1 year ago

I beg your pardon; DOIs are not just URLs. 10.7717/peerj.12810 is a DOI; https://doi.org/10.7717/peerj.12810 is a URL, and not even a unique one. http://doi.org/10.7717/peerj.12810 is equivalent; so are https://dx.doi.org/10.7717/peerj.12810 and http://dx.doi.org/10.7717/peerj.12810

We need a way of searching for the DOI itself, not for one specific URL that happens to resolve a DOI. Or if not that, then at least a hack so that searching for any of these forms of the URL also finds all the other forms (plus probably others that I don't know about).

I do agree though that we can ignore those hideous SICI DOIs.
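The "hack" of treating all four URL forms as one identifier could look roughly like this. It is a sketch under stated assumptions: only the `doi.org` and `dx.doi.org` hosts are recognised, the function name `doi_from_url` is hypothetical, and DOIs containing `?` or `#` (such as some SICI forms) are not handled because the URL parser treats those as query and fragment delimiters.

```python
from typing import Optional
from urllib.parse import urlsplit

# Resolver hosts named in this thread; others may exist (assumption).
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}

def doi_from_url(url: str) -> Optional[str]:
    """Return the bare DOI from a resolver URL, or None if it is not one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    if (parts.hostname or "").lower() not in RESOLVER_HOSTS:
        return None
    doi = parts.path.lstrip("/")
    return doi if doi.startswith("10.") else None

forms = [
    "https://doi.org/10.7717/peerj.12810",
    "http://doi.org/10.7717/peerj.12810",
    "https://dx.doi.org/10.7717/peerj.12810",
    "http://dx.doi.org/10.7717/peerj.12810",
]
print({doi_from_url(u) for u in forms})  # {'10.7717/peerj.12810'}
```

Indexing the returned string rather than the URL means a search for the DOI finds posts that used any of the four forms.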

mfenner commented 1 year ago

@MikeTaylor we seem to disagree here on how this could best be implemented.

MikeTaylor commented 1 year ago

Well, I am not too hung up on how it's implemented.

To me, the bottom line here is that we need a way to have unique references to articles searchable, so that people can locate mentions of articles of interest.

The obvious way to do that is by indexing DOIs, but it's far from obvious how that should be done. There are lots of heuristics that could be applied, including groady ones like: any hashtag that begins 10. should be interpreted as extending to the next whitespace character.

I do also want to admit that there has been a big push to express DOIs as HTTPS URLs using the domain-name and (empty) path of the canonical server. I have always felt that this is conceptually wrong, but I recognise that a lot of people don't agree. To me, the fact that each DOI has at least four equivalent URL expressions feels like a slam-dunk argument, but still somehow not everyone is convinced :-) So that sub-issue clouds what we're trying to discuss here.

In my next comment, I will try to lay out all the candidate solutions to this problem.

MikeTaylor commented 1 year ago

Ways to make DOIs searchable in Mastodon, in no particular order.

  1. Change the existing definition of what's included in a hashtag, so that dots and slashes are included. Example: #doi:10.1098/rsos.201617. Advantages: easy to use, no new concepts. Disadvantages: backwards-incompatible change and a lot of people may be depending on the old behaviour.

  2. Special-case the hashtag-parsing rule for DOIs, which may be recognised by a doi: prefix. Example: #doi:10.1098/rsos.201617 (same as before). Advantages: easy to use. Disadvantages: inelegant exception to usual behaviour, though unlikely in practice to surprise anyone.

  3. A new kind of hashtag beginning ##, where the parsing rules go up to the next whitespace. Example: ##doi:10.1098/rsos.201617. Advantages: explicit, unsurprising, no special cases, may be useful in other contexts. Disadvantages: new concept to learn, slightly awkward to use.

  4. Express DOIs as URLs and index all URLs. Example: https://doi.org/10.7717/peerj.12810. Advantages: easy to use. Disadvantages: people would expect clicking on a URL to navigate to it. Backwards-incompatible change. Four different forms of DOI URL would either be four different tokens or would need special-case canonicalization to make them behave as one.

  5. Express DOIs as URLs and index DOI URLs only. Example: https://doi.org/10.7717/peerj.12810. Advantages: easy to use. Disadvantages: horrible special case, with fairly sophisticated URL-sniffing. People would still expect clicking on a URL to navigate to it. Clicking behaviour different for these URLs than for all others. Same URL-canonicalization issues as in #4.

Have I covered all the possible solutions to making DOIs searchable? Which of these seems to offer the best trade-off between the advantages and disadvantages?
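To make option 3 concrete, here is a small sketch of a double-hash rule in which only whitespace ends the tag. The pattern and the name `find_double_hash_tags` are hypothetical; requiring whitespace (or start of text) before the `##` is an extra assumption, added so that fragments inside URLs are not picked up.

```python
import re

# '##' at the start of the text or after whitespace, then everything up to
# the next whitespace character. Illustrative only, not an agreed syntax.
DOUBLE_HASH_TAG = re.compile(r"(?<!\S)##(\S+)")

def find_double_hash_tags(text: str) -> list[str]:
    """Return the bodies of all ##tags in the text."""
    return DOUBLE_HASH_TAG.findall(text)

print(find_double_hash_tags("Discussing ##doi:10.1098/rsos.201617 today"))
# ['doi:10.1098/rsos.201617']
```

Ordinary single-hash tags are untouched by this rule, which is what keeps the option backwards compatible.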

twi001 commented 1 year ago

I suggest using the short DOI (https://shortdoi.org/). For example, https://doi.org/10.7717/peerj.12810 has the short DOI https://doi.org/jkzk So, a unique hashtag would be #DOIjkzk.

MikeTaylor commented 1 year ago

@twi001 I'm sorry to say I really don't like that at all. As I've argued upthread, DOIs are identifiers. The fact that some specific URL-shortening service happens to redirect to the same destination as a given DOI doesn't make its proprietary code a substitutable identifier.

JesseWeinstein commented 1 year ago

@MikeTaylor Could you split up your (very good) comment about possible implementations into separate posts for each option, to make it easier for people to add :+1: (or :-1: for that matter) to those they approve of? I know this isn't a vote, but getting a sense of support might be useful in guiding devs towards what would be most useful to work on implementing.

MikeTaylor commented 1 year ago

That's a great idea, @JesseWeinstein. It's 3am here, so I'll leave it till the morning if that's OK. But I will do it.

twi001 commented 1 year ago

@MikeTaylor, please note that the short DOIs are generated and resolved by doi.org rather than a URL shortening service. Click for example on https://doi.org/jkzk corresponding to your example DOI. So, the short DOI is as 'proprietary' as the common long-form DOIs we are using. It is the same website/resolver for long and short DOIs! If one requests a short DOI, the website returns every time the same unique short DOI corresponding to the long form. I don't understand why the short DOI is unacceptable if the long form is OK.

noamross commented 1 year ago

I'd urge everyone here to consider how this laudable open-science practice is in tension with the deliberate virality/discovery-dampening design choices of Mastodon. Consider the following barely hypothetical thread:

The ability to discover is useful, and Altmetrics have their place. However, migration of the scientific community into the fediverse has given us an opportunity to reconsider a sharp binary between all-private and all-indexed/archived models of scientific discourse. I for one am interested in seeing how scientific discourse evolves on the platform where discussion amongst a relatively predictable but somewhat open community is possible with less risk of tumbling into a mass screaming match. If one wants to opt into a discoverable/indexed "conversation of record", there are other venues such as PubPeer, or things like field-specific forums.

jguhlin commented 1 year ago

Is it possible to create an extension for Mastodon that would simply regex search for DOIs and automatically link them / allow them for searching? I can see most servers not being interested in DOIs but the science servers would be.

For AltMetrics, I think it is going to be web spiders looking across public timelines for DOIs. I think priorities for what we want (metrics, searching for discussion, etc) will determine what we do.

I'm also curious about @nichtich's suggestion of an ActivityPub inbox for each DOI. This could be something each journal offers, then just tagging the DOI? (Or am I completely misunderstanding it?).

mfenner commented 1 year ago

Along those lines, the Fediverse community recently decided to no longer support federated searches (don't have a link right now).

ShortDOIs have been deprecated by the DOI Foundation for several years now. But dealing with link shorteners might be an issue.

mrittmancr commented 1 year ago

Interesting conversation, and at Crossref we'd be interested in identifying when DOIs are mentioned on Mastodon. However, I get the privacy thing and reasons for not having general federated searches. I like the solutions of @afontenot and @jguhlin. Crossref could act as the third party database to hold the mentions of each DOI (at least for Crossref DOIs).

I'd also caution about this conversation going in the direction of metrics and bean-counting. We should be thinking about the discussions on Mastodon providing context for research: who's interested in it? Has someone made a useful comment, presented a good example of reuse, or refuted the research?

[edit: I'm a Product Manager at Crossref responsible for Event Data]

peterjc commented 1 year ago

Cross reference #12285 (shared urls as "soft hashtags")

Another wrinkle in indexing DOIs is the URL forms can but don't always use escape codes, most commonly / vs %2F.

I therefore wondered if we might use hashtags with dots . as %2E etc (where the client software could do this for you), but it looks like even the percent sign is not allowed? What is allowed (#10728 about following the unicode standard for hashtags was closed)?
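The escape-code wrinkle can be handled by decoding before comparing. A minimal sketch, assuming that percent-decoding followed by case-folding is an acceptable normal form; the function names `normalise_doi` and `same_doi` are hypothetical:

```python
from urllib.parse import unquote

def normalise_doi(raw: str) -> str:
    """Percent-decode a DOI taken from a URL and fold case for comparison."""
    return unquote(raw).lower()

def same_doi(a: str, b: str) -> bool:
    return normalise_doi(a) == normalise_doi(b)

print(same_doi("10.1098%2Frsos.201617", "10.1098/rsos.201617"))  # True
```

One caveat: a DOI that contains a literal percent sign would be altered by the decoding step, so this only suits DOIs that were extracted from URLs.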

twi001 commented 1 year ago

@mfenner, I have heard the argument that ShortDOIs are deprecated before. But I can't find an official statement from the foundation. Could you please provide an official source for the decision to deprecate ShortDOIs that could settle our argument?

mfenner commented 1 year ago

@twi001 this was discussed in a DOI Foundation meeting several years ago. I no longer work for a DOI registration agency (I used to work for DataCite). But I think the case for URL shorteners is not very strong anyway as Mastodon does not have the 280 character limit of Twitter.

nemobis commented 1 year ago

URL shorteners are the scourge of the earth. There's no excuse for shortening a DOI, which is already an identifier.

danbri commented 1 year ago

Putting weird characters into hashtags may have accessibility consequences too

On Fri, 11 Nov 2022 at 13:04, nemobis wrote:

URL shorteners are the scourge of the earth http://fileformats.archiveteam.org/wiki/URL_shorteners. There's no excuse for shortening a DOI, which is already an identifier.


jesseskinner commented 1 year ago

I thought of another (not ideal) workaround that is usable right now: there could be a convention of replacing characters with underscores.

So #doi_10_1098_rsos_201617 for doi:10.1098/rsos.201617 Or #10_1098_rsos_201617 for 10.1098/rsos.201617

Would something like that work at all?
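As a sketch of that convention, the conversion could be automated like this. The function name `doi_to_hashtag` and the choice to collapse every run of non-alphanumeric characters into a single underscore are assumptions:

```python
import re

def doi_to_hashtag(doi: str, prefix: str = "doi_") -> str:
    """Turn a DOI into a hashtag using only letters, digits, and underscores."""
    body = re.sub(r"[^A-Za-z0-9]+", "_", doi).strip("_")
    return "#" + prefix + body

print(doi_to_hashtag("10.1098/rsos.201617"))             # #doi_10_1098_rsos_201617
print(doi_to_hashtag("10.1098/rsos.201617", prefix=""))  # #10_1098_rsos_201617
```

The mapping is lossy: `10.1098/rsos.201617` and a hypothetical `10.1098/rsos/201617` would produce the same tag, and the original DOI cannot be reconstructed from the hashtag alone.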

bnlawrence commented 1 year ago

I think there are two threads co-existing here:

  1. Should: We shouldn't do this (e.g. @noamross's comment above, supported by quite a few thumbs up), and
  2. How: Here are some ways we could do this.

If I were maintaining Mastodon, I'd want to resolve the "should we" before we "how".

Wrt the former. In conversations elsewhere, we sort of got to a point where (I think) people agreed that if you use a hashtag for a DOI you are absolutely inviting discovery, and you want to engender conversation about the paper in question. It's not about metrics.

So the way to deal with the first issue is that if we accept that we want hashtag:doi to work, we also want a (naked) doi to work so that people can have private discussions about papers (to some extent, solving the problem raised by @noamross). Then part of addressing the how is ensuring that there are two clear mechanisms for using DOIs (I am not aware that it is not possible to embed a DOI as a naked link now, but you would want to ensure that it didn't become impossible in some way).

sneakers-the-rat commented 1 year ago

If i may throw another suggestion in the ring on the "how" side, hopefully without going off topic:

Upthread there was a suggestion for a ## special hashtag that ends with a whitespace, but then there would have to be a third tag if one ever wanted to be able to use whitespace in hashtags. The purpose of a hashtag is to allow explicit discovery, but its form as being alphanumeric characters plus _ without whitespace is mostly an implementation artifact rather than something inherent to the idea of making discoverable, linked tags. This same problem actually played out in early wikis in the 90's and early 2000s with wikilinks, which were originally just any CamelCaseWord with a mid-word capital and no whitespace, but eventually moved towards [[Wikilinks]] (See the discussion on meatball on LinkPattern, CamelCase, FreeLink, and many other pages).

One proposal might be to make [[Wikilinks]] an additional tag syntax. The parser would be relatively straightforward to write (I have written one), and doing standard things like using \[[ to escape literal double brackets would make it so hashtags could be arbitrary series of characters.

Some Benefits:

Downsides:
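A minimal sketch of what such a wikilink-style tag parser might look like. This is not the parser mentioned in the comment above; the pattern and the name `find_wikilink_tags` are hypothetical, and tags containing `]]` are not supported.

```python
import re

# "[[" ... "]]" where the opening brackets are not preceded by a backslash.
# A backslash before "[[" marks the brackets as literal text.
WIKILINK = re.compile(r"(?<!\\)\[\[(.+?)\]\]")

def find_wikilink_tags(text: str) -> list[str]:
    """Return the contents of every unescaped [[...]] tag."""
    return [body.strip() for body in WIKILINK.findall(text)]

post = r"Reading [[10.1098/rsos.201617]] and [[open science]], not \[[literal]]"
print(find_wikilink_tags(post))  # ['10.1098/rsos.201617', 'open science']
```

Because the closing delimiter is explicit, the tag body can contain dots, slashes, and even whitespace without any further special cases.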

MikeTaylor commented 1 year ago

I think @sneakers-the-rat's suggestion is excellent, and would fulfil the original requirement nicely as well as having lots of other use-cases.

mitar commented 1 year ago

I think that instead of adding some special syntax and making a way to search that, it would be much more useful to just enable searching on URLs. URLs have defined syntax already. If we want to parse something, we can just be parsing URLs instead of yet another way to express links.

danbri commented 1 year ago

Tags are mostly useful for informally cross-referencing stuff, often across systems. So you might use tags meaningful to you on some mastodon servers but also Pinboard bookmarks, Flickr photos, Instagram, Twitter archives etc. The value of this diminishes if services diverge in their definition for tag formatting

On Fri, 23 Dec 2022 at 14:01, Mitar wrote:

I think that instead of adding some special syntax and making a way to search that, it would be much more useful to just enable searching on URLs. URLs have defined syntax already. If we want to parse something, we can just be parsing URLs instead of yet another way to express links.


Rhyothemis commented 1 year ago

The 'should problem' seems like it is 'solved' by the person posting deciding not to use the proposed DOI hashtag feature.

The negative consequences of siloing by not having the ability to use a DOI hashtag or similar feature seem more of a concern.

mkuhn commented 1 year ago

Regarding @noamross's comment above: Most preprints and papers are not going to be the target of brigading or shouting matches, and therefore disallowing search for DOIs or URLs in all cases is too harsh. Mastodon instances and individual users should have the option to make posts findable. What is the argument against individual users consenting to this?

Ideally all users would have to do is include a valid URL to the paper, that then can get harmonised by Altmetrics or CrossRef.

brendanjones commented 6 months ago

Given full text search is now available, are hashtags needed to search for DOIs? Though even if the DOIs-in-hashtags use case isn't needed, it'd still be nice to not have hashtags broken by dots and slashes.

egonw commented 6 months ago

Given full text search is now available, are hashtags needed to search for DOIs?

I am very much looking forward to the first DOI/Search use case.

sneakers-the-rat commented 6 months ago

Given full text search is now available, are hashtags needed to search for DOIs?

I am very much looking forward to the first DOI/Search use case.

Not sure what you mean, it's a basic need for scholarly communication.

Re the above question: it decreases the need to change the hashtag pattern, but we are still working on a resolver that can index DOIs from generic URLs, which is harder than it should be. Probably won't be relevant to most instances, but for the ones where it is we'll be distributing a patch.

egonw commented 6 months ago

Not sure what you mean, its a basic need for scholarly communication.

Let me rephrase that: "the first user". I am not sure it is a basic need, but reading the scientific discussion around an article is indeed a very useful part of scholarly communication. What I tried to say: let's hope someone steps up soon who actually uses it.

renchap commented 5 months ago

Is the hashtag part of the identifier? We would prefer not to have special cases in hashtag detection.

As the full text search has been reworked in 4.2, searching for "10.1098/rsos.201617" should work now?
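For anyone who wants to try this from a script, a quoted DOI query could be built as follows. This is a sketch with assumptions: `mastodon.example` is a placeholder instance, `doi_search_url` is a hypothetical helper, and the `/api/v2/search` endpoint with `q` and `type=statuses` is used on the understanding that the instance has full-text search enabled and that the request is made with whatever authentication the instance requires.

```python
from urllib.parse import urlencode

def doi_search_url(instance: str, doi: str) -> str:
    """Build a status-search URL for an exact-phrase DOI query."""
    query = urlencode({"q": f'"{doi}"', "type": "statuses"})
    return f"https://{instance}/api/v2/search?{query}"

print(doi_search_url("mastodon.example", "10.1098/rsos.201617"))
# https://mastodon.example/api/v2/search?q=%2210.1098%2Frsos.201617%22&type=statuses
```

The quotes around the DOI ask for an exact match, and the slash is percent-encoded so that it survives as part of the query value.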