twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0

Use of carriage returns in `tweet_length` #8

Open hundredwatt opened 9 years ago

hundredwatt commented 9 years ago

We encountered an issue where the Twitter.com web interface reports a different length than the Ruby gem when a "\r\n" line break is used, e.g.:

Web interface (3 characters): image

Ruby (4 characters):

> Twitter::Validation.tweet_length "a\r\nb"
=> 4

As a workaround, we are going to use a string replace before validating to change "\r\n" to "\n".
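A minimal sketch of that workaround, assuming a hypothetical helper named `normalize_crlf` (it is not part of twitter-text). `String#length` stands in for the gem's length calculation here so the snippet runs on its own:

```ruby
# Hypothetical helper (not part of twitter-text): collapse Windows-style
# "\r\n" line breaks into a single "\n" before measuring the text.
def normalize_crlf(text)
  text.gsub("\r\n", "\n")
end

raw = "a\r\nb"
clean = normalize_crlf(raw)

puts raw.length    # 4: "\r" and "\n" are counted separately
puts clean.length  # 3: matches what the web interface reports
```

In real use the cleaned string would be passed to `Twitter::Validation.tweet_length` instead of calling `length` directly, since the gem applies further rules that plain string length does not.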

Not sure if this should be considered a bug in this gem or a string encoding issue (problem was found on Mac OS 10.10), but wanted to report in case it is a bug. Any thoughts?

twuttke commented 9 years ago

Twitter.com does line feed normalization and strips some invalid characters during the editing process. In other words, it converts "\r\n" into "\n" while you are editing. I'm curious if the server also normalizes line feeds or not. Have you tried tweeting that from another client that does not preprocess text as much? Does it stay intact when the Tweet gets rendered?

hundredwatt commented 9 years ago

Just tried tweeting it from a client (using the twitter ruby gem):

> client.update "a\r\nb"
  HTTP POST (84.88ms)   https://api.twitter.com:443/1.1/statuses/update.json
  Request body   status=a%0D%0Ab
  Response status   Net::HTTPOK (200)
  Response body   { ... ,"text":"a\nb", ... }

So it appears that Twitter converts "\r\n" to "\n" in its API. Should we modify the twitter-text library accordingly? A test could be:

    - description: "Valid Tweet: 140 characters (counting 2 character end-of-line as 1 character)"
      text: "A lie gets halfway around the world before the truth has a chance to get its pants on. \r\nWinston Churchill (1874-1965) http://bit.ly/dJpywL"
      expected: true

Here's what happens when I use the test text with the Twitter API:

> text = "A lie gets halfway around the world before the truth has a chance to get its pants on. \r\nWinston Churchill (1874-1965) http://bit.ly/dJpywL"
=> "A lie gets halfway around the world before the truth has a chance to get its pants on. \r\nWinston Churchill (1874-1965) http://bit.ly/dJpywL"
> Twitter::Validation.tweet_invalid?(text)
=> :too_long
> Twitter::Validation.tweet_length text
=> 141
> client.update text
  HTTP POST (93.74ms)   https://api.twitter.com:443/1.1/statuses/update.json
  Request body   status=A+lie+gets+halfway+around+the+world+before+the+truth+has+a+chance+to+get+its+pants+on.+%0D%0AWinston+Churchill+%281874-1965%29+http%3A%2F%2Fbit.ly%2FdJpywL
  Response status   Net::HTTPOK (200)
  Response body   { ... ,"text":"A lie gets halfway around the world before the truth has a chance to get its pants on. \nWinston Churchill (1874-1965) http:\/\/t.co\/Ji7dNuw4bz" ... }
=> #<Twitter::Tweet>

So even though the twitter-text library reports the Tweet is invalid, the Twitter API accepted it.

twuttke commented 9 years ago

Should the length calculation rewrite the tweet text before calculating, or should there be a separate "clean tweet text" function that does this? For example, does the API let you tweet 140 of these "\r\n" sequences? If it does, then the server-side cleanup happens before the length check. If it does not, then the length check is done before the cleanup. That's why it might make sense to clean the tweet before calculating the length and sending it.
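The two possible orderings can be sketched as follows. This is only an illustration of the question, not the server's actual logic: `cleanup`, `accepted_check_first?`, and `accepted_cleanup_first?` are hypothetical names, and plain string length is assumed as the measure.

```ruby
LIMIT = 140 # the Tweet length limit discussed in this thread

# Hypothetical helper: normalize "\r\n" to "\n", as the API response suggests.
def cleanup(text)
  text.gsub("\r\n", "\n")
end

# Ordering A: the length check runs on the raw text, before any cleanup.
def accepted_check_first?(text)
  text.length <= LIMIT
end

# Ordering B: cleanup runs first, then the length check.
def accepted_cleanup_first?(text)
  cleanup(text).length <= LIMIT
end

probe = "\r\n" * 140
puts accepted_check_first?(probe)   # false: 280 raw characters
puts accepted_cleanup_first?(probe) # true: 140 characters after cleanup
```

A probe like this only distinguishes the orderings if the server accepts the text for other reasons too, so a whitespace-only status may not be a conclusive test.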

The test should ideally not contain other potentially variable things (like links that might get shortened differently). How about 140 "\r\n" sequences in a row?

Although not common, some operating systems use only "\r", and some reverse the order to "\n\r": http://en.m.wikipedia.org/wiki/Newline

This is the regex Twitter.com uses to normalize line feeds:

    var LINE_FEEDS_REGEX = /\r\n|\n\r|\n/g

although looking at that now, I'm not sure why we aren't using

    var LINE_FEEDS_REGEX = /\r\n|\n\r|\n|\r/g

These are some of the characters that the Twitter API will strip out:

    var INVALID_CHARS_REGEX = /[\uFFFE\uFEFF\uFFFF\u200E\u200F\u202A-\u202E\x00-\x09\x0B\x0C\x0E-\x1F]/g;

The API will only do text.replace("\r\n", "\n") as far as I can tell. Maybe that's the most important one.

There are various other post-processing filters the server does.
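For the Ruby gem, the two JavaScript regexes quoted above could be ported roughly as below. This is a sketch under assumptions: the constant names and the `clean_tweet_text` helper are hypothetical, the line feed pattern uses the broader variant that also matches a lone "\r", and nothing here is confirmed to match what the server does.

```ruby
# Ruby ports of the two JavaScript regexes quoted above (assumed equivalents,
# not the code Twitter runs). LINE_FEEDS includes the lone "\r" variant.
LINE_FEEDS = /\r\n|\n\r|\n|\r/
INVALID_CHARS = /[\uFFFE\uFEFF\uFFFF\u200E\u200F\u202A-\u202E\x00-\x09\x0B\x0C\x0E-\x1F]/

# Hypothetical helper: returns a cleaned copy and leaves the input untouched.
def clean_tweet_text(text)
  text.gsub(LINE_FEEDS, "\n").gsub(INVALID_CHARS, "")
end

puts clean_tweet_text("a\r\nb\n\rc\rd").inspect  # "a\nb\nc\nd"
puts clean_tweet_text("a\u200Eb\u0000c").inspect # "abc"
```

Because `gsub` returns a new string, the caller's original text is not modified; the cleaned copy can be used for the length calculation alone.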


bmeike commented 9 years ago

On Dec 30, 2014, at 9:37 AM, Tom Wuttke notifications@github.com wrote:

Should the length calculation rewrite the tweet text before calculating

Not my circus, not my monkey. … but, please, no.

Calculate the length based on a rewritten copy, if that seems right.

No weird side effect, though, please.

-blake=

hundredwatt commented 9 years ago

Tweeting 140 of the "\r\n"s results in a Forbidden error as the Twitter API considers it a blank status:

> client.update "\r\n" * 140
  HTTP POST (38.29ms)   https://api.twitter.com:443/1.1/statuses/update.json
  Request body   status=%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A%0D%0A
  Response status   Net::HTTPForbidden (403)
  Response body   {"errors":[{"code":170,"message":"Missing required parameter: status."}]}

How about if we change the text to something more generic?:

> "Lorem ipsum dolor sit amet consectetur adipiscing elit.\r\nCum sociis natoque penatibus et magnis dis parturient montes nascetur ridiculus mus.".gsub("\r\n", "\n").size
=> 140

    - description: "Valid Tweet: 140 characters (counting 2 character end-of-line as 1 character)"
      text: "Lorem ipsum dolor sit amet consectetur adipiscing elit.\r\nCum sociis natoque penatibus et magnis dis parturient montes nascetur ridiculus mus."
      expected: true

hundredwatt commented 9 years ago

I started working on this in #9. Feedback and help with other languages would be appreciated.