talmobi / tor-request

light Tor proxy wrapper for request library
http://tor.jin.fi/
311 stars, 43 forks

Invalid character on the tor get request #43

Closed: PrasannaBrabourame closed this issue 5 years ago

PrasannaBrabourame commented 5 years ago

Invalid character in the response to a tor GET request. The bold character is received in an invalid form. Original: Antonín. Received: Antonn

talmobi commented 5 years ago

Halåj~

You mean in the url? or the contents of the request body?

Are you able to give me a more detailed example?

Thank you~

knoxcard commented 5 years ago

@PrasannaBrabourame - Try this...

var tr = require('tor-request');
tr.request(url, (err, res, data) => {
  // Strip combining diacritics, then drop every remaining non-word, non-space character
  data = data.toString().normalize('NFD').replace(/[\u0300-\u036f]/g, '').replace(/[^\w\s]/gi, '')
  console.log(data)
})

Please let us know the outcome...thanks!

PrasannaBrabourame commented 5 years ago

@knoxcard It is not working; it just replaces the Unicode character. I need the original character in the output.

PrasannaBrabourame commented 5 years ago

@talmobi In the content request body.

PrasannaBrabourame commented 5 years ago

@talmobi @knoxcard Is there any possibility of sending the request with a "UTF-8" header? Let me know.

talmobi commented 5 years ago

@PrasannaBrabourame yes there is, see the README: https://github.com/talmobi/tor-request#custom-headers

Basically uses the same format as the request module.

knoxcard commented 5 years ago

@PrasannaBrabourame - yes I'd suggest doing what @talmobi just stated.

PrasannaBrabourame commented 5 years ago

@talmobi Even after setting the custom header I have the same issue. Please find the code below.

tr.request({
  url: ``,
  headers: {
    'user-agent': 'giraffe',
    'Content-Type': 'text/html',
    'charset': 'utf-8'
  }
}, function (err, proxyres, body) {
  // body = body.toString().normalize('NFD').replace(/[\u0300-\u036f]/g, '').replace(/[^\w\s]/gi, '')
  if (err) {
    console.log(err)
  }
  if (!err && res.statusCode == 200) {
    res.send(body)
  }
});

PrasannaBrabourame commented 5 years ago

@knoxcard Even after trying that, the same error happens.

knoxcard commented 5 years ago

@PrasannaBrabourame - can you post your full code?

PrasannaBrabourame commented 5 years ago

@knoxcard

let result = req.body;
tr.request({
  url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote',
  headers: {
    'user-agent': 'giraffe',
    'Content-Type': 'text/html',
    'charset': 'utf-8'
  }
}, function (err, proxyres, body) {
  if (err) {
    console.log(err)
  }
  if (!err && res.statusCode == 200) {
    res.send(body)
  }
});

talmobi commented 5 years ago

Halåj~

Related bug fix in newest version tor-request@3.1.0 and working examples at the end of this post.

Looks like this is an encoding issue and not directly related to tor-request or the request module. The resource you are downloading appears to be encoded as ISO-8859-1. That causes your issue because the shorthand call you are using ( passing the ( err, res, body ) callback function as the last argument ) automatically decodes the body as utf8 ( instead of ISO-8859-1, which your requested resource is encoded in ).

Modern Web browsers follow the WHATWG Encoding Standard which aliases both 'latin1' and 'ISO-8859-1' to 'win-1252'. This means that while doing something like http.get(), if the returned charset is one of those listed in the WHATWG specification it is possible that the server actually returned 'win-1252'-encoded data, and using 'latin1' encoding may incorrectly decode the characters.

ref: https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings
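The difference between 'latin1' and 'win-1252' only shows up for bytes in the range 0x80-0x9F. The sketch below illustrates that with Node's Buffer and a small hand-written table. `WIN1252_SAMPLE` and `decodeWin1252Sample` are hypothetical names made up for this illustration, and the table covers only three bytes, not the full windows-1252 mapping; a library such as iconv-lite would be the practical choice.

```javascript
// Partial, illustrative table: three of the bytes in 0x80-0x9F where
// windows-1252 assigns printable characters (NOT the full mapping).
var WIN1252_SAMPLE = {
  0x80: '\u20ac', // euro sign
  0x93: '\u201c', // left double quotation mark
  0x94: '\u201d'  // right double quotation mark
}

// Hypothetical helper: decode bytes using the sample table above and
// fall back to the byte value as the code point (as latin1 does).
function decodeWin1252Sample (buf) {
  var out = ''
  for (var i = 0; i < buf.length; i++) {
    var byte = buf[i]
    out += WIN1252_SAMPLE[byte] || String.fromCharCode(byte)
  }
  return out
}

var bytes = Buffer.from([0x93, 0x35, 0x80, 0x94])

// Buffer's 'latin1' maps each byte to the code point with the same value,
// so 0x93, 0x80 and 0x94 become invisible C1 control characters.
var asLatin1 = bytes.toString('latin1')
var asWin1252 = decodeWin1252Sample(bytes)

console.log(JSON.stringify(asLatin1))
console.log(asWin1252)
```

For text that stays within 0xA0-0xFF, such as the í in "Antonín", both encodings agree, so either label decodes it correctly.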

Instead you can use the response[1] event to convert the chunks/content of the body yourself using for example the iconv-lite module like this:

var tr = require( 'tor-request' )
var iconv = require( 'iconv-lite' )

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.on( 'response', function ( res ) {
  console.log( res.headers[ 'content-type' ] ) // 'application/x-research-info-systems;charset=ISO-8859-1'

  var chunks = [] // save the raw bytes of the response body ( without decoding it as a string )
  res.on( 'data', function ( chunk ) {
    chunks.push( chunk )
  } )

  res.on( 'end', function () {
    console.log( chunks )
    const str = iconv.decode( Buffer.concat( chunks ), 'ISO-8859-1' )
    console.log( str )
    // fs.writeFileSync( 'ris', str, { encoding: 'utf8' } )
  } )
} )
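The example above hardcodes 'ISO-8859-1'. As a possible refinement, the charset could be read from the content-type header that the handler already logs. The helper below is a sketch only: `charsetFromContentType` is a hypothetical name, and the regular expression is a simplification that assumes a plain `charset=...` parameter.

```javascript
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value, or return a fallback when none is present.
function charsetFromContentType (contentType, fallback) {
  var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '')
  return match ? match[1] : fallback
}

console.log(
  charsetFromContentType( 'application/x-research-info-systems;charset=ISO-8859-1', 'utf8' )
) // prints "ISO-8859-1"

console.log(
  charsetFromContentType( 'text/html', 'utf8' )
) // prints "utf8"
```

Inside the 'response' handler the result could be passed to iconv.decode in place of the literal 'ISO-8859-1', ideally after checking it with iconv.encodingExists, since servers sometimes send charset labels a decoder does not recognize.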

or you could pipe it directly to stdout:

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( process.stdout )

or a file stream [2]:

var fs = require( 'fs' )

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( fs.createWriteStream( 'filename.ris' ) )

or any kind of custom or other writable stream:

var Writable = require( 'stream' ).Writable
var ws = new Writable()
// called for every decoded chunk that is piped into the stream
ws._write = function ( chunk, enc, next ) {
  console.log( chunk.toString() )
  next()
}

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( ws )

see the request docs for details: https://www.npmjs.com/package/request#streaming

You will need to update to the latest version, tor-request@3.1.0, which fixes a bug that would throw an error when the shorthand callback was undefined.

EDITS:
[1]: fix event name request -> response
[2]: removed duplicated code block

knoxcard commented 5 years ago

@talmobi - you deserve a GitHub Nobel Prize, nice work! Thanks for the thorough analysis and latest release version fix!

knoxcard commented 5 years ago

@PrasannaBrabourame - does this suffice? close ticket?

PrasannaBrabourame commented 5 years ago

@talmobi Great work. It is working fine. Thanks a lot both @talmobi and @knoxcard. Kudos to you both.

PrasannaBrabourame commented 5 years ago

I am getting an empty response from the URL below: "https://ieeexplore.ieee.org/xpl/downloadCitations?recordIds=5738220&download-format=download-bibtex&citations-format=citation-only"