Closed PrasannaBrabourame closed 5 years ago
Halåj~
You mean in the url? or the contents of the request body?
Are you able to give me a more detailed example?
Thank you~
@PrasannaBrabour - Try this...
var tr = require('tor-request');
tr.request(url, (err, res, data) => {
  data = data.toString().normalize('NFD').replace(/[\u0300-\u036f]/g, '').replace(/[^\w\s]/gi, '')
  console.log(data)
})
Please let us know the outcome...thanks!
@knoxcard It is not working; it just replaces the Unicode character. I need the original character in the output.
@talmobi In the content request body.
@talmobi @knoxcard Is there any possibility of sending the request with a "UTF-8" header? Let me know.
@PrasannaBrabourame yes there is, see the README: https://github.com/talmobi/tor-request#custom-headers
It basically uses the same format as the request module.
@PrasannaBrabourame - yes I'd suggest doing what @talmobi just stated.
@talmobi Even after setting the custom header, I have the same issue. Please find the code below.
tr.request({
  url: ``,
  headers: { 'user-agent': 'giraffe', 'Content-Type': 'text/html', 'charset': 'utf-8' }
}, function (err, proxyres, body) {
  // body = body.toString().normalize('NFD').replace(/[\u0300-\u036f]/g, '').replace(/[^\w\s]/gi, '')
  if (err) {
    console.log(err)
  }
  if (!err && res.statusCode == 200) {
    res.send(body)
  }
});
@knoxcard Even after trying that, the same error happens.
@PrasannaBrabourame - can you post your full code?
@knoxcard
let result = req.body;
tr.request({
  url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote',
  headers: { 'user-agent': 'giraffe', 'Content-Type': 'text/html', 'charset': 'utf-8' }
}, function (err, proxyres, body) {
  if (err) {
    console.log(err)
  }
  if (!err && res.statusCode == 200) {
    res.send(body)
  }
});
Halåj~
There is a related bug fix in the newest version, tor-request@3.1.0, and there are working examples at the end of this post.
Looks like this is an encoding issue and not directly related to tor-request or the request module. It seems that the resource you are downloading is encoded in ISO-8859-1. This leads to your issue because the shorthand call that you are using ( passing the ( err, res, body ) callback function as the last argument ) automatically decodes the body as utf8 ( instead of ISO-8859-1, which your requested resource is encoded in ).
Modern Web browsers follow the WHATWG Encoding Standard which aliases both 'latin1' and 'ISO-8859-1' to 'win-1252'. This means that while doing something like http.get(), if the returned charset is one of those listed in the WHATWG specification it is possible that the server actually returned 'win-1252'-encoded data, and using 'latin1' encoding may incorrectly decode the characters.
ref: https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings
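The mismatch described above can be reproduced without any network request. The following is a minimal sketch using only Node's built-in Buffer; the sample string comes from the original report, and Node's 'latin1' encoding ( which maps each byte directly to the code point of the same value ) stands in for the server's ISO-8859-1 output. The variable names are illustrative only.

```javascript
// "Antonín" as a server would send it in ISO-8859-1:
// the "í" is the single byte 0xED.
const latin1Bytes = Buffer.from('Antonín', 'latin1')

// Decoding those bytes as UTF-8 ( what the shorthand callback does )
// cannot interpret the lone 0xED byte, so it becomes U+FFFD.
const wrong = latin1Bytes.toString('utf8')

// Decoding with the matching single-byte encoding recovers the text.
const right = latin1Bytes.toString('latin1')

console.log(wrong)
console.log(right)
```

Once the bytes have been decoded as utf8, the original byte is lost, which is why the raw chunks have to be collected before any string conversion happens.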
Instead, you can use the response [1] event to convert the chunks/content of the body yourself, using for example the iconv-lite module, like this:
var tr = require( 'tor-request' )
var iconv = require( 'iconv-lite' )

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.on( 'response', function ( res ) {
  console.log( res.headers[ 'content-type' ] ) // 'application/x-research-info-systems;charset=ISO-8859-1'
  var chunks = [] // save the raw bytes of the response body ( without decoding it as a string )
  res.on( 'data', function ( chunk ) {
    chunks.push( chunk )
  } )
  res.on( 'end', function () {
    console.log( chunks )
    // decode the collected bytes using the charset the server reported
    const str = iconv.decode( Buffer.concat( chunks ), 'ISO-8859-1' )
    console.log( str )
    // fs.writeFileSync( 'ris', str, { encoding: 'utf8' } )
  } )
} )
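The example above hardcodes 'ISO-8859-1', but the content-type header it logs already carries the charset. A rough sketch of reading it from the header instead follows; charsetFromContentType is a hypothetical helper written for illustration, not part of tor-request, request, or iconv-lite, and the 'utf8' fallback is an assumption.

```javascript
// Hypothetical helper ( not part of any library mentioned here ):
// extract the charset parameter from a Content-Type header value,
// falling back to a default when none is present.
function charsetFromContentType ( contentType, fallback ) {
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec( contentType || '' )
  return match ? match[ 1 ] : ( fallback || 'utf8' )
}

const header = 'application/x-research-info-systems;charset=ISO-8859-1'
const detected = charsetFromContentType( header )
const missing = charsetFromContentType( 'text/html' )

console.log( detected )
console.log( missing )
```

The returned name could then be passed to iconv.decode( Buffer.concat( chunks ), detected ) in place of the hardcoded string. Checking it with iconv.encodingExists( detected ) first would guard against a charset that iconv-lite does not know.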
or you could pipe it directly to stdout:
tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( process.stdout )
or a file stream [2]:
var fs = require( 'fs' )

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( fs.createWriteStream( 'filename.ris' ) )
or any kind of custom or other writable stream:
var ws = require( 'stream' ).Writable()
ws._write = function ( chunk, enc, next ) {
  console.log( chunk.toString() )
  next()
}

tr.request(
  {
    url: 'https://www.worldcat.org/title/science/oclc/52065228?client=worldcat.org-detailed_record&page=endnote'
  }
)
.pipe( iconv.decodeStream( 'ISO-8859-1' ) )
.pipe( ws )
See the request docs for details: https://www.npmjs.com/package/request#streaming
You will need to update to the latest version, tor-request@3.1.0, which fixed a bug that would throw an error when the shorthand callback was undefined.
EDITS:
[1]: fix event name request -> response
[2]: removed duplicated code block
@talmobi - you deserve a GitHub Nobel Prize, nice work! Thanks for the thorough analysis and latest release version fix!
@PrasannaBrabourame - does this suffice? close ticket?
@talmobi Great work. It is working fine. Thanks a lot both @talmobi and @knoxcard. Kudos to you both.
I am getting an empty response from the URL below: "https://ieeexplore.ieee.org/xpl/downloadCitations?recordIds=5738220&download-format=download-bibtex&citations-format=citation-only"
Invalid character on the tor get request. Bold Character is received in invalid format Original - Antonín Received - Anton�n,