Closed. drprabhakar closed this issue 9 years ago
It's possible that the page doesn't have matching tags and a browser is adding them automatically.
The "\n" character is just a newline character, which more than likely exists within the page itself.
When I view the extracted raw HTML, the newline character "\n" is still visible in the browser view as well. Below is the browser view:
What's the web page so that I can test it?
Sample web page which I have tested
http://www.amazon.com/Sony-W800-Digital-Camera-Black/dp/B00I8BIBCW/ref=zg_bs_281052_1
I still don't understand your problem. Do you mean that when you open the extracted JSON in your browser, the page doesn't look like it does on amazon.com?
I extracted the whole HTML page source using raw html as the item type and "ProductInfo" as the field name, and then deployed the spider using scrapyd. From the scrapyd log I copied the content of "ProductInfo", which is the whole web page source, and saved it into a separate HTML file for cross-verification. When I open that HTML file in a browser, that is where I see the newline character in the browser view.
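One possible explanation, assuming the log prints each item as a Python repr (which is how string values normally appear when a dict is printed): the page's real newlines show up in the log as the two visible characters backslash and "n", and copying that text into a file keeps them literal. A minimal sketch of the effect and of one way to undo it; `page_source` and `unescape_logged` are made-up names for illustration only:

```python
import ast

# A tiny stand-in for the extracted page source, with real newlines in it.
page_source = "<html>\n<body>\n<p>Sony W800</p>\n</body>\n</html>"

# What a log line typically shows: the repr, where each newline is
# rendered as a backslash followed by the letter n.
logged = repr(page_source)


def unescape_logged(text):
    """Turn a Python string literal copied from a log back into the real string."""
    return ast.literal_eval(text)


# After unescaping, the newlines are real again and a browser treats
# them as ordinary whitespace instead of displaying them.
restored = unescape_logged(logged)
```

If this is what happened, exporting the items to a feed file and reading the field from there, rather than copying from the log, should avoid the escaped form altogether.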
Are you sure that your field type is actually HTML? I've tried to replicate your problem and everything works as expected.
Yes, I am sure. My field type is raw html, and I have also chosen the html region from Annotation Options. But the extracted field content does not even contain the html tag.
It gets the content inside the annotated tag, so there won't be an HTML tag anyway. Storing the full HTML can only be done with a middleware.
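As a rough idea of what such a middleware could look like, here is a minimal sketch of a Scrapy-style spider middleware that copies the full response text into each item. It assumes items are plain dicts and that the response exposes its decoded source as `response.text`; the class name `FullHtmlMiddleware` and the field name `full_page_html` are hypothetical, and this is not something Portia ships:

```python
from types import SimpleNamespace


class FullHtmlMiddleware:
    """Spider middleware sketch: attach the full page source to every scraped item."""

    # Hypothetical field name, chosen for this example only.
    field_name = "full_page_html"

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Assumption: items are plain dicts. Anything else (for example
            # follow-up requests) is passed through unchanged.
            if isinstance(entry, dict):
                entry[self.field_name] = response.text
            yield entry


# Stand-in for a real response so the sketch can be run without a crawl.
fake_response = SimpleNamespace(text="<html>\n<body>Sony W800</body>\n</html>")
items = list(
    FullHtmlMiddleware().process_spider_output(
        fake_response, [{"ProductInfo": "<p>Sony W800</p>"}], spider=None
    )
)
```

In a real project the class would be enabled through the `SPIDER_MIDDLEWARES` setting using its import path, which depends on where the module lives in your project.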
I am extracting the whole page source as HTML using raw html as the item type. But after extraction I get only partial HTML within the content, and the start and end tags are missing. Junk such as "\n" is also extracted, and the formatting of the website is not preserved. For example: