scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License
9.31k stars 1.4k forks

Junks in raw html #183

Closed drprabhakar closed 9 years ago

drprabhakar commented 9 years ago

I am extracting the whole page source as HTML, using raw html as the item type. After extraction, however, I get only partial HTML tags within the content, and the start and end tags are missing. Junk such as "\n" is extracted, and the formatting of the website is not preserved. Example:

            </ul>\n\n\n\n\n\n\n\n\n<strong>Options:</strong>

ruairif commented 9 years ago

It's possible that the page doesn't have matching tags and that a browser adds them automatically. The \n character is just a newline character, which more than likely exists within the page.
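
To illustrate the point about \n, here is a minimal sketch. It assumes the extracted value is being viewed in a serialized form (JSON output or a Python repr, as typically seen in logs); the fragment and the `ProductInfo` key are only illustrative. In that form each real newline is displayed as the two characters backslash and n, and decoding the JSON restores the actual newlines.

```python
import json

# Illustrative fragment containing two real newline characters.
fragment = "</ul>\n\n<strong>Options:</strong>"

# Serialized forms show each newline as the two characters '\' and 'n'.
as_json = json.dumps({"ProductInfo": fragment})
as_repr = repr(fragment)

# Decoding the JSON gives back the original string with real newlines.
decoded = json.loads(as_json)["ProductInfo"]

print(as_json)
print(as_repr)
print(decoded == fragment)
```

If the escaped text is pasted directly into an .html file, the browser will render the literal backslash-n pairs; writing the decoded string to the file instead avoids that.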

drprabhakar commented 9 years ago

When I view the extracted raw HTML, the newline character "\n" is still visible in the browser view as well. Below is the browser view: (screenshot: issue)

ruairif commented 9 years ago

What's the web page so that I can test it?

drprabhakar commented 9 years ago

Sample web page which I have tested

http://www.amazon.com/Sony-W800-Digital-Camera-Black/dp/B00I8BIBCW/ref=zg_bs_281052_1

ruairif commented 9 years ago

I still don't understand your problem. Do you mean that when you open the extracted JSON in your browser, the page doesn't look like it does on amazon.com?

drprabhakar commented 9 years ago

I extracted the whole HTML page source using raw html as the item type and "ProductInfo" as the field name, and then deployed the spider using scrapyd. From the scrapyd log I selected the content of "ProductInfo", which is the whole web page source, and saved it into a separate HTML file for cross-verification. When I opened that HTML file in a browser, that is where I saw the newline characters in the browser view.

ruairif commented 9 years ago

Are you sure that your field type is actually HTML? I've tried to replicate your problem and everything works as expected.

drprabhakar commented 9 years ago

Yes, I am sure. My field type is raw html, and I have chosen the html region from Annotation Options. But the extracted field content does not even contain the html tag.

ruairif commented 9 years ago

It gets the content inside the tag, so there won't be an HTML tag anyway. Storing the full HTML can only be done with a middleware.
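
As a rough idea of what such a middleware could look like, here is a minimal sketch of a Scrapy spider middleware that copies the full page source onto each scraped item. It is not taken from Portia itself. The class name and the `raw_html` field name are hypothetical, it assumes items are plain dicts, and it assumes a Scrapy version where the response exposes a `text` attribute.

```python
class RawHtmlSpiderMiddleware:
    """Sketch: attach the full page source to every dict item a spider yields."""

    # Hypothetical field name; pick whatever fits your item schema.
    field_name = "raw_html"

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Only plain dict items are modified (an assumption of this sketch);
            # requests and any other outputs pass through unchanged.
            if isinstance(entry, dict):
                entry[self.field_name] = response.text
            yield entry
```

A class like this would be enabled through the `SPIDER_MIDDLEWARES` setting of the project, using the module path where the class actually lives.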