vctfence / scrapbee

Mozilla Public License 2.0

Capture of page can't finish #17

Closed Kerenok closed 4 years ago

Kerenok commented 5 years ago

Sometimes capturing a page never finishes and the popup stays at "Saving data...." (example: https://www.babelio.com/livres-/nature-writing/932 ): the first batch of images is saved and the rest are marked as "buffered", but their download doesn't seem to start.

The "index.html" is buffered last, so when I try to view the incompletely downloaded page, the top-level content is missing. Maybe "index.html" should be saved first (without checking the dependencies) and then saved again once all the dependencies are downloaded.
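A minimal sketch of that "save first, save again" idea, assuming the save and fetch steps can be passed in as functions. `capturePage`, `saveFile`, and `fetchResource` are hypothetical names for illustration, not ScrapBee's actual API:

```javascript
// Hypothetical two-phase save. `saveFile(name, data)` and `fetchResource(url)`
// are stand-ins supplied by the caller; they are not ScrapBee functions.
async function capturePage(indexHtml, resources, { saveFile, fetchResource }) {
  // Phase 1: write index.html right away so a stalled capture is still viewable.
  await saveFile("index.html", indexHtml);

  const failed = [];
  for (const { url, name } of resources) {
    try {
      const data = await fetchResource(url);
      await saveFile(name, data);
    } catch (err) {
      // Record the failure and continue; one bad resource should not block the rest.
      failed.push(url);
    }
  }

  // Phase 2: save index.html again once every dependency has been attempted.
  await saveFile("index.html", indexHtml);
  return failed;
}
```

The returned list of failed URLs would let the popup report what is missing instead of waiting forever.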

vctfence commented 5 years ago

Hi, I think this is normal for this kind of add-on when there's a network problem. Please try to make another capture to see what happens.

Maybe I should let users ignore certain resources by clicking a button, especially resources that are always broken.

Kerenok commented 5 years ago

I tried several times to capture this page under different network conditions (fiber or cellular, direct or VPN). The download stalled each time after the same number of downloaded images. If it's deterministic, the problem may not be network-related.

vctfence commented 5 years ago

Hi, I was able to capture this page successfully at least once, so I'm keeping my tentative opinion, though I should say it needs more observation. Thanks.

Kerenok commented 5 years ago

Hello,

I tried to investigate the problem by putting some traces in content_script.js and background.js.

All HTTP requests seem to run correctly and return blobs, but the Promise for saveData doesn't execute all the browser.runtime.sendMessage({type: 'SAVE_BLOB_ITEM', item: item}) calls. (The res.foreach iteration seems to stop at some point.)

In the console window there are some "too much recursion" errors for background.js among the POST http://localhost:9900/savebinfile logs.
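One way to check whether the iteration really stops early is to replace the `forEach` with an awaited loop. `forEach` does not wait for promises returned by its callback, so a rejected send can go unnoticed. The sketch below is only an assumption about how such a loop could look: `sendAllItems` is a hypothetical helper, and `sendMessage` stands in for `browser.runtime.sendMessage` so the code runs outside a browser.

```javascript
// `sendMessage` stands in for browser.runtime.sendMessage; it is passed in so
// the sketch can run outside an extension. `sendAllItems` is hypothetical.
async function sendAllItems(items, sendMessage) {
  const results = [];
  for (const item of items) {
    try {
      // Await each send so a failure is caught here instead of being lost.
      await sendMessage({ type: "SAVE_BLOB_ITEM", item });
      results.push({ item, ok: true });
    } catch (error) {
      // Keep going: record the error and move on to the next item.
      results.push({ item, ok: false, error });
    }
  }
  return results;
}
```

Logging the entries with `ok: false` would show which items fail and with which error, rather than leaving them "buffered".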

loge-gh commented 5 years ago

Hello,

Same here: https://habr.com/ru/post/406663/

| type | source | destination | status |
| --- | --- | --- | --- |
| image/jpeg | https://habrast...52cd747ad5.jpg | 763bb536a3e32c09ec215e563e89a1e3.jpeg | saved |
| image/jpeg | https://habrast...349279161.jpeg | 8bd33f95f7579186b6b444d35194de10.jpeg | saved |
| image/png | https://habrast...aqvxrehhfi.png | 175fd28d146c9f99c9fc5e07ba3120ab.png | saved |
| image/png | https://habrast...qzpm1cw88y.png | 3ccd3895d3ca2cb8ca7cc757029c6faf.png | saved |
| image/png | https://habrast...h2fchxfjgc.png | 3cef471ecde076ebc86ce1afbd195ba2.png | saved |
| image/png | https://habrast...l0tqawf-9o.png | 73501dc9d36a6bdab3d77f3c444db8bd.png | saved |
| image/jpeg | https://habrast...827919713.jpeg | b4835efc16a8e69c89ba2b3e06173a30.jpeg | saved |
| image/png | https://habrast...qir_e1qarg.png | 945be0f1918c1d2ad22e103a31b303f4.png | saved |
| image/png | https://habrast...h89mad29l8.png | 107f26d936899081f83122e4d41c55ad.png | saved |
| image/jpeg | https://habrast...p9vskva84.jpeg | d632b00060a3eda04bc69297bff6a751.jpeg | buffered |
| image/jpeg | https://habrast...033357911.jpeg | 26a3b334c5b400c495d4c8645a8f97c6.jpeg | saved |
| image/jpeg | https://habrast...424939534.jpeg | 0fc2a34f7575b30ceeef6a566b9c9b7b.jpeg | buffered |
| image/jpeg | https://habrast...669799237.jpeg | 7daf928a4374463bb0e9cc619a9a6af4.jpeg | saved |
| image/png | https://habrast...5355410557.png | 17cddfa92274a13239d33b27fc94572e.png | buffered |
| image/jpeg | https://habrast...881590625.jpeg | 411c2e09a94d05b04c2363f2d3e3fbf5.jpeg | saved |

...successful rows skipped...

| type | source | destination | status |
| --- | --- | --- | --- |
| text/html | https://photos....3bmK0SwvNPJlp1 | 3e90afa068d1dd8b0ee8f73d4aed840a.html | saved |
| image/jpeg | https://habrast...d4mcgg__s.jpeg | c50b65373a57afd42bae542598aaa605.jpeg | buffered |
| image/jpeg | https://habrast...jqmkmpdps.jpeg | 3a87a0681885b64a2c37147fc7a6bb5b.jpeg | buffered |
| image/jpeg | https://habrast...5xsqi_ku8.jpeg | 1c8f2fdb0713db6b1ff00e9ad3d613b1.jpeg | buffered |
| image/jpeg | https://habrast...owald7itg.jpeg | d1e7df0e9feba56eb9000bb7851f88b0.jpeg | buffered |
| image/png | https://habrast...b915244532.png | a1ee9e01ff62905958bef1250c3dcd3d.png | saved |
| image/jpeg | https://habrast...432098581c.jpg | 5184907b646ba0ce4541a10799ad93ec.jpeg | saved |
| image/x-icon | https://habr.com/favicon.ico | favicon.ico | saved |
| CSS | index.css | index.css | buffered |
| HTML | index.html | index.html | buffered |

I've noticed one thing: all the unsaved images are much bigger (between 1 and 3 MB) than the successfully saved ones (all under 1 MB).

https://upload.wikimedia.org/wikipedia/commons/f/ff/Pizigani_1367_Chart_10MB.jpg also cannot be captured

It looks like the add-on cannot save images bigger than 1 MB.

Kerenok commented 4 years ago

Thanks for pushing new versions, but with 1.10.0 the problem is still there. The URL from the original post fails consistently: the error seems deterministic.

[Screenshot: Scrapbee_bug]

zimonth commented 4 years ago

I have the same problem. I've tried several times, on different days, on this one page, and it always hung in the same way as in Kerenok's screenshot above. It usually hangs on different image files. When the page is reloaded in the browser, there don't seem to be any problems: everything is downloaded and rendered.

The page is on aliexpress.com, where I have the order list open and have selected "show 30 items at once" in the setting at the bottom: https://trade.aliexpress.com/orderList.htm

[Screenshot from 2020-01-22 09-40-49]

I've managed to capture the page when only 10 or 20 items are visible on the web page, but 30 always fails.

Kerenok commented 4 years ago

@zimonth The URL you provided can't be tested as it requires a subscription to the web site.

Maybe you could try the following public URL, https://www.babelio.com/livres-/nature-writing/932 , to check that you're experiencing the same problem (more than a third of the images are buffered and never downloaded by Scrapbee).

zimonth commented 4 years ago

Yes, it hangs in the same way. In the aliexpress case, however, I suspect the reason is not the size of the images but the large number of images to be downloaded.

[Screenshot from 2020-01-28 14-41-58]

Kerenok commented 4 years ago

The behaviour seems to be quite deterministic. When I investigated the source code, it appeared to be related to the exhaustion of a pool of contexts (maybe linked to the asynchronous nature of JavaScript). As the GitHub repository is not updated, I stopped trying to fix an outdated version of Scrapbee!
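If the stall does come from too many operations being in flight at once, a common mitigation is to cap concurrency. The following is a generic sketch of that technique, not ScrapBee's code: `runWithLimit` is a hypothetical helper, and each task is assumed to be a function returning a promise (for example, one download).

```javascript
// Generic bounded-concurrency runner (hypothetical helper, not from ScrapBee).
// `tasks` is an array of functions that each return a promise.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task to start

  async function worker() {
    // Each worker pulls the next unstarted task until none remain.
    while (next < tasks.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await tasks[i]() };
      } catch (error) {
        // A failed task is recorded; the worker carries on with the next one.
        results[i] = { ok: false, error };
      }
    }
  }

  // Start at most `limit` workers and wait for all of them to drain the queue.
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

With a small limit, a page with many images would be processed a few resources at a time, and a single failing resource would show up as an `ok: false` entry instead of blocking the queue.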

Kerenok commented 4 years ago

Bug still there after updating to version 1.10.6

vctfence commented 4 years ago

I've been trying some changes recently to make it work better; I hope this helps you guys.

Kerenok commented 4 years ago

Great progress: it's now OK with https://www.babelio.com/livres-/nature-writing/932 . Many thanks for improving an already great piece of software!

Kerenok commented 4 years ago

Progress on this issue confirmed in my environment.

There is a remaining error on a new site, http://randochartreuse.free.fr/mobac2.x/index.htm , but it may not be the same problem, as the scraping stalls even when there is no image to download (for example, when saving a one-word selection). It stalls while trying to download index.css and index.html, which (as I understand it) are not files from the remote site but are generated by Scrapbee.

Kerenok commented 4 years ago

Same problem with index.css and index.html on a site hosted on github.io https://billw2.github.io/pikrellcam/pikrellcam.html

vctfence commented 4 years ago

Hi @Kerenok, thank you for reporting. Please try v1.11.11.

Kerenok commented 4 years ago

Great work: the problem is fixed for the last sites. This issue can be closed as far as I'm concerned. Thanks!

There is a remaining, unrelated issue on https://billw2.github.io/pikrellcam/pikrellcam.html : the background image specified in the <body> tag is not downloaded. Should I open a separate issue for that?

vctfence commented 4 years ago

Glad to know. Yes, let's open another issue.