vctfence / scrapbee

Mozilla Public License 2.0

Problem capturing PDF page #19

Closed: Kerenok closed 5 years ago

Kerenok commented 5 years ago

Scrapbook can successfully capture PDF pages (for example, https://www.isprs.org/proceedings/XXXVI/5-C53/papers/FPL010.pdf ). In this situation, maybe Scrapbee should just send the raw file to the backend without trying to execute any script on the page (function scriptsAllowed).

vctfence commented 5 years ago

Hi, the problem is how to know that the content of a tab is a PDF, and how to fetch the PDF. I need to find out if there's an available way.

The reason is that the page may be the response to a POST request with required parameters, or it may need cookies, so I need to check/fetch the content that is currently loaded.

As far as I know, we cannot fetch anything about tab content with the tab APIs. That's why a content script is needed.
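For the "is it a PDF" half of the question, one option is to inspect bytes that have already been obtained rather than trust the URL. The sketch below checks for the `%PDF-` signature that PDF files start with. The function name `looksLikePdf` is hypothetical and not part of Scrapbee, and the sketch does not address how to get the bytes of the loaded tab in the first place.

```typescript
// Hypothetical helper, not part of Scrapbee.
// Returns true when the buffer starts with the PDF signature "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII codes for "%PDF-"
  if (bytes.length < magic.length) {
    return false;
  }
  // Compare the leading bytes against the signature one by one.
  return magic.every((expected, i) => bytes[i] === expected);
}
```

A check like this only confirms what was fetched; a page that was served in response to a POST request would still need to be re-requested with the same parameters.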

Kerenok commented 5 years ago

Maybe a HEAD request would give access to the Content-Type of the page, so it could be saved accordingly?
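A minimal sketch of that idea, assuming the URL can be re-requested from the extension: send a HEAD request and look at the `Content-Type` header. The names `isPdfContentType`, `probeIsPdf`, and `FetchLike` are hypothetical and not taken from Scrapbee. The fetch function is passed in as a parameter so the header logic can be exercised without a network.

```typescript
// Hypothetical types and helpers, not part of Scrapbee.
type FetchLike = (
  url: string,
  init?: { method?: string; credentials?: "include" | "omit" | "same-origin" }
) => Promise<{ ok: boolean; headers: { get(name: string): string | null } }>;

// Returns true when a Content-Type header value denotes a PDF.
function isPdfContentType(value: string | null): boolean {
  if (!value) {
    return false;
  }
  // Drop parameters such as "; charset=binary" and normalize case.
  const mediaType = value.split(";")[0].trim().toLowerCase();
  return mediaType === "application/pdf";
}

// Sends a HEAD request (with cookies) and reports whether the server
// declares the resource to be a PDF.
async function probeIsPdf(url: string, fetchImpl: FetchLike): Promise<boolean> {
  const response = await fetchImpl(url, { method: "HEAD", credentials: "include" });
  return response.ok && isPdfContentType(response.headers.get("Content-Type"));
}
```

In an extension, the global `fetch` could be passed as `fetchImpl`. The result only describes what the server returns for a fresh request, which may differ from what is currently shown in the tab.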

vctfence commented 5 years ago

This should work, but not for all scenarios, I think, because the response does not always equal the content. This is true even for an ordinary HTML page: the content may be changed by JS after the HTML has loaded.

vctfence commented 5 years ago

For now, I suggest downloading the PDF manually, opening it with Firefox, and capturing it as a bookmark.