save "complete webpage" page.html and /page

rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.

https://rajatomar788.github.io/pywebcopy/

save "complete webpage" page.html and /page #49

Closed MajdMustapha closed 4 years ago

MajdMustapha commented 4 years ago

Hello,

Good job on this project. I was wondering whether it is possible to save the HTML page as "page.html" and put all of its assets (JS, CSS, photos, etc.) under one folder with a matching name, e.g. "page_files".

For a given URL, https://www.example.com/page.html, can I get this as output?

example.com
|--- page_files   (folder containing all assets: js, css, ...; can be many folders as well)
|--- page.html

Thank you in advance
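For illustration, the layout requested above can be derived from a URL with a few lines of standard-library Python. This is only a sketch of the naming scheme: `requested_layout` is a hypothetical helper, not part of pywebcopy, and stripping the leading "www." is an assumption taken from the example in the request.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def requested_layout(url):
    """Return (site_dir, html_file, assets_dir) for the requested layout.

    Hypothetical helper for illustration only; pywebcopy does not provide it.
    """
    parts = urlsplit(url)
    host = parts.netloc
    # Assumption: drop a leading "www." so the folder is named "example.com".
    if host.startswith("www."):
        host = host[len("www."):]
    # Use the last path segment as the page name, falling back to "index.html".
    name = PurePosixPath(parts.path).name or "index.html"
    stem = PurePosixPath(name).stem
    site_dir = PurePosixPath(host)
    return site_dir, site_dir / name, site_dir / f"{stem}_files"


site_dir, html_file, assets_dir = requested_layout("https://www.example.com/page.html")
print(site_dir, html_file, assets_dir)
```

The helper only computes paths; it does not download anything or rewrite links inside the saved page.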

rajatomar788 commented 4 years ago

Hey,

pywebcopy tries to recreate the exact folder structure of the target site, so modifying even the slightest path fragment could scramble the files in error-prone ways. It is therefore not recommended. If you are desperate, you could subclass the URLTransformer class found in the urls.py module, but many errors later you would probably just drop the idea.
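One way to see why flattening is error-prone: files in different directories of a site can share a base name, so a single assets folder needs a renaming rule, and every reference in the saved HTML and CSS must then follow that rule. The sketch below is a standalone illustration, assuming a hypothetical `flatten_assets` helper and a "page_files" folder name; it does not use pywebcopy or its URLTransformer class.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def flatten_assets(asset_urls, assets_dir="page_files"):
    """Map each asset URL to a unique file name inside one flat folder.

    Hypothetical helper for illustration only; not pywebcopy code.
    """
    mapping = {}
    used = set()
    for url in asset_urls:
        if url in mapping:
            continue  # the same URL always maps to the same file
        name = PurePosixPath(urlsplit(url).path).name or "asset"
        stem = PurePosixPath(name).stem
        suffix = PurePosixPath(name).suffix
        candidate, n = name, 1
        # Different directories on the site can hold files with the same name,
        # so append a counter until the flattened name is unique.
        while candidate in used:
            candidate = f"{stem}_{n}{suffix}"
            n += 1
        used.add(candidate)
        mapping[url] = f"{assets_dir}/{candidate}"
    return mapping


urls = [
    "https://example.com/css/main.css",
    "https://example.com/vendor/css/main.css",
    "https://example.com/js/app.js",
]
for url, path in flatten_assets(urls).items():
    print(url, "->", path)
```

Here the two `main.css` files end up under different names, which is exactly the kind of path change that every link in the saved page would have to be rewritten to match.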

MajdMustapha commented 4 years ago

Thank you for the prompt response. I've seen it done here, and I have no clue how they did it. Any thoughts?
I was hoping pywebcopy could help, but I understand that this can be hard.

rajatomar788 commented 4 years ago

The library you linked is completely different from pywebcopy, so I would suggest refactoring your code to account for the folder structure.

rajatomar788 commented 4 years ago

I am closing this issue as this is not fully related to pywebcopy.

tybug commented 2 years ago

Will this be possible in pywebcopy7, or will you still not support it? I see that pywebcopy7 has introduced a new 'tree_type' config with one of the values being LINEAR - is this option related to this issue at all? I did try saving a webpage with this option in pywebcopy7, but it didn't produce any different results than HIERARCHY.

For the record, I'd also very much like to see this, as I only care about one-off downloads of a single webpage and nothing more. I'm willing to play around a bit and see if I can implement it myself, if you can point me in the right direction.

rajatomar788 commented 2 years ago

Hey @tybug, the tree_type variable is indeed an attempt in this direction. As of now there are only two modes, 'LINEAR' and 'HIERARCHY', which make no difference for a single webpage but do have an effect when used in crawls.

I will try to modify the behavior of LINEAR to match this layout, or I can introduce a third option.