timf34 / Substack2Markdown

Download/export free and premium Substack posts, saving them as Markdown files. Also generates HTML interfaces that let you browse and sort the Markdown files for each author.
MIT License
135 stars, 24 forks

Can't download premium files #19

Closed MislavSag closed 1 month ago

MislavSag commented 1 month ago

I have installed the package, installed the requirements in a new virtual environment, set the configs, and ran:

python substack_scraper.py --url https://substack.com/@vertox --directory F:/substack/vertox --premium

I get an error:

Error fetching sitemap at https://substack.com/@vertox/sitemap.xml: 404
Falling back to feed.xml. This will only contain up to the 22 most recent posts.
Error fetching feed at https://substack.com/@vertox/feed.xml: 404

DevTools listening on ws://127.0.0.1:64871/devtools/browser/9f57c8cb-cd60-4d0a-9923-51fa6b49fc74
[2620:37164:1017/135248.723:ERROR:edge_auth_errors.cc(523)] EDGE_IDENTITY: Get Default OS Account failed: Error: Primary Error: kTokenRequestFailed, Secondary Error: kTokenFetchUserInteractionRequired, Platform error: -2146893042, hex:8009030e, Error string: Error code: 0x8009030e, error message:Error

0it [00:00, ?it/s]

After the first error, the browser opened, went to the Substack home page, and then just exited.

timf34 commented 1 month ago

That's the user's account, not their Substack. This is their Substack: https://www.vertoxquant.com/?utm_source=substack&utm_medium=web&utm_campaign=substack_profile
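For readers unsure which kind of URL they have, the distinction above can be sketched in Python. The helper name `is_profile_url` is hypothetical and is not part of Substack2Markdown. It assumes profile pages always take the form `substack.com/@handle`, as in the URL from the first comment.

```python
from urllib.parse import urlparse


def is_profile_url(url: str) -> bool:
    """Return True if the URL looks like a Substack user profile.

    Assumption: profile pages live at substack.com/@handle, while
    publications live on their own host (e.g. www.vertoxquant.com).
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host in ("substack.com", "www.substack.com") and parsed.path.startswith("/@")
```

A check like this could warn before any scraping starts, since a profile URL has no sitemap.xml or feed.xml of its own in the run shown above.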

MislavSag commented 1 month ago

Thanks, I didn't know which URL to set. I changed that but still get an error.

Command:

python substack_scraper.py --url 'https://www.vertoxquant.com/?utm_source=substack&utm_medium=web&utm_campaign=substack_profile' --directory F:/substack/vertox

The error:

Created md directory F:/substack/vertox/vertoxquant
Created html directory substack_html_pages/vertoxquant
Traceback (most recent call last):
  File "substack_scraper.py", line 539, in <module>
    main()
  File "substack_scraper.py", line 513, in main
    scraper = SubstackScraper(
  File "substack_scraper.py", line 345, in __init__
    super().__init__(base_substack_url, md_save_dir, html_save_dir)
  File "substack_scraper.py", line 89, in __init__
    self.post_urls: List[str] = self.get_all_post_urls()
  File "substack_scraper.py", line 95, in get_all_post_urls
    urls = self.fetch_urls_from_sitemap()
  File "substack_scraper.py", line 111, in fetch_urls_from_sitemap
    root = ET.fromstring(response.content)
  File "C:\Users\Mislav\AppData\Local\Programs\Python\Python38-32\lib\xml\etree\ElementTree.py", line 1320, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 39, column 33
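The ParseError means the response body handed to `ET.fromstring` was not well-formed XML. One plausible cause, which is an assumption and not confirmed anywhere in this thread, is that the query string ended up inside the sitemap URL and the server answered with an HTML page instead. A minimal sketch of tolerating that case follows. The function name `parse_sitemap` is hypothetical and is not the scraper's actual code.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_sitemap(content: bytes) -> Optional[ET.Element]:
    """Parse sitemap bytes, returning None when they are not well-formed XML.

    Hypothetical helper: it only shows that ET.fromstring raises
    ET.ParseError on non-XML input such as an HTML page.
    """
    try:
        return ET.fromstring(content)
    except ET.ParseError:
        # The caller can fall back to another source (e.g. feed.xml) here.
        return None
```

Returning `None` instead of raising lets the caller decide whether to fall back or to report a clearer message about the URL.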

timf34 commented 1 month ago

Sorry, I don't have time to look at this right now, but try giving it the URL without all the extra stuff at the end.

timf34 commented 1 month ago

Like just https://www.vertoxquant.com/

Let me know how you get on
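For anyone hitting the same error, trimming the URL as suggested can also be done programmatically. This is a minimal sketch using only the standard library. The function name `strip_query` is hypothetical and is not something the scraper provides.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    # Use "/" when the path is empty so the result always ends in a path.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Passing the long URL from the earlier comment through this function yields the short form, which can then be given to `--url`.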

MislavSag commented 1 month ago

It works with the short URL.

Thanks a lot.

timf34 commented 1 month ago

Glad to hear it. Give the project a star if you found it useful!