w3c / spec-generator

Service to automatically generate specs from various source formats
MIT License

Spec generator produces garbled text #537

Closed. xfq closed this issue 2 years ago

xfq commented 2 years ago

The output at https://labs.w3.org/spec-generator/?type=respec&url=https%3A%2F%2Fraw.githubusercontent.com%2Fw3c%2Fclreq%2Fgh-pages%2Findex.html contains garbled text such as � �. This affects the PR preview function (see https://github.com/w3c/clreq/pull/455 for an example) and seriously disrupts the group participants' daily work.

/cc @deniak

deniak commented 2 years ago

It looks like a bug (https://github.com/website-scraper/node-website-scraper/issues/454) in the module we use to scrape the spec. Unfortunately, that issue has been open for a few months, so we might need to find an alternative.

xfq commented 2 years ago

What about something like Scrapy and node-crawler?

deniak commented 2 years ago

I'd rather stick to a node module if possible. I'll give node-crawler a try. Hopefully I can send a PR by tomorrow.

deniak commented 2 years ago

It was surprisingly difficult to find a good module that downloads the relative resources of an HTML document when that document is not served with the right content type. I ended up parsing the document myself and downloading every href/src target that is relative to the document. That should be good enough to generate the right snapshot!
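
For readers who want a picture of what "parsing the document myself" could look like, here is a minimal sketch in TypeScript. It is not the code that went into spec-generator: the function name `collectRelativeResources` is hypothetical, the regular-expression scan is a simplification (a real implementation would more likely use an HTML parser), and it only covers quoted `href`/`src` attributes. It assumes a runtime with the standard `URL` class, such as Node.

```typescript
// Hypothetical helper, for illustration only: list the relative href/src
// targets of an HTML document, resolved against the document's own URL.
function collectRelativeResources(html: string, documentUrl: string): string[] {
  // Naive scan for quoted href/src attributes preceded by whitespace.
  const attrPattern = /\s(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  const resolved = new Set<string>();
  let match: RegExpExecArray | null;

  while ((match = attrPattern.exec(html)) !== null) {
    const value = (match[1] ?? match[2] ?? "").trim();

    // Skip empty values, same-page fragments, and protocol-relative URLs.
    if (value === "" || value.startsWith("#") || value.startsWith("//")) {
      continue;
    }
    // Skip anything that already carries a scheme (https:, data:, mailto:, ...).
    if (/^[a-z][a-z0-9+.-]*:/i.test(value)) {
      continue;
    }

    // Resolve the relative reference against the document URL and drop
    // the fragment, so each resource is listed only once.
    const url = new URL(value, documentUrl);
    url.hash = "";
    resolved.add(url.href);
  }

  return Array.from(resolved);
}
```

Each returned URL could then be fetched and saved next to the HTML file before the snapshot is generated. Because the list is built from the markup itself, it does not depend on the content type the server reports for the document.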

xfq commented 2 years ago

I can confirm that everything works now. Thank you!