laiso / site2pdf

Generate comprehensive PDFs of entire websites, ideal for RAG.
MIT License
168 stars, 8 forks

Prevent Crawling Duplicate URLs with and without Trailing Slashes #4

Closed: alea12 closed this issue 3 months ago

alea12 commented 3 months ago

Thank you for the work! I've tested it with my website and it fits my needs well. One issue I ran into is that the tool crawls the same URL twice, once with and once without a trailing slash (/).

This could be addressed by updating the conditions of uniqueSubLinks. If this looks good to you, I could submit a PR.

https://github.com/laiso/site2pdf/blob/c3385864ce69c6cac66d6160751cbeff2d73e71a/index.ts#L30-L39

laiso commented 3 months ago

@alea12 URL normalization is a great approach. I have been interested in this as well. Please go ahead and create a pull request. Thank you.
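
For readers following the thread, the URL normalization idea discussed above could look roughly like the sketch below. It is not the code from site2pdf's index.ts or from the eventual pull request: `normalizeUrl` and `dedupeLinks` are hypothetical helper names, and the example.com links are made up for illustration. The only library call used is the standard WHATWG `URL` class, which is available globally in Node.js and browsers.

```typescript
// Hypothetical helper (not from site2pdf): map a URL to a canonical form
// so that "/docs" and "/docs/" compare as equal.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  // Strip trailing slashes from the path, but leave the root path "/" alone.
  if (url.pathname.length > 1) {
    url.pathname = url.pathname.replace(/\/+$/, "");
  }
  return url.toString();
}

// Hypothetical helper (not from site2pdf): keep the first occurrence of each
// normalized URL, preserving the original order of the links.
function dedupeLinks(links: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const link of links) {
    const key = normalizeUrl(link);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}

// Example input: the first two links differ only by a trailing slash.
const exampleLinks = [
  "https://example.com/docs/",
  "https://example.com/docs",
  "https://example.com/",
];
const uniqueLinks = dedupeLinks(exampleLinks);
console.log(uniqueLinks);
```

With this input, the two `/docs` variants collapse into a single entry and the root URL is kept as-is. One caveat on the design: some servers serve different content for a path with and without a trailing slash, so a crawler that normalizes this way assumes the target site treats the two forms as the same page.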