marty1885 / tlgs

"Totally Legit" Gemini Search - Open source search engine for the Gemini protocol
https://tlgs.one
MIT License

Crawler misparses URIs #3

Closed spc476 closed 2 years ago

spc476 commented 2 years ago

I'm noticing that your crawler is not parsing URIs properly. Links that carry their own scheme (mailto:, javascript:, news:) appear to be treated as relative paths, which results in requests like gemini://gemini.conman.org/boston/2015/04/05-05/mailto:yjonjens@mail.com, gemini://gemini.conman.org/boston/2002/03/javascript:addSidebarPanel() or gemini://gemini.conman.org/boston/2007/05/news:alt.society.generation-x.
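
For illustration, here is a minimal Python sketch of one way a crawler could tell the two cases apart. It is not the tlgs implementation; `resolve_link` and `SCHEME_RE` are hypothetical names, and the base URL used in the usage note is assumed. The idea is to check for an RFC 3986 scheme prefix first, and only resolve the link against the page URL when there is none. Note that Python's `urljoin` does not resolve relative references against schemes it does not know, so the sketch borrows `http` for the resolution step and restores `gemini` afterwards.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".", then ":"
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def resolve_link(base, link):
    """Return an absolute gemini:// URL for `link`, or None if it should be skipped.

    Hypothetical helper for illustration only.
    """
    link = link.strip()
    if SCHEME_RE.match(link):
        # The link already has a scheme, so it is absolute.
        # Keep gemini links as they are and skip everything else
        # (mailto:, javascript:, news:, http:, ...).
        return link if link.lower().startswith("gemini://") else None

    parts = urlsplit(base)
    if parts.scheme.lower() != "gemini":
        return None

    # Relative reference: resolve it against the page URL.
    # urljoin only handles schemes it knows, so swap in "http" temporarily.
    http_base = urlunsplit(("http",) + tuple(parts[1:]))
    joined = urlsplit(urljoin(http_base, link))
    return urlunsplit(("gemini",) + tuple(joined[1:]))
```

With an assumed base of `gemini://gemini.conman.org/boston/2015/04/05-05/`, the link `mailto:yjonjens@mail.com` is skipped (the function returns `None`), while a relative link such as `../06.1` resolves to `gemini://gemini.conman.org/boston/2015/04/06.1`.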

marty1885 commented 2 years ago

Ohh... thanks for reporting! I'll patch them before the next crawl.

Thanks a lot!

marty1885 commented 2 years ago

b5a7ed8c828f864afa2c61cb4ed48f7f1137ff1d and 8a2dc020d1ed6245c1599edfbe92251b687867c4 add code to test for these cases and avoid them, as well as unit tests to make sure the avoidance works. I'll close the issue now. Let me know if the issue persists.