internetarchive / Zeno

State-of-the-art web crawler 🔱
GNU Affero General Public License v3.0

Replace github.com/tomnomnom/linkheader with stdlib #85

Closed CorentinB closed 4 months ago

CorentinB commented 4 months ago

github.com/tomnomnom/linkheader is being used to parse Link HTTP headers, pretty sure we can replace it with standard lib-only code.

Whatever code we add, we need unit tests for it (we don't have unit tests for that right now).
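For context, an RFC 8288 Link header looks like `<https://example.com/page/2>; rel="next", <https://example.com/page/34>; rel="last"`. A minimal stdlib-only sketch of what such a replacement could look like (the `parseLinkHeader` function is hypothetical, not the code eventually merged into Zeno, and it ignores edge cases like commas inside quoted parameters):

```go
package main

import (
	"fmt"
	"strings"
)

// parseLinkHeader extracts the <...> target URLs from a Link header value.
// Simplification: it splits entries on commas and keeps only the URL part,
// discarding parameters such as rel="next".
func parseLinkHeader(value string) []string {
	var urls []string
	for _, part := range strings.Split(value, ",") {
		part = strings.TrimSpace(part)
		start := strings.Index(part, "<")
		end := strings.Index(part, ">")
		if start == -1 || end == -1 || end < start {
			continue // skip malformed entries
		}
		urls = append(urls, part[start+1:end])
	}
	return urls
}

func main() {
	header := `<https://example.com/page/2>; rel="next", <https://example.com/page/34>; rel="last"`
	for _, u := range parseLinkHeader(header) {
		fmt.Println(u) // prints the two target URLs, one per line
	}
}
```

A real replacement would also need to handle quoted parameter values, which is where unit tests come in.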

HarshNarayanJha commented 4 months ago

I would like to work on this issue. How can I check if the crawling is working after the change?

Is it `go run . get url https://some.url`?

Whatever URL I try I get

```
panic: runtime error: slice bounds out of range [:7] with length 6
```
nick2432 commented 4 months ago

Can I work on this?

CorentinB commented 4 months ago

Hey guys, I like the enthusiasm a lot, but I wonder, where are you guys coming from? We didn't have any contributors for years and out of the blue you are all popping up :) Where did you learn about Zeno?

@HarshNarayanJha I'm looking into that panic right now..

HarshNarayanJha commented 4 months ago

😆 Honestly I just went to goodfirstissues and found this repo. Interestingly, this repo is owned by archive.org (lifesaver sometimes) so I decided to contribute.

I had never known anything about Zeno before or even that it existed!

CorentinB commented 4 months ago

> 😆 Honestly I just went to goodfirstissues and found this repo. Intestingly this repo is owned by archive.org (lifesaver sometimes) so I decided to contribute
>
> I had never known anything about Zeno before or even it existed!

Sounds good! Well, if you want to tackle that one, that's fine with me. Be aware that we would need unit tests for whatever function you might add to replace the behavior of that library we are using.

I am assigning this issue to you.

(got distracted but I'm coming back to your panic asap..)

CorentinB commented 4 months ago

@HarshNarayanJha the panic is fixed: https://github.com/internetarchive/Zeno/pull/91

HarshNarayanJha commented 4 months ago

Wow! That was quick! So it was a minor hash/version slicing issue after all! I will update this tomorrow (9PM, gotta sleep soon)

HarshNarayanJha commented 4 months ago

@CorentinB Any example URLs which have the Link header? I tried google.com, archive.org and even example.com, but they don't have the Link header.

```go
var (
    links      = linkheader.Parse(resp.Header.Get("link")) // <---- this right here
    discovered []string
)

for _, link := range links {
    discovered = append(discovered, link.URL)
}
```

It seems to me that it is a list of related URLs for the page. I can probably implement the parsing and test it with an empty list, but a live example would still be helpful.

CorentinB commented 4 months ago

> @CorentinB Any Example URLs which have the link header? I tried google.com, archive.org and even example.com, but they don't have the link header
>
> ```go
> var (
>     links      = linkheader.Parse(resp.Header.Get("link")) // <---- this right here
>     discovered []string
> )
>
> for _, link := range links {
>     discovered = append(discovered, link.URL)
> }
> ```
>
> It seems to me that it is a list of related URLs for the page. I probably can implement the parsing with the empty list, but it would be helpful otherwise

I think you could try by writing a unit test first, so you would have a response with a link header!

HarshNarayanJha commented 4 months ago

You can check it now #95