nasa-jpl-memex / memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data
BSD 2-Clause "Simplified" License

Nutch fetched URL counts are not being updated #732

Closed (ahmadia closed 8 years ago)

ahmadia commented 8 years ago

This needs a discussion with Brittain. I don't think it's hard, and it would be pretty useful. It would also be good to chat with the Nutch folks and ask what other stats are available and make sense to put on the dashboard while we're tweaking it.

ahmadia commented 8 years ago

Here's what's available from a "stats" call on the crawldb. I need to expose this in nutch-python, spin out a new build, then land a patch here.

  {
      "retry 0":"8350",
      "minScore":"0.0",
      "retry 1":"96",
      "status":{
          "3":{"count":"21","statusValue":"db_gone"},
          "2":{"count":"594","statusValue":"db_fetched"},
          "1":{"count":"7721","statusValue":"db_unfetched"},
          "5":{"count":"86","statusValue":"db_redir_perm"},
          "4":{"count":"24","statusValue":"db_redir_temp"}
      },
      "totalUrls":"8446",
      "maxScore":"0.528",
      "avgScore":"0.029593771"
  }
  }
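
For reference, a minimal sketch of how the fetched count could be pulled out of a payload shaped like the one above. The helper names `status_counts` and `fetched_count` are hypothetical and are not part of nutch-python. The sketch only assumes the JSON layout shown here, where every count arrives as a string.

```python
import json

# Sample payload copied from the crawldb "stats" output above.
SAMPLE_STATS = json.loads("""
{
    "retry 0": "8350",
    "minScore": "0.0",
    "retry 1": "96",
    "status": {
        "3": {"count": "21", "statusValue": "db_gone"},
        "2": {"count": "594", "statusValue": "db_fetched"},
        "1": {"count": "7721", "statusValue": "db_unfetched"},
        "5": {"count": "86", "statusValue": "db_redir_perm"},
        "4": {"count": "24", "statusValue": "db_redir_temp"}
    },
    "totalUrls": "8446",
    "maxScore": "0.528",
    "avgScore": "0.029593771"
}
""")


def status_counts(stats):
    """Hypothetical helper: map each statusValue to an integer count."""
    return {
        entry["statusValue"]: int(entry["count"])  # counts are strings in the payload
        for entry in stats.get("status", {}).values()
    }


def fetched_count(stats):
    """Hypothetical helper: number of db_fetched URLs, or 0 if absent."""
    return status_counts(stats).get("db_fetched", 0)


if __name__ == "__main__":
    counts = status_counts(SAMPLE_STATS)
    print("fetched:", fetched_count(SAMPLE_STATS))
    # In this sample the per-status counts add up to totalUrls.
    print("sums to totalUrls:", sum(counts.values()) == int(SAMPLE_STATS["totalUrls"]))
```

Keying on `statusValue` rather than the numeric status id keeps the dashboard code readable, and defaulting to 0 avoids a KeyError on a crawl that has not fetched anything yet.
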

ahmadia commented 8 years ago

https://issues.apache.org/jira/browse/NUTCH-2154

ahmadia commented 8 years ago

There's an easy workaround for this in nutch-python, so I'm proceeding.

ahmadia commented 8 years ago

Fixed in https://github.com/memex-explorer/memex-explorer/commit/2b9114da438fde89ba9a921dede244d4b8b9d764