gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

bad sitemap should not return 0 #253

Open valentinedwv opened 5 months ago

valentinedwv commented 5 months ago

Although the error was logged, the gleaner container returned success, so the scheduler (Dagster) failed to detect the failure.

Something is wrong with neotomadb: there is no sitemap, and the URL returned an HTML page instead.

{"file":"/home/runner/work/gleaner/gleaner/internal/summoner/acquire/resources.go:134","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getSitemapURLList","level":"error","msg":"Error reading sitemap at:http://data.neotomadb.org/sitemap.xmlXML syntax error on line 9: attribute name without = in element","time":"2024-04-01T21:11:26Z"}
{"file":"/home/runner/work/gleaner/gleaner/internal/summoner/acquire/resources.go:75","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.ResourceURLs","level":"error","msg":"Error getting sitemap urls for: neotomadbXML syntax error on line 9: attribute name without = in element","time":"2024-04-01T21:11:26Z"}

SourceStats:
  Start: 2024-04-01 21:11:26.856354745 +0000 UTC m=+0.339619903
  End: 2024-04-01 21:11:26.859096773 +0000 UTC m=+0.342361934
  Soruce:
    - name: neotomadb
      SitemapHttpError: 0 
      SitemapIssues: 0 
      SitemapSummoned: 0 
      SitemapCount: 0 
RunStats:
  Start: 2024-04-01 21:11:26.612979146 +0000 UTC m=+0.096244289
  Reason: Complete
  Soruce:
    - name: neotomadb
      Start: 2024-04-01 21:11:26.856354745 +0000 UTC m=+0.339619903
      End: 2024-04-01 21:11:26.85918183 +0000 UTC m=+0.342446984
      SitemapCount: 0 
      SitemapHttpError: 0 
      SitemapIssues: 0 
      SitemapSummoned: 0 

valentinedwv commented 1 month ago

Hitting this again with an r2r sitemap change. Dagster keeps running because the step completes fine.