metacpan / metacpan-monitoring

Monitoring metacpan

Need test to check sitemaps are available #4

Closed ranguard closed 6 years ago

ranguard commented 6 years ago

https://metacpan.org/sitemap-authors.xml.gz
https://metacpan.org/sitemap-releases.xml.gz

Initially, just check that they are there. Possibly also check that they are valid sitemap formats. Later (as we are not currently updating them automatically everywhere), validate that the files are recent.

GermanS commented 6 years ago

Hello, @ranguard

I can't find a directory (t/) with tests. Could I create a directory for tests? Or what is the best place for the sitemap checker?

Could I use the additional package LWP::Parallel::UserAgent for the tests, or does test speed not matter?

What should the script output? Plain text? HTML?

ranguard commented 6 years ago

Hi @GermanS,

This whole repo is tests, or maybe better explained as monitoring...

https://github.com/metacpan/metacpan-monitoring/blob/master/README.md

That explains it a bit. It was set up by @melezhik, who may be around to help; it's written using his module https://metacpan.org/pod/swat

Does that help?

melezhik commented 6 years ago

Hi guys! Please let me know if you need any help with swat/metacpan monitoring.

oalders commented 6 years ago

@melezhik thanks! Could you give @GermanS some guidance on how to add some tests that A) ensure that the two sitemap files exist and possibly B) check that they contain valid XML?

melezhik commented 6 years ago

Hi @GermanS, you may do this:

$ mkdir sitemap-authors.xml.gz
$ nano sitemap-authors.xml.gz/hook.pm

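use strict;
use warnings;
use XML::LibXML;
use IO::Zlib;

set_response_processor(

 sub {

   my $headers   = shift; # original response, HTTP headers, string
   my $body      = shift; # original response, body, string

   # Save the gzipped body to disk so IO::Zlib can read it back.
   my $file = test_root_dir()."/file.gz";
   open(my $out, '>:raw', $file) or die "Unable to open $file: $!";
   print $out $body;
   close($out) or die "Unable to close $file: $!";

   # Decompress; $xml stays empty if the file cannot be opened as gzip.
   my $fh  = IO::Zlib->new;
   my $xml = "";

   if ($fh->open($file, "rb")) {
      $xml = join "", <$fh>;
      $fh->close;
   }

   # load_xml dies on malformed (or empty) input; eval turns that into false.
   my $ok = eval { XML::LibXML->load_xml( string => $xml ); 1 };

   # The returned text is what the expectations in get.txt are checked against.
   return $ok
     ? "Headers: $headers\ncorrect XML"
     : "Headers: $headers\nwrong XML: $@";
});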

$ nano sitemap-authors.xml.gz/get.txt
200 OK
correct XML

And so on for the other XML files. This should be enough.
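
For reference, the decompress-and-parse step in the hook above could also be done without the temporary file. The following is only a sketch, not the code from the branch: it assumes IO::Uncompress::Gunzip (a core Perl module) is acceptable in place of IO::Zlib, and `check_gz_xml` is a hypothetical helper name, not part of swat.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use XML::LibXML;

# Hypothetical helper (not part of swat): takes the raw gzipped response
# body and returns a verdict string in the same style as the hook above.
sub check_gz_xml {
    my ($body) = @_;

    # Decompress from one in-memory buffer into another.
    # Transparent => 0 makes gunzip fail on input that is not gzip data
    # instead of silently passing it through.
    my $xml;
    gunzip( \$body => \$xml, Transparent => 0 )
        or return "wrong XML: gunzip failed: $GunzipError";

    # load_xml dies on malformed input; eval turns that into a false value.
    my $ok = eval { XML::LibXML->load_xml( string => $xml ); 1 };
    return $ok ? "correct XML" : "wrong XML: $@";
}
```

Inside a response processor this would be called as `check_gz_xml($body)`, with the headers prepended to the result as in the example above.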

melezhik commented 6 years ago

I have committed the example; the working code is here: https://github.com/metacpan/metacpan-monitoring/tree/check-sitemap

GermanS commented 6 years ago

Hi, @melezhik

There are two ways to monitor the availability of URLs.

  1. The test script (sitemap-releases.xml.gz/hook.pm) creates a template directory and runs swat, for example:

    > mkdir sitemap-releases.xml.gz/any
    > echo 200 > sitemap-releases.xml.gz/any/get.txt
    > swat sitemap-releases.xml.gz/any https://metacpan.org/release/AI-XGBoost/

    This approach uses swat features, but it saves almost every page from the website to ~/.swat/, so the home directory keeps growing.

  2. The test script (sitemap-releases.xml.gz/hook.pm) uses the LWP package and checks the server response code.

Please review the code: https://github.com/metacpan/metacpan-monitoring/compare/master...GermanS:master

Alexey, which approach should I choose?
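
To make the second option concrete, a rough sketch of an LWP-based status check follows. It is not the code from the linked comparison: `failed_urls` is a hypothetical helper name, and it assumes a HEAD request is enough to judge availability (some servers answer HEAD differently from GET).

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns a hash of URL => HTTP status code for every
# URL that did not answer with a 2xx status. An empty hash means all passed.
sub failed_urls {
    my ($ua, @urls) = @_;
    my %failed;
    for my $url (@urls) {
        my $res = $ua->head($url);   # HEAD avoids downloading the page body
        $failed{$url} = $res->code unless $res->is_success;
    }
    return %failed;
}

# Example call (performs real network requests, so it is left commented out):
# my $ua     = LWP::UserAgent->new( timeout => 10 );
# my %failed = failed_urls( $ua, 'https://metacpan.org/release/AI-XGBoost/' );
# print "$_ => $failed{$_}\n" for sort keys %failed;
```

Passing the user agent in as an argument keeps the helper easy to exercise with a stub object, without touching the network.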

melezhik commented 6 years ago

Hi @GermanS, you can use swat_purge_cache to purge the cache files:

$ cat ~/swat.ini
swat_purge_cache=1

This addresses the issue of the cache files growing in size.

melezhik commented 6 years ago

I have updated the code in the branch to extract author URLs from the XML at runtime and check those URLs: https://github.com/metacpan/metacpan-monitoring/commit/4b3e65fda233067418814386ce02214f95ab6a17. This is a pure swat implementation, without the LWP agent.
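
As an illustration of the extraction step only (the linked commit is the actual implementation), pulling the URLs out of a decompressed sitemap might look like the sketch below. `sitemap_urls` is a hypothetical helper name, and the code assumes the sitemap lists each URL in a `<loc>` element, as the standard sitemap format does.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the trimmed text of every <loc> element
# found in a sitemap document given as a string.
sub sitemap_urls {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );

    # local-name() matches <loc> whether or not a namespace is declared.
    return map {
        my $text = $_->textContent;
        $text =~ s/^\s+|\s+$//g;   # strip surrounding whitespace
        $text;
    } $doc->findnodes('//*[local-name()="loc"]');
}
```

Matching on `local-name()` is a deliberate shortcut here: it avoids registering the sitemap namespace, at the cost of also matching any other element named `loc`.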

melezhik commented 6 years ago

The feature was added through #5.