Closed ranguard closed 6 years ago
Hello, @ranguard
I can't find a directory with tests (t/). Could I create a directory for tests? Or what is the best place for a sitemap checker?
Could I use the additional package LWP::Parallel::UserAgent for the tests, or does test speed not matter?
What should the script produce as output? Plain text? HTML?
Hi @GermanS,
This whole repo is tests, or maybe better explained as monitoring...
https://github.com/metacpan/metacpan-monitoring/blob/master/README.md
Explains it a bit. It was set up by @melezhik, who may be around to help; it's written using his module https://metacpan.org/pod/swat
Does that help?
Hi guys! Please let me know if you need any help with swat/metacpan monitoring.
@melezhik thanks! Could you give @GermanS some guidance on how to add some tests that A) ensure that the two sitemap files exist and possibly B) check that they contain valid XML?
Hi @GermanS, you may do this:

```
$ mkdir sitemap-authors.xml.gz
$ nano sitemap-authors.xml.gz/hook.pm
```

```perl
use strict;
use warnings;
use XML::LibXML;
use IO::Zlib;

set_response_processor(
    sub {
        my $headers = shift;    # original response, HTTP headers, string
        my $body    = shift;    # original response, body, string

        # Save the gzipped body to disk so IO::Zlib can read it back
        open( my $out, '>:raw', test_root_dir() . "/file.gz" )
            or die "Unable to open: $!";
        print $out $body;
        close($out);

        # Decompress the sitemap
        my $fh  = IO::Zlib->new;
        my $xml = "";
        if ( $fh->open( test_root_dir() . "/file.gz", "rb" ) ) {
            $xml = join "", <$fh>;
            $fh->close;
        }

        # Validate the XML; load_xml() dies on malformed input
        eval { XML::LibXML->load_xml( string => $xml ) };
        if ($@) {
            return "Headers: $headers\nwrong XML: $@";
        }
        else {
            return "Headers: $headers\ncorrect XML";
        }
    }
);
```

```
$ nano sitemap-authors.xml.gz/get.txt
```

```
200 OK
correct XML
```
And so on for the other sitemap. This should be enough.
I have committed the example; working code is here - https://github.com/metacpan/metacpan-monitoring/tree/check-sitemap
Hi, @melezhik
There are two ways of monitoring the availability of the URLs.

1. The test script (sitemap-releases.xml.gz/hook.pm) creates a template directory and runs swat, for example:

> mkdir sitemap-releases.xml.gz/any
> echo 200 > sitemap-releases.xml.gz/any/get.txt
> swat sitemap-releases.xml.gz/any https://metacpan.org/release/AI-XGBoost/

This way uses swat features, but it saves almost all the pages from the website to ~/.swat/, so the size of the home directory keeps growing.

2. The test script (sitemap-releases.xml.gz/hook.pm) uses the LWP package and checks the server response code.
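For reference, the LWP variant can stay very small. Below is a minimal sketch of that approach (the URL list and the 200-only success criterion are illustrative, not the actual script in the branch):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 30 );

# Illustrative list of sitemap URLs to monitor
for my $url (
    'https://metacpan.org/sitemap-authors.xml.gz',
    'https://metacpan.org/sitemap-releases.xml.gz',
) {
    my $res = $ua->get($url);

    # Only the HTTP status code is checked, as described above
    printf "%s => %s\n", $url, $res->code;
    die "FAIL: $url returned " . $res->code unless $res->code == 200;
}
```

This avoids swat's response cache entirely, at the cost of losing swat's check-file DSL.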
Please review the code: https://github.com/metacpan/metacpan-monitoring/compare/master...GermanS:master Alexey, which way should I choose?
Hi @GermanS, you can use swat_purge_cache
to purge cache files:

$ cat ~/swat.ini
swat_purge_cache=1

This addresses the issue of the cache files growing in size.
I have updated the code in the branch to add runtime extraction of author URLs from the XML and to check those URLs - https://github.com/metacpan/metacpan-monitoring/commit/4b3e65fda233067418814386ce02214f95ab6a17 ; this is a pure swat implementation without an LWP agent.
The feature was added through #5.
https://metacpan.org/sitemap-authors.xml.gz https://metacpan.org/sitemap-releases.xml.gz
Initially, just that they are there. Possibly that they are valid sitemap formats. Later (as we are not currently updating automatically everywhere), validate that the files are recent.
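A freshness check could be sketched like this once the existence and validity checks pass. This is only a sketch: it assumes the server sends a Last-Modified header for the gzipped sitemaps (if it does not, the check would have to parse <lastmod> out of the sitemap XML instead), and the 7-day threshold is an arbitrary placeholder:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date qw(str2time);

# Maximum allowed age; the 7-day threshold is an arbitrary assumption
my $max_age = 7 * 24 * 60 * 60;

my $ua  = LWP::UserAgent->new( timeout => 30 );
my $res = $ua->get('https://metacpan.org/sitemap-authors.xml.gz');
die "fetch failed: " . $res->status_line unless $res->is_success;

# Assumes the response carries a Last-Modified header
my $lm = $res->header('Last-Modified')
    or die "no Last-Modified header to judge freshness by";

my $age = time() - str2time($lm);
die "sitemap is stale (${age}s old)" if $age > $max_age;
print "sitemap is fresh (${age}s old)\n";
```

The same logic could live inside a swat hook.pm response processor, returning "fresh"/"stale" lines for a get.txt check file.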