Closed by jonasstein 6 years ago
Hmm, I just checked the package with srctest and now it's correct (it's online). I don't think it's a real false positive in the sense of being wrongly marked as offline; maybe the package really was offline at the time of testing, even though that's unlikely because the other packages from that site were online. Maybe I just had trouble with my internet connection at the time, or it's because of the use of parallel. At least the script seems to work correctly. I will keep an eye on this.
However, don't forget that srctest (and every other script, actually) just takes a snapshot of each package, which doesn't necessarily represent the status quo. :)
OK, forget what I just said, it's clearly a parallel problem: the full list shows that the file has two statuses:
available|dev-tex/latex2html;http://mirrors.ctan.org/support/latex2html/latex2html-2017.2.tar.gz;tex@gentoo.org:
not_available|dev-tex/latex2html|http://mirrors.ctan.org/support/latex2html/latex2html-2017.2.tar.gz|tex@gentoo.org:
Hmm, that will be interesting to fix ..
OK, I found a bug in my srctest script: available packages didn't use the new delimiter yet. I'm not sure whether this caused the problem, so I'll wait until tomorrow, when it will be easier to check the result.
On a similar note, I've also checked the wwwtest result.
If there were a similar problem (a website with two statuses), the output of
curl -s http://gentoo.levelnine.at/wwwtest/full.txt|cut -d'|' -f4|sort -u|wc
and
curl -s http://gentoo.levelnine.at/wwwtest/full.txt|cut -d'|' -f1,4|sort -u|wc
would be different. In fact, at the moment it is different, but only by one line, because there are 5 lines that don't use the delimiter (search for "200 http"). I don't know yet why that happened, but I don't think it's related to parallel, as I'd expect it to happen more often in that case.
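A tiny sketch of why those two counts diverge when a URL carries two statuses (the sample lines and the field layout are made up for illustration; field 4 stands in for the homepage URL):

```shell
# Fake full.txt sample: status|package|file|homepage
printf '%s\n' \
  'available|pkg-a|f1|http://example.com/a' \
  'not_available|pkg-a|f1|http://example.com/a' \
  'available|pkg-b|f2|http://example.com/b' > full.txt

# Unique homepages: 2 lines
cut -d'|' -f4 full.txt | sort -u | wc -l
# Unique (status, homepage) pairs: 3 lines -- the extra line reveals a
# homepage that was recorded with two different statuses
cut -d'|' -f1,4 full.txt | sort -u | wc -l
```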
OK, today we got a correct result and I could check it with the above commands. Gladly, the difference is just 5 packages that have different statuses. Considering I'm checking about 22000 packages each run, I think this is negligible, and it will probably be eliminated anyway because I found a different "problem".
The more interesting aspect of my checks is the output of the above commands once with sort -u and once with plain sort.
The output shows that I have nearly 10000 duplicates, which currently get checked every time as well! Unlike wwwtest, which first checks a tmpfile for already-checked homepages, I didn't implement that in srctest. The reasoning was that I assumed every new package (version) would also have a new file to download. What I hadn't thought about were revisions of packages, patch files that get applied to multiple versions, and probably other cases where files are used multiple times.
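A minimal sketch of such a dedup step (file names and the field layout are my assumptions, not the real script): collapse the list to one line per unique download URL before testing, since revisions and shared patch files point at the same distfile.

```shell
# Hypothetical list: flag|package-version|download-url|maintainer
printf '%s\n' \
  'x|dev-foo/bar-1.0|http://example.com/bar-1.0.tar.gz|m@g.o' \
  'x|dev-foo/bar-1.0-r1|http://example.com/bar-1.0.tar.gz|m@g.o' \
  'x|dev-foo/baz-2.0|http://example.com/baz-2.0.tar.gz|m@g.o' > pkglist.txt

# Keep only the first line per unique URL (field 3); the -r1 revision
# reuses bar-1.0.tar.gz and so drops out of the check queue
sort -t'|' -u -k3,3 pkglist.txt > pkglist-dedup.txt
wc -l < pkglist-dedup.txt
```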
Anyway, the good news is that this will be fixed tonight. :) The even better news is that it will improve the run time a lot: 10k fewer packages to check will clearly speed up the script again. Looking forward to seeing how much faster it will be. :)
This is another false positive (probably fixed tomorrow, or there's a different reason):
http://gentoo.levelnine.at/srctest/sort-by-maintainer/lxqt_at_g.o.txt
x11-misc/obconf-qt|https://dev.gentoo.org/~jauhien/distfiles/obconf-qt-0.9.0_p20150729.tar.gz|lxqt@gentoo.org:
The package version is missing there too: x11-misc/obconf-qt-0.9.0_p20150729.
Hi,
The package version will be included now too. I just restarted the script, as it went wild overnight (somehow the full listing grew to about 5G!). I'm not sure what the reason for that is, but I made some minor changes. We will see what happens next.
Looks good now. :)
not yet,
http://gentoo.levelnine.at/wwwtest/sort-by-maintainer/sebastian%2Bdev_at_ramacher.at.txt
I can see https://pwmt.org/projects/zathura/, but 000 means server not available, right?
Not necessarily: 000 means the script didn't get a response (HTTP code) from the website. The timeout for this is 10 seconds. To be clear, 000 isn't an HTTP code; it's just a placeholder for no status.
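For illustration, a curl-based sketch of how such a 000 arises (the address is from the TEST-NET-3 range and is guaranteed unreachable; the 10-second timeout mirrors the one described above):

```shell
# %{http_code} prints 000 when curl got no HTTP response at all
# (DNS failure, connection refused, or the 10s timeout below)
code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' http://203.0.113.1/ || true)
echo "$code"
```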
in http://gentoo.levelnine.at/srctest/sort-by-maintainer/proxy-maint_at_g.o.txt
sys-process/minit|minit-0.10|http://dl.fefe.de/minit-0.10.tar.bz2|aw-gentoo@instandbesetzt.net:proxy-maint@gentoo.org:
works for me.
Could we have the status number back in the log? Perhaps it would help to track false positives better and find a solution.
more false positives: https://bugs.gentoo.org/637012
The http://www.skarnet.org entries in https://gentoo.levelnine.at/full-sort-by-maintainer/williamh_at_g.o.txt are false positives.
NP had an idea: it could be that we trigger a server limitation with our parallel fetches. It would be interesting to fetch in a mixed order, so that we don't fetch from one server 8 times in parallel.
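A rough sketch of that idea (the URL list and the check_one job name are made up here): shuffling the list before handing it to parallel spreads consecutive jobs across hosts instead of hammering one server with 8 simultaneous fetches.

```shell
# Hypothetical URL list grouped by host, as the current run order would be
printf '%s\n' \
  'http://www.skarnet.org/a' 'http://www.skarnet.org/b' \
  'http://example.com/a' 'http://example.com/b' > urls.txt

# Shuffle so neighbouring (and thus parallel) jobs rarely share a host
shuf urls.txt > urls-mixed.txt

# The actual run would then be something like:
# parallel -j8 check_one < urls-mixed.txt   # check_one is hypothetical
```

A shuffle only makes same-host collisions unlikely, not impossible; a real fix might additionally group URLs by host and round-robin across the groups.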
Hi,
After looking into your false positives, I finally found some problems with the script. :) To explain what went wrong, I'll first explain how the script works: srctest uses wget's spider functionality to get the HTTP status code for each file it checks. If it finds a particular text (which was: HTTP/1.1 200 OK), it decides the package must be online. However, your false positives return slightly different status lines. So far I've seen:
HTTP/1.0 200 OK
HTTP/1.1 200 Coming Up
Since I didn't check for those, the packages were marked as unavailable.
I've now changed the script to only check for the text 'Remote file exists.', which is also printed by wget's spider. This should fix a whole bunch of false positives.
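A sketch of the revised check as I understand it (the function name and options are mine, not the actual srctest code): grep wget's spider output for 'Remote file exists.' instead of an exact status line, so that HTTP/1.0 200 OK and HTTP/1.1 200 Coming Up are both treated as online.

```shell
# Assumed helper, not the real srctest code
check_src() {
    # wget --spider requests headers only; on a successful check it
    # prints "Remote file exists." regardless of the exact status line
    if wget --spider -T 10 -t 1 "$1" 2>&1 | grep -q 'Remote file exists.'; then
        echo available
    else
        echo not_available
    fi
}
```

With this, a mirror that answers with HTTP/1.0 instead of HTTP/1.1 should no longer be flagged as unavailable.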
Nope, this only fixes srctest. wwwtest works differently.
Great, it seems to work for both srctest and wwwtest now. I can't find any of the old false positives. Thank you.
No problem, please let me know if you find others. :)
I think this is a false positive: