magnusmanske / petscan_rs

The repo for the PetScan tool
https://petscan.wmflabs.org/
GNU General Public License v3.0

Inconsistent results over a short period of time using an identical CirrusSearch "search filter" parameter #157

Closed lnoss closed 8 months ago

lnoss commented 8 months ago

The project has been rock solid for years, but I recently ran into a problem. This request returns different results each time: https://petscan.wmflabs.org/?edits%5Bflagged%5D=both&templates_any=&pagepile=&active_tab=tab_output&outlinks_yes=&manual_list=&search_wiki=frwiki&cb_labels_any_l=1&ores_prob_from=&labels_yes=&show_soft_redirects=both&edits%5Bbots%5D=both&labels_any=&wikidata_prop_item_use=&language=fr&sitelinks_yes=&langs_labels_any=&page_image=any&ns%5B0%5D=1&templates_no=&wikidata_label_language=&maxlinks=&smaller=&manual_list_wiki=&outlinks_no=&output_limit=10&links_to_all=C%C3%A9sar+du+meilleur+film&wikidata_source_sites=&cb_labels_no_l=1&search_max_results=500&sitelinks_no=&project=wikipedia&negcats=&langs_labels_no=&search_filter=insource%3A%2F%5C%5B%5C%5BC%C3%A9sar+du+meilleur+film%28%3F%3A%5C%7C%5B%5E%7C%5C%5D%5D*%29%3F%5C%5D%5C%5D%2Fi&rxp_filter=&links_to_no=&common_wiki_other=&after=&wikidata_item=no&cb_labels_yes_l=1&interface_language=en&doit=

I tried to check the source code, but I am not a Rust expert. On a less amusing note, PetScan crashed while I was previewing this issue before submitting:

Reqwest(reqwest::Error {
    kind: Request,
    url: Url {
        scheme: "https",
        cannot_be_a_base: false,
        username: "",
        password: None,
        host: Some(Domain("fr.wikipedia.org")),
        port: None,
        path: "/w/api.php",
        query: Some("action=query&siprop=general%7Cnamespaces%7Cnamespacealiases%7Clibraries%7Cextensions%7Cstatistics&format=json&meta=siteinfo"),
        fragment: None
    },
    source: hyper::Error(Io, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
})

magnusmanske commented 8 months ago

The internal sort order is unstable (sort occurs at the end), and you are limiting the output to 10 results. Therefore, you get 10 different results each time, essentially a sample.
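
To illustrate the point above, here is a minimal sketch of why a limit applied to an unordered result list behaves like a sample. This is not PetScan's actual code: the function names `limit_then_sort` and `sort_then_limit` are hypothetical, and it assumes the limit is applied before the final sort, which the comment above suggests but does not spell out.

```rust
/// Hypothetical helper: cut the list down first, then sort what is left.
/// The result depends on the (possibly arbitrary) order of the input.
fn limit_then_sort(pages: Vec<String>, limit: usize) -> Vec<String> {
    let mut out: Vec<String> = pages.into_iter().take(limit).collect();
    out.sort();
    out
}

/// Hypothetical helper: sort the full list first, then cut it down.
/// The result is the same for any ordering of the same input set.
fn sort_then_limit(mut pages: Vec<String>, limit: usize) -> Vec<String> {
    pages.sort();
    pages.truncate(limit);
    pages
}

/// Convenience for building owned titles in examples.
fn titles(raw: &[&str]) -> Vec<String> {
    raw.iter().map(|s| s.to_string()).collect()
}
```

With the same four titles arriving in two different orders, `limit_then_sort(.., 2)` returns two different pairs, while `sort_then_limit(.., 2)` returns the same pair both times.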

lnoss commented 8 months ago

I limited the output only for this report. Even without the limit, the returned set is unstable and differs on each run.
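
One way to confirm that the set itself changes, and not just the order, is to compare two unlimited runs as sets. The sketch below assumes the page titles from each run have been saved as plain lists; `diff_runs` is a hypothetical helper, not part of PetScan.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the titles that appear only in the first run
/// and the titles that appear only in the second run, ignoring order.
fn diff_runs(run_a: &[&str], run_b: &[&str]) -> (Vec<String>, Vec<String>) {
    let a: BTreeSet<&str> = run_a.iter().copied().collect();
    let b: BTreeSet<&str> = run_b.iter().copied().collect();
    let only_a = a.difference(&b).map(|s| s.to_string()).collect();
    let only_b = b.difference(&a).map(|s| s.to_string()).collect();
    (only_a, only_b)
}
```

If both returned lists are empty, the two runs contain the same pages and only the ordering differs. Any non-empty list means pages are actually missing from one of the runs.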