rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0
12.55k stars 2.38k forks source link

Fuzzy queries cause excessive network requests. #11934

Closed ehuss closed 1 year ago

ehuss commented 1 year ago

When cargo can't find a crate name in a registry, it will also query every permutation of - and _ replacements (up to 1024 times) to see if there is a different match. When this code was written, it was assumed it would just be a simple query to git, but now that the sparse index does a network round trip for each one, it probably isn't a good assumption that doing 1024 is OK.

I think at a minimum it should lower the limit significantly.

Another option is to only do up to 3 queries, the original, all underscore, and all dashes.

epage commented 1 year ago

Another option is to only do up to 3 queries, the original, all underscore, and all dashes.

I think this is personally reasonable. I was surprised when I saw what lengths it went through.

ehuss commented 1 year ago

Here's a list of names on crates.io that use a mix of dash and underscore. There are 623 entries. That's not as many as I feared.

Click to expand list of 623 entries ``` a-star_traitbased a42-__- advancedresearch-agent_safety_layers advancedresearch-asi_core0 advancedresearch-error_predictive_learning advancedresearch-graph_builder advancedresearch-higher_order_core advancedresearch-higher_order_point advancedresearch-max_tree advancedresearch-nano_ecs advancedresearch-path_iter advancedresearch-rigid_body advancedresearch-tree_mem_sort advancedresearch-utility_programming aflak_imgui-glium-renderer aflak_imgui-sys air-interpreter-wasm_branch_release air-interpreter-wasm_ci_fix_version_increment air-interpreter-wasm_crate_release air-interpreter-wasm_fix air-interpreter-wasm_fold_iterable air-interpreter-wasm_get_rid_current_peer_id air-interpreter-wasm_jvalue_flatten air-interpreter-wasm_logger-i32 air-interpreter-wasm_new_it_types air-interpreter-wasm_new_stepper_outcome air-interpreter-wasm_prerelease air-interpreter-wasm_security_tetraplets_reborn air-interpreter-wasm_update_example air-interpreter-wasm_update_images air-interpreter-wasm_wasm_log air-interpreter-wasm_wit allegro_acodec-sys allegro_audio-sys allegro_color-sys allegro_dialog-sys allegro_font-sys allegro_image-sys allegro_memfile-sys allegro_primitives-sys allegro_ttf-sys android_liblog-sys android_log-sys android_looper-sys android_sensor-sys aorist_extendr-api aorist_extendr-engine aorist_extendr-macros asmkit-x86_64 asteracea_proc-macro-definitions async_ach-cell async_ach-mpmc async_ach-notify async_ach-pubsub async_ach-ring async_ach-spsc async_ach-waker async_ach-watch async_fn-proc_macros aux_ulib-sys aws-codebuild-status_aws aws-codebuild-status_derive aws-codebuild-status_server aws-codebuild-status_terminal aws-codebuild-status_web azul-wr_malloc_size_of bachue-auto_impl bech32-no_std bird_tool_utils-man black_marlin-compiler black_marlin-macros blake2-rfc_bellman_edition blip_buf-sys blizzard-engine_derive bls_sigs_ref-rs bootloader-x86_64-bios-boot-sector bootloader-x86_64-bios-common bootloader-x86_64-bios-stage-2 bootloader-x86_64-bios-stage-3 bootloader-x86_64-bios-stage-4 bootloader-x86_64-common bootloader-x86_64-uefi bsd_auth-sys byte-strings-proc_macros cargo-arch_test cargo-generate-license_impl-license cargo-wasi-exe-x86_64-apple-darwin cargo-wasi-exe-x86_64-pc-windows-msvc cargo-wasi-exe-x86_64-unknown-linux-musl cargo_crates-io_docs-rs_test cargo_crates-io_docs-rs_test2 clap_derive-v3 clean-node_modules cloudflare-dns-updater_rs const_env--value const_env_impl--value core_rustc-serialize crate-_-name crates_io-duplicates create-rust-app_cli crevice_notan-derive crokey-proc_macros crossbeam-skiplist_piedb cursive_buffered_backend-git-branchless custom-try-proc_macros d-_-b d3d10_1-sys dash-and_underscore defile-proc_macros deno-x86_64-apple-darwin deno-x86_64-pc-windows-msvc deno-x86_64-unknown-linux-gnu displaydoc-lite-proc_macros django-query_derive do-not-use-this-hello_cratesio download_rs-silent dscfg-cached_file_storage dusk-bls12_381 dusk-bls12_381-sign dyn_safe-proc_macros egui-latest_wgpu_backend encoding-next_index_tests entity_data-macros erlang_nif-sys eskom_se_push-api event_store-core evercrypt_tiny-sys evr_vista-sys exonum_librocksdb-sys exonum_libsodium-sys ext-trait-proc_macros fast-rustc-ap-rustc_ast fast-rustc-ap-rustc_ast_pretty fast-rustc-ap-rustc_attr fast-rustc-ap-rustc_data_structures fast-rustc-ap-rustc_errors fast-rustc-ap-rustc_feature fast-rustc-ap-rustc_fs_util fast-rustc-ap-rustc_index fast-rustc-ap-rustc_lexer fast-rustc-ap-rustc_macros fast-rustc-ap-rustc_parse fast-rustc-ap-rustc_session fast-rustc-ap-rustc_span fast-rustc-ap-rustc_target fawkes-crypto-bellman_ce fawkes-crypto-pairing_ce fawkes-crypto_derive ff-uint_derive ff_derive-zeroize fix-hidden-lifetime-bug-proc_macros flexible-locks_derive foley-cargoexample-guessing_game fruit-salad_proc-macro-definitions fuel-pest_derive fuel-pest_generator fuel-pest_grammars fuel-pest_meta fullcodec-bls12_381 function_name-proc-macro future-parking_lot gdnative-visual_script geng-debug_overlay get_if_addrs-sys ggez-assets_manager ghosts-proc_macros gluon_c-api gluon_language-server gmt_dos-actors gmt_dos-clients_arrow gmt_dos-clients_crseo gmt_dos-clients_domeseeing gmt_dos-clients_io gmt_dos-clients_mount gmt_m1-ctrl gmt_m1-ctrl_actuators gmt_m1-ctrl_center-actuators gmt_m1-ctrl_hardpoints-dynamics gmt_m1-ctrl_outer-actuators gmt_mount-ctrl gmt_mount-ctrl_sampling1000-damping002 gmt_mount-ctrl_sampling1000-damping002_ze00 gmt_mount-ctrl_sampling1000-damping002_ze30 gmt_mount-ctrl_sampling1000-damping002_ze30-500hz gmt_mount-ctrl_sampling1000-damping002_ze60 gmt_mount-ctrl_sampling8000-damping0005 gmt_mount-ctrl_sampling8000-damping002 gmt_mount-ctrl_sampling8000-damping002_ze00 gmt_mount-ctrl_sampling8000-damping002_ze30 gmt_mount-ctrl_sampling8000-damping002_ze60 google-accessapproval1_beta1 google-accessapproval1_beta1-cli google-accesscontextmanager1_beta google-accesscontextmanager1_beta-cli google-adexchangebuyer2_v2_beta1 google-adexchangebuyer2_v2_beta1-cli google-admin1_directory google-admin1_directory-cli google-admin1_reports google-admin1_reports-cli google-admin2_email_migration google-alertcenter1_beta1 google-alertcenter1_beta1-cli google-analyticsadmin1_alpha google-analyticsadmin1_alpha-cli google-analyticsdata1_beta google-analyticsdata1_beta-cli google-appengine1_beta4 google-appengine1_beta4-cli google-appengine1_beta5 google-appengine1_beta5-cli google-area120tables1_alpha1 google-area120tables1_alpha1-cli google-artifactregistry1_beta1 google-artifactregistry1_beta1-cli google-autoscaler1_beta2 google-autoscaler1_beta2-cli google-bigqueryconnection1_beta1 google-bigqueryconnection1_beta1-cli google-billingbudgets1_beta1 google-billingbudgets1_beta1-cli google-binaryauthorization1_beta1 google-binaryauthorization1_beta1-cli google-cloudasset1_beta1 google-cloudasset1_beta1-cli google-clouderrorreporting1_beta1 google-clouderrorreporting1_beta1-cli google-cloudkms1_beta1 google-cloudkms1_beta1-cli google-cloudmonitoring2_beta2 google-cloudmonitoring2_beta2-cli google-cloudprivatecatalog1_beta1 google-cloudprivatecatalog1_beta1-cli google-cloudprivatecatalogproducer1_beta1 google-cloudprivatecatalogproducer1_beta1-cli google-cloudresourcemanager1_beta1 google-cloudresourcemanager1_beta1-cli google-cloudscheduler1_beta1 google-cloudscheduler1_beta1-cli google-cloudsupport2_beta google-cloudsupport2_beta-cli google-cloudtasks2_beta2 google-cloudtasks2_beta2-cli google-cloudtasks2_beta3 google-cloudtasks2_beta3-cli google-clouduseraccountsvm_beta google-clouduseraccountsvm_beta-cli google-commentanalyzer1_alpha1 google-commentanalyzer1_alpha1-cli google-container1_beta1 google-containeranalysis1_beta1 google-containeranalysis1_beta1-cli google-content2_sandbox google-content2_sandbox-cli google-datacatalog1_beta1 google-datacatalog1_beta1-cli google-dataflow1_b4 google-datafusion1_beta1 google-datafusion1_beta1-cli google-datalabeling1_beta1 google-datalabeling1_beta1-cli google-datastore1_beta2 google-datastore1_beta3 google-datastore1_beta3-cli google-deploymentmanager2_beta1 google-deploymentmanager2_beta2 google-deploymentmanager2_beta2-cli google-dialogflow2_beta1 google-dialogflow2_beta1-cli google-dlp2_beta1 google-dlp2_beta1-cli google-dns1_beta1 google-documentai1_beta2 google-documentai1_beta2-cli google-domains1_beta1 google-domains1_beta1-cli google-factchecktools1_alpha1 google-factchecktools1_alpha1-cli google-fcmdata1_beta1 google-fcmdata1_beta1-cli google-file1_beta1 google-file1_beta1-cli google-firebase1_beta1 google-firebase1_beta1-cli google-firebaseappcheck1_beta google-firebaseappcheck1_beta-cli google-firebasedatabase1_beta google-firebasedatabase1_beta-cli google-firebasehosting1_beta1 google-firebasehosting1_beta1-cli google-firebasestorage1_beta google-firebasestorage1_beta-cli google-firestore1_beta1 google-firestore1_beta1-cli google-freebase1_sandbox google-gamesconfiguration1_configuration google-gamesconfiguration1_configuration-cli google-gamesmanagement1_management google-gamesmanagement1_management-cli google-gan1_beta1 google-gan1_beta1-cli google-genomics1_beta2 google-gmailpostmastertools1_beta1 google-gmailpostmastertools1_beta1-cli google-healthcare1_beta1 google-healthcare1_beta1-cli google-iap1_beta1 google-iap1_beta1-cli google-ideahub1_beta google-ideahub1_beta-cli google-language1_beta1 google-language1_beta1-cli google-lifesciences2_beta google-lifesciences2_beta-cli google-logging1_beta3 google-logging2_beta1 google-logging2_beta1-cli google-manager1_beta2 google-manager1_beta2-cli google-memcache1_beta2 google-memcache1_beta2-cli google-metastore1_beta google-metastore1_beta-cli google-ml1_beta1 google-ml1_beta1-cli google-networkconnectivity1_alpha1 google-networkconnectivity1_alpha1-cli google-oauth2_v2 google-oslogin1_beta google-oslogin1_beta-cli google-privateca1_beta1 google-privateca1_beta1-cli google-prod_tt_sasportal1_alpha1 google-prod_tt_sasportal1_alpha1-cli google-proximitybeacon1_beta1 google-proximitybeacon1_beta1-cli google-pubsub1_beta2 google-pubsub1_beta2-cli google-recommendationengine1_beta1 google-recommendationengine1_beta1-cli google-recommender1_beta1 google-recommender1_beta1-cli google-replicapool1_beta2 google-replicapool1_beta2-cli google-replicapoolupdater1_beta1 google-replicapoolupdater1_beta1-cli google-reseller1_sandbox google-reseller1_sandbox-cli google-resourceviews1_beta2 google-resourceviews1_beta2-cli google-runtimeconfig1_beta1 google-runtimeconfig1_beta1-cli google-sasportal1_alpha1 google-sasportal1_alpha1-cli google-secretmanager1_beta1 google-secretmanager1_beta1-cli google-servicedirectory1_beta1 google-servicedirectory1_beta1-cli google-spectrum1_explorer google-spectrum1_explorer-cli google-speech1_beta1 google-speech1_beta1-cli google-sql1_beta4 google-sql1_beta4-cli google-sqladmin1_beta4 google-sqladmin1_beta4-cli google-taskqueue1_beta2 google-taskqueue1_beta2-cli google-tpu1_alpha1 google-tpu1_alpha1-cli google-transcoder1_beta1 google-transcoder1_beta1-cli google-videointelligence1_beta1 google-videointelligence1_beta1-cli gothack-future-parking_lot gps-gunnlaug_18 graphannis-malloc_size_of graphannis-malloc_size_of_derive gt-bijective_connection_graph guessing_game-79888c11-a5a6-4548-a26a-c85ea39d0499 guessing_game-panzerstadt guessing_game-xfossdotcom hkalbasi-rustc-ap-rustc_abi hkalbasi-rustc-ap-rustc_data_structures hkalbasi-rustc-ap-rustc_graphviz hkalbasi-rustc-ap-rustc_index hkalbasi-rustc-ap-rustc_macros hkalbasi-rustc-ap-rustc_serialize hvcg_governance_openapi_catholic-polity ignite-rs_derive inside-vm_arch_support ip_network_table-deps-treebitmap istherepie-hello_world iter-identify_first_last jelly-mem_access jh-x86_64 jit_rs-sys jpark011-guessing_game jsonpath_lib-fl koibumi-unifont_jp konektor_db-sys lattice_qcd_rs-procedural_macro lazy-regex-proc_macros lending-iterator-proc_macros lens-rs_derive lens-rs_generator lettre-openssl111_email lex-map_editor_std lib_remotebuild-rs libaeron_driver-sys libpg_query-sys libpg_query2-sys libzfs_core-sys lim1988_cargo-demo limn-text_layout linux_input-sys liquidcrystal_i2c-rs login_cap-sys lpc177x_8x-hal lukas_zech_software_hello-world lz4_flex-util m2-ctrl_asm m2-ctrl_fsm macro_rules_attribute-proc_macro matio-rs_derive mbnapi_uuid-sys mdbook-pdf-headless_chrome merge_derive-hashmap metrics-exporter-async_std metrics-exporter-http-async_std mf_vista-sys mfplat_vista-sys mincore_downlevel-sys mini_paste-proc_macro minimp3_ex-sys minimult_cortex-m mock-it_codegen mount-ctrl_sampling1000-damping002 mount-ctrl_sampling8000-damping0005 mproxy-socket_dispatch msiz_rustc-ap-arena msiz_rustc-ap-graphviz msiz_rustc-ap-rustc_data_structures msiz_rustc-ap-rustc_errors msiz_rustc-ap-rustc_lexer msiz_rustc-ap-rustc_macros msiz_rustc-ap-rustc_target msiz_rustc-ap-serialize msiz_rustc-ap-syntax msiz_rustc-ap-syntax_pos msv1_0-sys my_crate_72a9eb54-5002-4540-a6e6-f2620f51b550 mycodee-project_manager nacos-api_macro nameless-clap_derive nameless-clap_generate nameless-clap_up navinfo-add_two next-gen-proc_macros next-gen_proc-macro nix-ptsname_r-shim nnye-user_server-gateway-business noise_search_deps_librocksdb-sys nougat-proc_macros nt-list_macros ntstc_libcmt-sys ntstc_msvcrt-sys object_store-fork optimisticpeach-opengles_graphics paho_mqtt_c-sys paste_rs-cli patch-test-crate_121213131213131 persian-rug_derive piston-ai_behavior piston-button_controller piston-button_tracker piston-dyon_interactive piston-fake_dpi piston-gfx_texture piston-graphics_api_version piston-history_tree piston-shaders_graphics2d piston-split_controller piston-timer_controller piston2d-deform_grid piston2d-drag_controller piston2d-gfx_graphics piston2d-glium_graphics piston2d-graphics_tree piston2d-opengl_graphics piston2d-scroll_controller piston2d-touch_visualizer piston2d-wgpu_graphics piston3d-gfx_voxel pistoncore-event_loop pistoncore-glfw_window pistoncore-glutin_window pistoncore-sdl2_window pistoncore-winit_window plonk-bls12_381 pocket_prover-derive pocket_prover-set protoc-bin-vendored-linux-aarch_64 protoc-bin-vendored-linux-ppcle_64 protoc-bin-vendored-linux-x86_32 protoc-bin-vendored-linux-x86_64 protoc-bin-vendored-macos-x86_64 ptr_eq-macros qemu-x86_64 qpid_proton-sys quine-mc_cluskey r2d2-influx_db_client ra_ap_la-arena ra_ap_lsp-server ra_ap_proc-macro-srv-cli ra_ap_rust-analyzer ra_ap_vfs-notify rest-client_codegen retryable-proc_macros rhizome_proc-macro-definitions rime_api-sys rpi_ws281x-c rpi_ws281x-sys rs-data-formats_derive rs-odbc_derive rs-tiled_json rsudo-rs_support rumblebars-rustlex_codegen rust-riemann_health rust_sodium-sys rust_sodium_holochain_fork-sys rustc-ap-proc_macro rustc-ap-rustc_arena rustc-ap-rustc_ast rustc-ap-rustc_ast_passes rustc-ap-rustc_ast_pretty rustc-ap-rustc_attr rustc-ap-rustc_cratesio_shim rustc-ap-rustc_data_structures rustc-ap-rustc_error_codes rustc-ap-rustc_errors rustc-ap-rustc_expand rustc-ap-rustc_feature rustc-ap-rustc_fs_util rustc-ap-rustc_graphviz rustc-ap-rustc_index rustc-ap-rustc_lexer rustc-ap-rustc_lint_defs rustc-ap-rustc_macros rustc-ap-rustc_parse rustc-ap-rustc_serialize rustc-ap-rustc_session rustc-ap-rustc_span rustc-ap-rustc_target rustc-ap-syntax_pos rustc_codegen_spirv-types rustdt-json_rpc rustfmt-config_proc_macro rv32m1_ri5cy-hal rv32m1_ri5cy-pac safer_ffi-proc_macro sapling-crypto_ce screenruster-saver-laughing_man sea-strum_macros secret-keeper-test_util serde_sane-rs server-sent_events simple-tlv_derive simple_predicates-rs simple_tables-core simple_tables-derive solana_libra_crypto-derive solana_libra_grpcio-client soroban-wasmi_core sqlx-core_wasi sqlx-macros_wasi sqlx-rt_wasi stm32g473-hal_oppe stream-tls-client_hello-detector struct_deser-derive sudo_plugin-sys sugiura-hiromichi_dot sugiura-hiromichi_gc sugiura-hiromichi_mylibrary sugiura-hiromichi_tp tamasfe-schemars_derive tdameritrade_rust-async thiserror_core2-impl tibco_ems-sys tls-client_hello-parser to-syn-value_derive tokio-postgres_wasi tokio-stream_wasi tokio-tungstenite_wasi tokio-util_wasi tokio_modbus-winets toml-query_derive toy-arms_derive twain_32-sys unic-ucd-name_aliases update-sync_derive user_doc-doc_data user_doc-doc_proc_macro user_doc-tests va_list-helper va_list-rs va_list-test vds_uuid-sys vss_uuid-sys web-view_suppress webview-sys_suppress winapi-x86_64-pc-windows-gnu with_builtin_macros-proc_macros with_locals-proc_macros wlc-sys-with_elogind wlc-with_elogind ws2_32-sys wsbapp_uuid-sys x86_64-linux-nolibc x86_64-xsave xinput9_1_0-sys xmltree-parse_with_config zenoh-link-unixsock_stream ```
arlosi commented 1 year ago

I did some more analysis on the names on crates.io. The currently limit of 1024 is definitely excessive, as nothing uses more than 8 dashes/underscores.

We'd need to set the limit to at least 16 in order to have fewer misses than the suggested algorithm of (original, underscores, dashes).

Request limit Potentially missed crates
1 68282
2 21214
4 4437
8 897
16 177
32 37
64 12
128 2
256 0
est31 commented 1 year ago

cc #2775 and https://internals.rust-lang.org/t/pre-rfc-unify-dashes-and-underscores-on-crates-io/13216

est31 commented 1 year ago

What about doing the requests in parallel? This is also how cargo update works, and a cargo update to a normal-sized repository will create requests to a couple hundred crates, so number wise it's not that excessive (cat Cargo.lock | rg "^name =" | sort | uniq | wc -l gives me 279 for the cargo repo for example). The largest number of _'s is 8 which gives 256 many requests. The problem might be the high ratio of 404s returned though, maybe some cloud providers don't like that.

If one wants to do "escalation" then one can first make a request for all names with 1 bit flipped, then wait for the answer, then if nothing is found make another request for all names with 2 bits flipped, etc, with increasing numbers of bits flipped. This is the binomial coefficients game so it decreases again after you reach the half, so maybe at that point one can make one last request with all the bit flippings exceeding n/2.

arlosi commented 1 year ago

What about doing the requests in parallel?

Cargo does issue the requests in parallel. The motivation for investigating this was the high number of 404s that infra was seeing in logs.

est31 commented 1 year ago

@arlosi I see. Is there some public discussion about the 404 issue?

I suppose if only 623 entries are affected then it's okay to cut some corners and emit suggestion-free diagnostics in those cases, in all other cases the - and _ only queries will show up the right result.

Ideally one would have used the migration to http indices to migrate to using the canonicalized name (as suggested in 2020), but I guess now it's too late for that.

arlosi commented 1 year ago

Is there some public discussion about the 404 issue?

Yes, on the crates-io zulip.

Eh2406 commented 1 year ago

Sparse registries have the same format as a checkout of the git index. You can transition by checking out the git index and running a static file server in the resulting folder. That transition path was important enough to me to justify not fixing known problems with the index format. Feel free to blame me for that decision, especially as the decision was made knowingly.

Another decision that complicates things is that alternative registries are allowed to have multiple packages whose names only differ by -/_. Thanks to package renaming, you can even use both packages in the same project.

Thinking about it more, the currently stable implementation tries all kinds of files and returns the first one it finds. (All kinds of files include ridiculously incorrect things like a_/b-/a-b_foo as fixed in #11936.) It then reads the available packages from that first file. Rows whose name matches the value requested by the dependency get tried by the resolver, and if there are none the names that do not match are included as a suggestion. If I am correct, a registry could have multiple crates that differ only in -/_ with them all listed in the same file in the index. It would work correctly, ignoring the inefficient implementation, even if that one file wast at something hilariously invalid.

Assuming all of this is correct, I suggest that:

  1. Cargo be changed to only try the name as given, followed by the name canonicalized to _.
  2. The registry documentation should be updated to recommend that:
    • If the registry mainly supports cargoes after the change from 1., keeping one index file (canonicalized to _) with rows for all packages whose name canonical eyes to the same thing.
    • If the registry needs to support cargoes from before the un-canonical loop was added, then include a file per package based on its name.
    • Explain that in between "the un-canonical loop was added" and "change from 1." Cargo will get the correct answer if the new system is used but may not do so efficiently.
  3. Update the crates.io index to have both a-/b_/a-b_foo and a_/b_/a_b_foo files in both the git and sparse index.
  4. (If the crates.io git and sparse indexes are allowed to differ, and we have #10928) consider removing a-/b_/a-b_foo from the sparse and removing a_/b_/a_b_foo from the git.
weihanglo commented 1 year ago

I am going to close this since #11937 and #12083 significantly reduces network requests. If there is something still missing after #12083, feel free to reopen :)