outbrain / fwumious_wabbit

Fwumious Wabbit, fast on-line machine learning toolkit written in Rust
Other
133 stars 19 forks source link

Cache-only mode fix #87

Closed SkBlaz closed 1 year ago

SkBlaz commented 1 year ago

It appears cache-only still did the forward pass. Pinned arguments to the same value so that

./target/release/fw --build_cache_only --data ~/somerandomdata/data.vw.gz

works as one would expect (=faster + cache only is produced).

SkBlaz commented 1 year ago

@SkBlaz sorry but lacking context here. What is the fix? What was the problem? I only see a print added, and a name change for a cmd argument.

@bbenshalom current binary appears to not produce cache only if using just the existing flag - it still does the forward pass it seems. By pinning long/short cmd names I could get it to conduct just caching by doing what one would expect (just adding --build_cache_only to cli command). Otherwise, it seems that apart from the flag, one needed -c; this is less intuitive + also it still does not seem to produce cache only

yonatankarni commented 1 year ago

@SkBlaz sorry but lacking context here. What is the fix? What was the problem? I only see a print added, and a name change for a cmd argument.

@bbenshalom current binary appears to not produce cache only if using just the existing flag - it still does the forward pass it seems. By pinning long/short cmd names I could get it to conduct just caching by doing what one would expect (just adding --build_cache_only to cli command). Otherwise, it seems that apart from the flag, one needed -c; this is less intuitive + also it still does not seem to produce cache only

@SkBlaz sorry, I also don't understand. how does this work exactly?

@SkBlaz sorry but lacking context here. What is the fix? What was the problem? I only see a print added, and a name change for a cmd argument.

@bbenshalom current binary appears to not produce cache only if using just the existing flag - it still does the forward pass it seems. By pinning long/short cmd names I could get it to conduct just caching by doing what one would expect (just adding --build_cache_only to cli command). Otherwise, it seems that apart from the flag, one needed -c; this is less intuitive + also it still does not seem to produce cache only

sorry @SkBlaz I still don't understand - how does changing the command long name affect the forward pass?

SkBlaz commented 1 year ago

@SkBlaz sorry but lacking context here. What is the fix? What was the problem? I only see a print added, and a name change for a cmd argument.

@bbenshalom current binary appears to not produce cache only if using just the existing flag - it still does the forward pass it seems. By pinning long/short cmd names I could get it to conduct just caching by doing what one would expect (just adding --build_cache_only to cli command). Otherwise, it seems that apart from the flag, one needed -c; this is less intuitive + also it still does not seem to produce cache only

@SkBlaz sorry, I also don't understand. how does this work exactly?

@SkBlaz sorry but lacking context here. What is the fix? What was the problem? I only see a print added, and a name change for a cmd argument.

@bbenshalom current binary appears to not produce cache only if using just the existing flag - it still does the forward pass it seems. By pinning long/short cmd names I could get it to conduct just caching by doing what one would expect (just adding --build_cache_only to cli command). Otherwise, it seems that apart from the flag, one needed -c; this is less intuitive + also it still does not seem to produce cache only

sorry @SkBlaz I still don't understand - how does changing the command long name affect the forward pass?

Let me give an example perhaps. Consider current main. Building cache is supposed to be done as

./fw --build_cache_only --data {SOME DATA} -c

right? Well, this (with or without -c always yields)

error: Found argument '--build_cache_only' which wasn't expected, or isn't valid in this context
    Did you mean --build_cache_without_training?

USAGE:
    fw --build_cache_without_training

Let's try with that then (even though default argument is the short one)

        .arg(Arg::with_name("build_cache_only")
             .long("build_cache_without_training")
             .value_name("arg")
             .help("Build cache file without training the first model instance")
             .takes_value(false))

Doing

./target/debug/fw --build_cache_without_training --data ~/datasets/EXAMPLE_DATA_1h/data.vw.gz -c

yields.

[2023-03-02T16:58:38Z WARN  fw::vwmap] Warning: multi-byte namespace names are not compatible with old style namespace arguments
[2023-03-02T16:58:38Z INFO  fw::vwmap] _namespace_skip_prefix set in vw_namespace_map.csv is 2
[2023-03-02T16:58:38Z INFO  fw::cache] creating cache file = /home/bskrlj/datasets/EXAMPLE_DATA_1h/data.vw.gz.fwcache
[2023-03-02T16:58:51Z INFO  fw] Elapsed: 13.56s rows: 1000000

which, given the "cache only" option should never get to this last part (Elapsed ..), but should stop as soon as cache is built.

    if cl.is_present("build_cache_without_training") {
        return build_cache_without_training(cl);
    }

Maybe I'm just missing something apparent here, but

  1. It seems it does not get to this if clause in the first place
  2. Continues to build the cache while doing the forward pass/learn loop, which is undesired (should stop immediately after cache build step, that's why we have build_cache_without_training in the first place?

Oddly enough, if I pin the arguments (long/short) to be the same, it works fine and also exits right after cache is built (as intended) + no -c needed (non-intuitive to have that if we already have --build_cache_only flag)