mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

Pretrained model #10

Open Mihonarium opened 3 years ago

Mihonarium commented 3 years ago

While it's relatively easy to train the model on the Dataset-mini (even Colab allows that), it's not as easy to reproduce the paper's results with the Dataset-full. It would be great if you could publish a model trained on the full dataset.

(By the way, congratulations on the paper, and thanks for publishing the work, it's really cool!)

Mihonarium commented 3 years ago

Oh, sorry, I just saw that you actually use the mini dataset for training and the full one for a full-scale evaluation. Closing the issue.

mimbres commented 3 years ago

Thanks. Yes, the training part is actually the same. I have a plan for Colab; the Google Drive (raw) files are there exactly for the purpose of mounting them in Colab.

Training in Colab: I haven't tested it, but it should work. You first need to modify config/default.yaml: OUTPUT_ROOT_DIR and LOG_ROOT_DIR must be set to your Google Drive directory, and the other paths like SOURCE_ROOT etc. should point to the dataset (raw) I shared. During training, a model checkpoint is saved every epoch, usually every twenty minutes, though it can take longer. So if Colab auto-shuts down, you can continue training from the last checkpoint. If you run into any problems, just let me know. It will be a nice contribution.
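For reference, a minimal sketch of that edit, assuming the Drive is mounted at /content/drive/MyDrive; the DIR section name and all paths below are assumptions, not taken from the repo, so match them to your own config and folder layout:

```yaml
# config/default.yaml (excerpt); section name and paths are placeholders.
DIR:
    OUTPUT_ROOT_DIR : /content/drive/MyDrive/nafp/checkpoint/  # survives Colab resets
    LOG_ROOT_DIR : /content/drive/MyDrive/nafp/logs/
    SOURCE_ROOT : /content/drive/MyDrive/nafp-dataset/         # the shared (raw) dataset
```

Keeping the checkpoint and log directories on Drive is what makes resuming after an auto-shutdown possible.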

About sharing a trained model: yes, I can. The plan is to write a one-page Colab demo that loads it for the next update. But if you want to try it early, here is the link.

I really welcome feedback from Colab users. I feel that is the way for this open project to go.

mimbres commented 3 years ago

I am wondering if it is possible to install faiss (required for constructing the search engine) smoothly in Colab. I've never tried it. It is also an important prerequisite for developing the Colab demo. I'll test it out a bit tonight.

Mihonarium commented 3 years ago

I was able to run the training process in Colab with Miniconda, but just installing the requirements without Miniconda leads to an error. #12 should fix it.

Restoring from that checkpoint doesn't work for some reason. It outputs a long list of messages like

```
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'v' for (root).model.div_enc.split_fc_layers.124.layer_with_weights-0.bias
```

for all the layers, weights, etc., and this warning at the end:

```
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details
```

mimbres commented 3 years ago

@Mihonarium Thanks for the report. Yes, it seems we don't need conda for Colab; plain pip install works smoothly. Installation of faiss-gpu was super smooth too: `!pip install faiss-gpu`.
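For anyone double-checking the install, a minimal smoke test; the dimension and data are arbitrary and nothing here comes from the repo:

```python
import numpy as np
import faiss  # installed with: !pip install faiss-gpu

d = 128                                             # arbitrary embedding dimension
xb = np.random.random((1000, d)).astype('float32')  # fake database fingerprints
index = faiss.IndexFlatL2(d)                        # exact L2 search is enough for a sanity check
index.add(xb)
distances, ids = index.search(xb[:5], 1)            # query with items already in the index
print(ids.ravel())                                  # expect: [0 1 2 3 4]
```

Each query vector should find itself as its own nearest neighbor, so the printed ids confirm the library works end to end.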

About your checkpoint loading issue, let me ask: did you specify the config with the `-c` option?

Mihonarium commented 3 years ago

Yes, I did specify the config.

What's even stranger, the issue with all the warnings appears only with `run.py train` and doesn't appear with `generate`.

The notebook: https://gist.github.com/Mihonarium/e3fd355cb560b82373fd2186139f1bc2 (the last cells show that generate and training from scratch work).

mimbres commented 3 years ago

@Mihonarium Oh, it is expected behavior, as I wrote above. The checkpoint file contains the optimizer's state, which is GPU-device dependent. So if you want to continue training using my checkpoint as initial parameters, it's possible, but I didn't consider such a use case. It requires loading the model without connecting the optimizer first (as in generate), then initializing the optimizer and starting training.
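A minimal sketch of that order of operations in TensorFlow; the stand-in model, checkpoint path, and optimizer below are placeholders, not the repo's actual API:

```python
import tensorflow as tf

# Placeholder model and path; the real fingerprinter and checkpoint
# location come from the repo and your config.
model = tf.keras.Sequential([tf.keras.layers.Dense(128)])
ckpt = tf.train.Checkpoint(model=model)          # note: no optimizer attached here
status = ckpt.restore('./checkpoint/640_lamb/ckpt-101')
status.expect_partial()                          # ignore the optimizer slots saved in the file

optimizer = tf.keras.optimizers.Adam(1e-4)       # fresh optimizer, then train as usual
```

This mirrors what `generate` does: the model weights load cleanly, and only the device-dependent optimizer state is dropped.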

mimbres commented 3 years ago

@Mihonarium About the training-from-scratch error: first, for a P100 GPU, I recommend

BSZ:
    TR_BATCH_SZ : 320    # Training batch size N must be an EVEN number.
    TR_N_ANCHOR : 160

You didn't get an out-of-memory error, though, and this is not related to your issue. I am now checking the CPU info of Colab. In the config, try:

DEVICE:
    CPU_N_WORKERS : 4 # 4 for minimal system. 8 is recommended.
    CPU_MAX_QUEUE : 10 # 10 for minimal system. 20 is recommended.

It depends on how many threads the system can handle. I will run it tomorrow.
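To pick those two values, it helps to see how many threads the runtime actually exposes; a quick check with the standard library:

```python
import os
print(os.cpu_count())  # standard Colab runtimes typically report 2
```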

Mihonarium commented 3 years ago

> it is expected behavior, as I wrote above. The checkpoint file contains the optimizer's state, which is GPU-device dependent.

Got it, makes sense. Thanks!

Training from scratch didn't give any errors; I interrupted it. I included it to show that the errors come from the checkpoint load (I didn't know it was the expected behavior) and not from something else. You're right, though; I would probably get an out-of-memory error if I trained for longer. I was actually able to train the model successfully with a batch size of 320.

Mihonarium commented 3 years ago

Got `unsupported operand type(s) for +: 'PosixPath' and 'str'` from line 306 of dataset.py when I tried to generate from a custom source.

mimbres commented 3 years ago

@Mihonarium Solved by removing pathlib for `argin`. Also fixed the same issue for the `--output` option.
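For anyone hitting the same error, a minimal reproduction of the failure mode; the path below is made up, and this is not the actual code from dataset.py:

```python
from pathlib import Path

source = Path('/content/my_audio')  # a user-supplied source path
# source + '/*.wav'                 # TypeError: unsupported operand type(s) for +: 'PosixPath' and 'str'

# Either keep the argument a plain string from the start...
files = str(source) + '/*.wav'
# ...or stay in pathlib, join with the / operator, and convert at the end.
files = str(source / '*.wav')
```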

TheMightyRaider commented 3 years ago

@mimbres @Mihonarium Is it possible for you guys to share the trained model? It's quite hard to train with 320 as the batch size. :crossed_fingers:

Mihonarium commented 3 years ago

@TheMightyRaider the trained model is available here

TheMightyRaider commented 3 years ago

Thanks! @Mihonarium

haha010508 commented 1 year ago

I used the pretrained model and the same database (Dataset-mini) for the evaluation step, but I got a very poor result, and I want to know why. This is my command and output:

```
CUDA_VISIBLE_DEVICES=1 python run.py evaluate 640_lamb 101.index -c 640_lamb
cli: Configuration from ./config/640_lamb.yaml
Load 29,500 items from ./logs/emb/640_lamb/101.index/query.mm.
Load 29,500 items from ./logs/emb/640_lamb/101.index/db.mm.
Load 581,922 items from ./logs/emb/640_lamb/101.index/dummy_db.mm.
Creating index: ivfpq
Copy index to GPU.
Training index... Elapsed time: 23.07 seconds.
581922 items from dummy DB
29500 items from reference DB
Added total 611422 items to DB. 2.25 sec.
Created fake_recon_index, total 611422 items. 0.04 sec.
test_id: icassp, n_test: 2000

========= Top1 hit rate (%) of segment-level search =========
               ---------------- Query length ----------------
segments          1       3       5       9      11      19
seconds         (1s)    (2s)    (3s)    (5s)    (6s)   (10s)
Top1 exact      3.75    5.90    6.45    7.25    7.25    7.80
Top1 near       4.00    6.15    6.70    7.30    7.30    7.80
Top3 exact      4.40    7.00    7.85    8.60    8.45    8.95
Top10 exact     5.40    8.35    9.40   10.90   11.15   10.90

average search + evaluation time 7.25 ms/query
Saved test_ids and raw score to ./logs/emb/640_lamb/101.index/.
```

Do I need to retrain?