Evaluation with custom dataset

mhmd-mst commented 1 year ago

Hello, I am using a pretrained model that you mentioned in one of your issues, and I want to use it on my own dataset. I modified the dataset class:

def get_custom_db_ds(self, source_root_dir):
    """ Construct DB (or query) from custom source files. """

    query_path = sorted(
        glob.glob(source_root_dir + 'query/*.wav'))
    db_path = sorted(
        glob.glob(source_root_dir + 'db/*.wav'))

    _ts_n_anchor = self.ts_batch_sz # Only anchors...

    query_ds = genUnbalSequence(
        query_path,  # query files feed the query dataset
        self.ts_batch_sz,
        _ts_n_anchor,
        self.dur,
        self.hop,
        self.fs,
        shuffle=False,
        random_offset_anchor=False,
        drop_the_last_non_full_batch=False) # No augmentations, No drop-samples.
    db_ds = genUnbalSequence(
        db_path,  # DB files feed the DB dataset
        self.ts_batch_sz,
        _ts_n_anchor,
        self.dur,
        self.hop,
        self.fs,
        shuffle=False,
        random_offset_anchor=False,
        drop_the_last_non_full_batch=False)
    return query_ds, db_ds

and the get_data_source function in generate.py:

def get_data_source(cfg, source_root_dir, skip_dummy):
    dataset = Dataset(cfg)
    ds = dict()
    if source_root_dir:
        ds['custom_source'] = dataset.get_custom_db_ds(source_root_dir)
    else:
        if skip_dummy:
            tf.print("Excluding \033[33m'dummy_db'\033[0m from source.")
            pass
        else:
            ds['dummy_db'] = dataset.get_test_dummy_db_ds()

        if dataset.datasel_test_query_db in ['unseen_icassp', 'unseen_syn']:
            ds['query'], ds['db'] = dataset.get_test_query_db_ds()
        elif dataset.datasel_test_query_db == 'custom_dataset':
            ds['query'], ds['db'] = dataset.get_custom_db_ds(cfg['DIR']['SOURCE_ROOT_DIR'])
        else:
            raise ValueError(dataset.datasel_test_query_db)

    tf.print(f'\x1b[1;32mData source: {ds.keys()}\x1b[0m',
             f'{dataset.datasel_test_query_db}')
    return ds

I am using a sample of 3 DB audio files and 2 query audio files.

1. Generating the fingerprints gave a warning that the size does not match. Note that I am using skip_dummy since I am using my own dataset. Is this fine?
2. Regarding search and evaluation, part of the code depends on the dummy_db, for example for training the faiss index. How should I adapt it for my case?
3. Does the test_seq_len argument set how many segments from the query are matched against the DB?
4. Can this model tell me at which timestamp in the top-1 detected DB audio the query starts?
5. If a query contains parts from two database audios, is it possible to get a probability score, so that I can apply a threshold to detect whether it comes from more than one DB audio?

Thanks in advance, and thanks for your great work.

mimbres commented 1 year ago

@mhmd-mst Hi, I'll answer briefly.

1. It should be fine; the warning probably came from the different batch size used for inference.
2. 3 audio files as DB? That is extremely small for constructing (and training) a faiss IVFPQ index. One option is to use the L2 index (selectable in eval_faiss.py), which doesn't require index training. A second option is to use our dataset for training the index; as long as your audio is music, it will work. A third option is simply not to use faiss: you can directly calculate the distance between the fingerprints, as in (https://github.com/mimbres/neural-audio-fp/blob/058d812df3787a7e000c6f595e200fd2e15ee348/model/utils/mini_search_subroutines.py#L123)
3. Yes.
4. Yes, if you have another table that maps the faiss index (or any index, e.g. list, dict) to the file name and the timestamp within the file. You have to build it yourself.
5. Yes, I think it makes sense, and it would be useful for tuning the system in scalable/parallel search scenarios. But it has not been implemented in this project.
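
The table that maps a fingerprint index to a file name and timestamp (see the answer above) could look roughly like the sketch below. It assumes that fingerprints are stored file by file in one array and that consecutive fingerprints are 0.5 s apart. `build_offset_table`, `locate`, `HOP_SEC`, and the file names are hypothetical and not part of neural-audio-fp.

```python
from bisect import bisect_right

HOP_SEC = 0.5  # ASSUMPTION: hop between consecutive fingerprints, in seconds


def build_offset_table(files_and_counts):
    """Return (names, starts), where starts[i] is the global index of the
    first fingerprint of names[i]. Assumes fingerprints are stored file by file."""
    names, starts, total = [], [], 0
    for name, n_fp in files_and_counts:
        names.append(name)
        starts.append(total)
        total += n_fp
    return names, starts


def locate(global_idx, names, starts, hop_sec=HOP_SEC):
    """Map a global fingerprint index to (file name, start time in seconds).
    No upper-bound check: the caller must pass an index that exists."""
    file_pos = bisect_right(starts, global_idx) - 1  # last file starting at or before idx
    local_idx = global_idx - starts[file_pos]        # position within that file
    return names[file_pos], local_idx * hop_sec


# Hypothetical DB: 3 files with 4, 3 and 5 fingerprints.
names, starts = build_offset_table([("a.wav", 4), ("b.wav", 3), ("c.wav", 5)])
print(locate(5, names, starts))  # index 5 is the second fingerprint of b.wav
```

For a large DB the same mapping could live in a database table instead of two Python lists; the lookup logic stays the same.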

mhmd-mst commented 1 year ago

Thanks for your reply.

Regarding your first reply: should the number of queries and DB items be the same in order to have a matched batch size?
For the custom case, my dataset contains audio from clips, so it is not necessarily music only. Can I get rid of every dummy_db part in the code? In that case, should I remove this part in eval_faiss:

    fake_recon_index, index_shape = load_memmap_data(
        emb_dummy_dir, 'dummy_db', append_extra_length=query_shape[0], display=False)
    fake_recon_index[dummy_db_shape[0]:dummy_db_shape[0] + query_shape[0], :] = db[:, :]

Also, append_extra_length=query_shape[0] would cause a problem if the batch sizes of DB and query don't match, right?
I also have a bigger database; I am only using a small one for debugging. If I use my full DB, should I use it to train the faiss index?
Regarding your fourth reply: after I evaluate the fingerprints and get an index for a query, I can use this index to get the file name. Should I then match, say, the first segment of the query against all segments of the top-1 DB file and use the matched segment to find the timestamp?

Thanks again for your help.

mimbres commented 1 year ago

No, batch-size is just about the in/output shape of our NN fingerprinter. The batch size can be set depending on your machine type (GPU/CPU) and its memory. It is just about speed of inference (generating fingerprints). The test batch size will not affect the search method and performance. Also there is no requirement to use the same batch size for queries and DB generation. Fingerprint dimension (d) should match though.
Yes. If you don't have dummies, then
```
fake_recon_index = db
```
The query_shape[0], dummy_db_shape[0], or data_shape[0] are all about the number of fingerprints you have. In other words, it is data length, not a batch size.

FYI, #17 explains the search process for sequence query.

For debugging with a small query set and DB, my suggestion:
1) As explained in #17 and mini_search_subroutines.py, you can directly perform the search and calculate the distance or likelihood between queries and DB items.
2) Next, try faiss search using the L2 index. Its output should exactly match the result of 1).
3) Finally, try the IVFPQ index. It requires training the index, which takes several minutes.

IVFPQ uses a quantized representation; it is a data-dependent, trainable quantization technique. Usually there is some performance drop when you use the quantized index, compared to 1) and 2). In practice it is a trade-off between speed and accuracy. This implies that if you have a sufficient amount of data (I can't give a number, because it also depends on the faiss parameters), you won't need to use our dummy data. If your data is small and is music, you can use the dummy data. If your data is small and not music (e.g. speech), you need a replacement for the dummy data from other sources to train the index. After the index is constructed, you can just remove the dummies, and it is also possible to store the trained index. I cannot give you more detailed information on this, but the issue board of the faiss project will help.
The faiss library itself has no such table for storing names as text; it is specialized for vector search only. Depending on the scale of your search DB, the table can be implemented as an additional Python dict() or with a more conventional database such as MySQL or Elasticsearch.
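
Step 1) of the debugging suggestion, a direct search without faiss, might be sketched as follows. This is a plain NumPy brute-force search, not the code from mini_search_subroutines.py. `brute_force_search` is a hypothetical helper, the random vectors stand in for real fingerprints, and unit-norm fingerprints are an assumption.

```python
import numpy as np


def brute_force_search(query_fps, db_fps, k=5):
    """Exact top-k search by squared L2 distance.

    query_fps: (n_query, d), db_fps: (n_db, d).
    Returns (distances, ids), each (n_query, k), nearest first.
    """
    k = min(k, len(db_fps))
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, for all query/DB pairs at once.
    d2 = (np.sum(query_fps ** 2, axis=1, keepdims=True)
          - 2.0 * query_fps @ db_fps.T
          + np.sum(db_fps ** 2, axis=1))
    ids = np.argsort(d2, axis=1)[:, :k]
    return np.take_along_axis(d2, ids, axis=1), ids


# Stand-in data: 50 random unit-norm "fingerprints" of dimension 128.
rng = np.random.default_rng(0)
db = rng.normal(size=(50, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # ASSUMPTION: fingerprints are unit-norm
# Two queries: slightly noisy copies of DB items 7 and 21.
query = db[[7, 21]] + 0.01 * rng.normal(size=(2, 128)).astype(np.float32)

dist, ids = brute_force_search(query, db, k=3)
print(ids[:, 0])  # nearest DB id for each query
```

Running a flat (non-quantized) L2 index over the same arrays should return the same ids, which is the comparison described in step 2).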

mhmd-mst commented 1 year ago

In the code I noticed that if, for example, the query set contains 3 audio files, the query.mm file will hold the fps of all of them. Suppose query_shape is (9, 128), i.e. 3 fps for each audio. If I choose test_seq_len as '3', the np.dot() used in the enumerate(candidates) loop will not be as meaningful as in the case of #17, because the fps may now belong to different queries. Am I right in this interpretation? Also, does this part, pred_ids = candidates[np.argsort(-_scores)[:10]], give the top 10 most likely ids for the first fp in the query considered?

mimbres commented 1 year ago

@mhmd-mst 3 audio files with 3 fps per file gives fps.shape = (9, 128). So, fps[0:3] is your first file.

test_ids = [0, 3, 6] # start_id of query sequence from each file

With these test_ids, the enumerate(candidates) loop is still important, unless the sequence length is 1. The loop averages the likelihood between sequences over highly plausible sequential combinations of candidates ([1,2,3] vs. [2,3,4] vs. ..., but not [3,2,1] or [5,1,3]). By the way, the default k_probe of 20 is also too large for your setup. It means that the 1s-segment-level search results for each fp (e.g. fps[0], not a sequence-level search), up to rank 20, will be considered. I am not sure whether this would cause errors, though.

> Also, does this part, pred_ids = candidates[np.argsort(-_scores)[:10]], give the top 10 most likely ids for the first fp in the query considered?

Yes. If you use test_ids = [0, 3, 6] with seq_len=3, then in the first iteration of the loop, pred_ids will contain the top-10 3s-sequence-search results for the first query sequence (fps[0], fps[1], fps[2]), starting at test_id=0.
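
The sequence-level scoring described above can be illustrated with a small NumPy sketch. It is a simplified stand-in for the loop in eval_faiss.py, not a copy of it: the segment-level search is done by brute force instead of faiss, `sequence_search` is a hypothetical helper, the fingerprints are random unit vectors, and file boundaries inside the DB are not checked.

```python
import numpy as np


def sequence_search(query_seq, db_fps, k_probe=3, top_n=10):
    """Rank DB start indices for one query sequence of shape (seq_len, d)."""
    seq_len = len(query_seq)
    k_probe = min(k_probe, len(db_fps))
    # Segment-level search: top-k_probe DB ids for each query fingerprint.
    sims = query_seq @ db_fps.T                       # (seq_len, n_db)
    top_ids = np.argsort(-sims, axis=1)[:, :k_probe]  # (seq_len, k_probe)
    # Offset compensation: a hit for the i-th query segment implies that the
    # matching sequence would start i positions earlier in the DB.
    starts = top_ids - np.arange(seq_len)[:, None]
    # Keep only starts for which the whole sequence fits inside the DB.
    valid = (starts >= 0) & (starts + seq_len <= len(db_fps))
    candidates = np.unique(starts[valid])
    # Sequence-level score: mean inner product of the aligned segment pairs.
    scores = np.array([
        np.mean(np.sum(query_seq * db_fps[c:c + seq_len], axis=1))
        for c in candidates
    ])
    order = np.argsort(-scores)[:top_n]
    return candidates[order], scores[order]


# Setup from the thread: 3 files x 3 fingerprints, stored back to back.
rng = np.random.default_rng(0)
fps = rng.normal(size=(9, 128))
fps /= np.linalg.norm(fps, axis=1, keepdims=True)  # ASSUMPTION: unit-norm fingerprints
test_ids = [0, 3, 6]  # start id of the query sequence from each file
seq_len = 3

for test_id in test_ids:
    q = fps[test_id:test_id + seq_len]  # query sequence, searched against itself here
    pred_ids, pred_scores = sequence_search(q, fps)
    print(test_id, pred_ids[0], round(float(pred_scores[0]), 3))
```

Because the queries here are exact copies of DB sequences, the top prediction for each test_id is that same start index. With real recordings the scores would be lower and k_probe would matter more.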

mhmd-mst commented 1 year ago

Hello. Regarding the fingerprint of a segment itself, how unique is it? If I have a very large DB, could two different segments end up with the same fingerprint?

mimbres commented 1 year ago

It depends on the dimension of the fingerprint and on how you define "the same segment". During training, every segment is pushed to be represented differently from all other segments, and the larger the dimension, the more unique the fingerprint becomes.

mimbres / neural-audio-fp

Evaluation with custom dataset #38